Sitemap strategy for large sites

Google first introduced Sitemaps in June 2005. MSN and Yahoo announced support for the protocol in November 2006, and Ask.com followed in April 2007.

You probably know what a sitemap is by now, but what about the sitemap.xml of a large site with 5 million indexed pages that adds more than 3k pages daily? A simple one will not do here; a different approach is required, and this is what I want to lay on the table.

Make it easy for search engines

If your site has no sitemap.xml file listing all the pages to crawl, they will still be indexed as long as the internal link structure is done properly or external links point to them. But why not make it easy for search engines to crawl your site? Quite obvious.

Sitemaps are always beneficial, but I would call them 'required' in these cases:

  • The content or website is new and you want crawlers to get it all ASAP for a first indexing
  • The amount of content is huge and changes very frequently, as with online newspapers, auction, shopping, classifieds and events sites, and so on
  • A radical URL structure change across the whole site, especially when getting rid of subdomains and moving content to directories*

*A subdomain strategy is a waste of relevance in 99% of cases unless you are somebody like ebay.com (140 million pages indexed) or amazon.com (104 million).

Sitemaps have some limits

Although sitemap files can be compressed with gzip to reduce bandwidth consumption, each sitemap is limited to 50,000 URLs and 10 megabytes (uncompressed).

You will hardly cross this line unless your site is one of the big ones. Even then, multiple sitemap files are supported: a sitemap index file serves as the entry point for up to 1,000 sitemaps.
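A minimal Python sketch of staying under the 50,000 URL limit: split the full URL list into chunks and write one gzipped sitemap per chunk. The function names and the `sitemap_` file prefix are made up for illustration, and the sketch does not check the size limit in megabytes.

```python
import gzip
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-sitemap URL limit quoted in this post


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return the XML text of one sitemap that lists only <loc> entries."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        # escape() turns & < > into entities, as XML requires
        lines.append("  <url><loc>%s</loc></url>" % escape(url))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def write_sitemaps(urls, prefix="sitemap_"):
    """Write one gzipped sitemap per chunk and return the file names."""
    names = []
    for number, chunk in enumerate(chunk_urls(urls), start=1):
        name = "%s%04d.xml.gz" % (prefix, number)
        with gzip.open(name, "wt", encoding="utf-8") as handle:
            handle.write(build_sitemap(chunk))
        names.append(name)
    return names
```

Calling `write_sitemaps(all_urls)` on a list of 120,001 URLs would produce three files, the last one holding the remaining 20,001 entries.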

Let's check how the big guys do it.

Analysis of large sites' sitemaps


EBAY (Spain)

ebayanuncios.es/ean_sitemap_4_index.xml.gz (sitemap index for 3 sitemaps)
    ebayanuncios.es/ean_sitemap_4_global.xml.gz (categories, 5,770 URLs)
    ebayanuncios.es/ean_sitemap_4_1.xml.gz (content, 40,000 URLs)
    ebayanuncios.es/ean_sitemap_4_2.xml.gz (content, 40,000 URLs)

They use the <lastmod> tag in the sitemap index and <lastmod>, <changefreq> and <priority> in the sitemaps.

AMAZON

Amazon's robots.txt refers to 10 different sitemaps or sitemap indexes, some of them grouped by subject. They do not bother to maintain any tag other than <loc>: no <lastmod>, <changefreq> or <priority>.

TV series
amazon.com/sitemap-manual-tv.xml (500 URLs)

Music Artists
amazon.com/sitemap_artists_index.xml (sitemap index for 5 sitemaps)
    amazon.com/sitemap_artists_0001.xml.gz (50,000 URLs)
    ...

Books
amazon.com/sitemap_index_2.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_2000.xml.gz (50,000 URLs)
    ...

Sports & Outdoors, Jewelry...
amazon.com/sitemap_backfill_dp_index.xml (sitemap index for 800 sitemaps)
    amazon.com/sitemap_backfill_dp_0001.xml.gz
    ...

Search results? (I am not sure; this needs further investigation)
amazon.com/sitemap_index_1.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_0.xml.gz (50,000 URLs)
    ...

CNN

CNN focuses on helping search engines find the many new URLs it adds daily. Here they use only the <lastmod> tag.

cnn.com/sitemap_index.xml (sitemap index for 36 sitemaps)
    cnn.com/sitemap_specials.xml (85 URLs)
    cnn.com/sitemap_month.xml (1,578 URLs)
    cnn.com/sitemap_week.xml (452 URLs)
    cnn.com/sitemap_topics_set_j_0.xml (3,700 URLs)
    ...

cnn.com/sitemap_news.xml (139 URLs)
This one uses the News Sitemap tags specific to online newspapers.
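URL counts like the ones listed in this analysis can be reproduced by parsing each file and counting its <loc> entries. This is a rough Python sketch that works on bytes you have already downloaded; the function names are hypothetical and no network access is shown.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by both sitemaps and sitemap indexes
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def decode_sitemap(data):
    """Accept raw or gzip-compressed sitemap bytes and return XML text."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    return data.decode("utf-8")


def list_locations(xml_text):
    """Return every <loc> value found in a sitemap or a sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def is_sitemap_index(xml_text):
    """Tell a sitemap index apart from a plain sitemap by its root tag."""
    return ET.fromstring(xml_text).tag == SITEMAP_NS + "sitemapindex"
```

For a sitemap index, `list_locations` returns the child sitemap URLs; for a plain sitemap, `len(list_locations(text))` is the number of listed pages.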

Sitemap strategy

OK, you have to create a sitemap for a large site, so take your time to think about:

  • How your content is organised
    Does the number of categories change a lot?
    Maybe a sitemap for the content structure is required: just category URLs, no content pages.
  • Fresh content creation frequency
    How many new URLs are created per period of time?
    Pick a cutoff to distinguish fresh content from older content. One day? One week? With a one-day cutoff, for example:
    First create a sitemap just for the URLs created since today's midnight.
    Then create a second, larger one containing the URLs from before today's midnight.
  • Too much content
    Is the number of URLs exceeding the sitemap limits?
    Get around the limits by creating a sitemap index with more sitemaps, subdividing URLs by content type, category, time, or any other logical criterion.
  • Sitemap file creation
    For fresh content URLs, update the sitemaps dynamically on demand; this consumes few server resources.
    For the not-so-fresh URLs in the big sitemap files, a cron job or similar is enough, run as often as your freshness cutoff requires. It can run during low-traffic hours, for example, so the server does not suffer so much.
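The one-day cutoff from the list above could look roughly like this in Python. It is only a sketch: `split_by_freshness` is a made-up name, and it assumes each page comes with a creation timestamp in the server's local time (naive datetimes).

```python
from datetime import datetime, time


def split_by_freshness(pages, now):
    """Split (url, created) pairs into fresh and older URL lists.

    'Fresh' means created at or after today's midnight, matching the
    one-day example in this post. Assumption: `created` and `now` are
    naive datetimes in the same local time zone.
    """
    cutoff = datetime.combine(now.date(), time.min)  # today's midnight
    fresh, older = [], []
    for url, created in pages:
        if created >= cutoff:
            fresh.append(url)
        else:
            older.append(url)
    return fresh, older
```

The fresh list is small, so its sitemap can be rebuilt on every request, while the older list feeds the big files rebuilt by the scheduled job.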

Some things to take into account

Use the <lastmod> tag, especially in sitemap indexes, to let search engines know how fresh the changes are.
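As an illustration, a sitemap index carrying <lastmod> for each child sitemap could be generated along these lines in Python. The function name is hypothetical, and the sketch assumes timezone-aware datetimes, which it converts to UTC before formatting.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def build_sitemap_index(entries):
    """Return sitemap index XML for (sitemap_url, last_modified) pairs.

    Assumption: last_modified is a timezone-aware datetime; it is
    written in UTC using the W3C datetime format.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url, modified in entries:
        stamp = modified.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
        lines.append("  <sitemap>")
        lines.append("    <loc>%s</loc>" % escape(url))
        lines.append("    <lastmod>%s</lastmod>" % stamp)
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"
```

With a <lastmod> per child sitemap, a crawler can skip the big archive files that have not changed and fetch only the fresh ones.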

List only available URLs; it is obvious but sometimes missed. Rand Fishkin describes several options for dealing with expired content (number 2 is my favourite).

Internal search engine driven sites: I would not create URLs for user searches, as they can be very inconsistent. List just structure/category and content pages.

If the website can be included in Google News there are specific tags for the sitemap, so use them. Warning: do not list URLs older than 72 hours in sitemaps for news sites.

Do your sitemap submission thing

Always add the path to robots.txt with a line like this:
Sitemap: http://www.example.org/sitemap.xml
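To double-check which sitemaps a robots.txt actually declares, the `Sitemap:` lines can be pulled out with a few lines of Python. This is a simple sketch working on the file's text; `sitemap_lines` is a made-up helper, not part of any library.

```python
def sitemap_lines(robots_txt):
    """Return the URLs declared with 'Sitemap:' in robots.txt content."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, separator, value = line.partition(":")
        # The directive name is matched case-insensitively
        if separator and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

Several `Sitemap:` lines may appear in one robots.txt, which is how a site can point to many sitemaps or sitemap indexes at once.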

Google
Use Webmaster Tools to submit the sitemap.

Yahoo.com http://developer.yahoo.com/search/
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://example.org/sitemap.xml

Live Search / Bing http://www.bing.com/webmaster/
http://www.bing.com/webmaster/ping.aspx?siteMap=http://example.org/sitemap.xml

Ask.com
http://submissions.ask.com/ping?sitemap=http://example.org/sitemap.xml
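When the sitemap address itself contains characters such as `?` or `&`, it should be URL-encoded before being appended to a ping URL. The Python sketch below only builds the ping URLs and sends nothing. The endpoints are copied from this post as written and may not respond any more; the dictionary and function names are invented for the example.

```python
from urllib.parse import urlencode

# Endpoint and query parameter name per engine, as listed in this post.
# Assumption: these are kept only as historical examples.
PING_ENDPOINTS = {
    "yahoo": ("http://search.yahooapis.com/SiteExplorerService/V1/ping", "sitemap"),
    "bing": ("http://www.bing.com/webmaster/ping.aspx", "siteMap"),
    "ask": ("http://submissions.ask.com/ping", "sitemap"),
}


def ping_url(engine, sitemap_url):
    """Build the ping URL for one engine with the sitemap URL encoded."""
    endpoint, parameter = PING_ENDPOINTS[engine]
    return endpoint + "?" + urlencode({parameter: sitemap_url})
```

A cron job that regenerates the sitemaps could call `ping_url` for each engine afterwards and request the resulting address.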

As usual:
Any comments on this sitemap strategy?
Did I miss something, or am I completely wrong?

Jun 02, 2009
Posted By: Ani López
Filed under: SEO
2 comments
Jun 03, 2009
Posted by:
Radicke #1

I wonder if those large sites ever delete single articles from those sitemap structures or if they keep all the old posts in there...

Especially for newspapers, sometimes you need to delete old articles, and searching for them deep in your sitemap index structure can be a bit tedious...

Jun 07, 2009
Posted by:
Ani Lopez #2

Hi David, thanks for coming and commenting. There are a few options to manage 'expired content', but in a newspaper it is not expired; it becomes part of the library.

If content is too old, there is most likely no reason to include it in the sitemap.xml files, but it should stay accessible through internal search or similar.

Anyway, I'll try to ask the ones I have some contact with.

