Sitemap strategy for large sites

Google introduced Sitemaps in June 2005. MSN and Yahoo announced support for the protocol in November 2006, and Ask.com followed in April 2007.

You probably know what a sitemap is by now, but what about the sitemap.xml of a large site with 5 million indexed pages that adds more than 3,000 pages a day? A single simple file will not do here; a different approach is required, and that is what I want to lay on the table.

Make it easy for search engines

If your site has no sitemap.xml file listing all the pages to crawl, they will still be indexed as long as the internal link structure is done properly or external links point to them. But why not make it easy for search engines to crawl your site? Quite obvious.

Sitemaps are always beneficial, but I would call them 'required' in these cases:

  • The content or website is new and you want crawlers to get it all ASAP for a first indexing
  • The amount of content is huge and changes very frequently, as with online newspapers, auctions, shopping, classifieds, events sites and so on
  • A radical URL structure change across the whole site, especially when getting rid of subdomains and moving content to directories*

*A subdomain strategy is a waste of relevance in 99% of cases unless you are somebody like ebay.com (140 million pages indexed) or amazon.com (104 million).

Sitemaps have some limits

Sitemap files can be compressed with gzip to reduce bandwidth consumption, but each sitemap is limited to 50,000 URLs and 10 megabytes (measured uncompressed).

You will hardly overstep this line unless your site is one of the big ones. If it is, multiple sitemap files are supported: a sitemap index file serves as the entry point for up to 1,000 sitemaps.
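
Splitting a long URL list into files that respect the URL limit can be sketched in a few lines of Python. This is a minimal illustration, not a production generator: the function names (chunk_urls, render_sitemap, render_sitemap_gz) are made up for the example, and it only enforces the 50,000 URL limit, not the 10 megabyte one.

```python
import gzip
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted in the text


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def render_sitemap(urls):
    """Render one <urlset> document containing only <loc> entries."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        # escape() turns &, < and > into XML entities
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def render_sitemap_gz(urls):
    """Return the gzip-compressed bytes of a sitemap (for a .xml.gz file)."""
    return gzip.compress(render_sitemap(urls).encode("utf-8"))
```

With 120,001 URLs, chunk_urls returns three lists, so you would write three .xml.gz files and list them in one sitemap index.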

Let's check how the big guys do it.

Sitemap analysis of large sites


EBAY (Spain)

ebayanuncios.es/ean_sitemap_4_index.xml.gz (sitemap index for 3 sitemaps)
    ebayanuncios.es/ean_sitemap_4_global.xml.gz (categories, 5,770 URLs)
    ebayanuncios.es/ean_sitemap_4_1.xml.gz (content, 40,000 URLs)
    ebayanuncios.es/ean_sitemap_4_2.xml.gz (content, 40,000 URLs)

They use the <lastmod> tag in the sitemap index and <lastmod>, <changefreq> and <priority> in the sitemaps.

AMAZON

Amazon's robots.txt refers to 10 different sitemaps or sitemap indexes, some of them grouped by subject. They do not bother maintaining any tag other than <loc>: no <lastmod>, <changefreq> or <priority>.

TV series
amazon.com/sitemap-manual-tv.xml (500 URLs)

Music Artists
amazon.com/sitemap_artists_index.xml (sitemap index for 5 sitemaps)
    amazon.com/sitemap_artists_0001.xml.gz (50,000 URLs)
    ...

Books
amazon.com/sitemap_index_2.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_2000.xml.gz (50,000 URLs)
    ...

Sports & Outdoors, Jewelry...
amazon.com/sitemap_backfill_dp_index.xml (sitemap index for 800 sitemaps)
    amazon.com/sitemap_backfill_dp_0001.xml.gz
    ...

Search results? (I don't know; this needs further investigation)
amazon.com/sitemap_index_1.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_0.xml.gz (50,000 URLs)
    ...

CNN

The focus here is on helping search engines find the many new URLs added daily. They use only the <lastmod> tag.

cnn.com/sitemap_index.xml (sitemap index for 36 sitemaps)
    cnn.com/sitemap_specials.xml (85 URLs)
    cnn.com/sitemap_month.xml (1,578 URLs)
    cnn.com/sitemap_week.xml (452 URLs)
    cnn.com/sitemap_topics_set_j_0.xml (3,700 URLs)
    ...

cnn.com/sitemap_news.xml (139 URLs)
This one uses the News Sitemap tags specific to online newspapers.

Sitemap strategy

OK, you have to create a sitemap for a large site, so take your time to think about:

  • How your content is organised
    Does the number of categories change a lot?
    Maybe a sitemap for the content structure is required: just category URLs, no content pages.
  • Fresh content creation frequency
    How many new URLs are created per period of time?
    Pick a cutoff that distinguishes fresh content from older content. One day? One week? With a one-day cutoff, for example:
    First create a sitemap just for the URLs created since today's midnight.
    Then a second, larger one containing the URLs from today's midnight backwards.
  • Too much content
    Does the number of URLs exceed the sitemap limits?
    Get past the limits by creating a sitemap index with more sitemaps, subdividing URLs by content type, category, time or any other logical criterion.
  • Sitemap file creation
    For fresh content URLs, update the sitemaps dynamically on demand; this consumes few server resources.
    For the not-so-fresh URLs in the big sitemap files, a cron job or similar is enough, run as often as your freshness cutoff requires. It can run during low-traffic hours, for example, so the server does not suffer so much.
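
The one-day cutoff from the list above could look like this in Python. It is only a sketch: split_by_midnight and the (url, created_at) pairs are assumptions made for the example, and how you obtain creation times depends on your own database.

```python
from datetime import datetime, time


def split_by_midnight(pages, now):
    """Split (url, created_at) pairs into fresh and older URL lists.

    "Fresh" means created at or after today's midnight, where "today"
    is taken from `now`. Both datetimes are assumed to be naive and in
    the same timezone.
    """
    midnight = datetime.combine(now.date(), time.min)
    fresh, older = [], []
    for url, created_at in pages:
        (fresh if created_at >= midnight else older).append(url)
    return fresh, older
```

The fresh list feeds the small sitemap regenerated on demand; the older list feeds the big files rebuilt by the cron job.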

Some things to take into account

Use the <lastmod> tag, especially in sitemap indexes, to let search engines know how fresh the changes are.
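
As an illustration, a sitemap index carrying <lastmod> for each file could be rendered like this. The function name render_sitemap_index and the example.org file names are hypothetical; the element names and namespace are the ones defined by the sitemap protocol.

```python
from datetime import date
from xml.sax.saxutils import escape


def render_sitemap_index(entries):
    """Render a <sitemapindex> from (sitemap_url, last_modified) pairs.

    `last_modified` is a datetime.date; isoformat() gives YYYY-MM-DD,
    one of the W3C Datetime forms the sitemap protocol accepts.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for loc, last_modified in entries:
        lines.append("  <sitemap>")
        lines.append(f"    <loc>{escape(loc)}</loc>")
        lines.append(f"    <lastmod>{last_modified.isoformat()}</lastmod>")
        lines.append("  </sitemap>")
    lines.append("</sitemapindex>")
    return "\n".join(lines) + "\n"


index_xml = render_sitemap_index([
    ("http://www.example.org/sitemap_fresh.xml.gz", date(2009, 6, 2)),
    ("http://www.example.org/sitemap_archive_0001.xml.gz", date(2009, 5, 1)),
])
```

A crawler reading this index can skip the archive file if it has already fetched it since the listed date.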

List only available URLs. It is evident but sometimes missed. Rand Fishkin describes several options for dealing with expired content (number 2 is my favourite).

Internal search engine driven sites: I would not create URLs for user searches, as they can be very inconsistent. List just structure/category pages and content pages.

If the website can be included in Google News, there are specific tags for the sitemap; use them. Warning: do not list URLs older than 72 hours in sitemaps for news sites.

Do your sitemap submission thing

Always add the path to robots.txt with a line like this:
Sitemap: http://www.example.org/sitemap.xml

Google
Use Webmaster Tools to submit the sitemap

Yahoo.com http://developer.yahoo.com/search/
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://example.org/sitemap.xml

Live Search / Bing http://www.bing.com/webmaster/
http://www.bing.com/webmaster/ping.aspx?siteMap=http://example.org/sitemap.xml

Ask.com
http://submissions.ask.com/ping?sitemap=http://example.org/sitemap.xml
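
If you want to script the pings, a small helper can build the URLs for you. This is a sketch under assumptions: ping_urls is a made-up name, the endpoints are simply the ones listed above and may have been retired since this post was written, and the code only builds the URLs without sending any request. Unlike the plain examples above, it percent-encodes the sitemap address, which is the safe way to pass a URL as a query parameter.

```python
from urllib.parse import urlencode

# (endpoint, query parameter name) pairs copied from the list above
PING_ENDPOINTS = {
    "yahoo": ("http://search.yahooapis.com/SiteExplorerService/V1/ping", "sitemap"),
    "bing": ("http://www.bing.com/webmaster/ping.aspx", "siteMap"),
    "ask": ("http://submissions.ask.com/ping", "sitemap"),
}


def ping_urls(sitemap_url):
    """Return a dict of engine name -> ping URL for the given sitemap."""
    return {
        name: f"{endpoint}?{urlencode({param: sitemap_url})}"
        for name, (endpoint, param) in PING_ENDPOINTS.items()
    }
```

Each resulting URL can then be requested once after the sitemap changes, for example from the same cron job that rebuilds the files.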

As usual:
Any comments on this sitemap strategy?
Did I miss something or am I completely wrong?

Jun 02, 2009
Posted By: Ani López
Filed under: SEO
3 comments
Jun 03, 2009
Posted by:
Radicke #1

I wonder if those large sites ever delete single articles from those sitemap structures or if they keep all old posts in there...

Especially for newspapers, sometimes you need to delete old articles, and searching for them deep in your sitemap index structure can be a bit tedious...

Jun 07, 2009
Posted by:
Ani Lopez #2

Hi David, thanks for coming by and commenting. There are a few options to manage 'expired content', but in a newspaper it is not expired; it becomes part of the library.

If content is too old, there is quite surely no reason to include it in sitemap.xml files, but it should stay accessible through internal search or similar.

Anyway, I'll try to ask the ones I have some contact with.

May 28, 2010
Posted by:
Art #3

Try Sitemap Writer Pro.
It's the most convenient tool for making large sitemaps, including sitemap index files.
