Sitemap strategy for large sites

Google first introduced Sitemaps in June 2005; MSN and Yahoo announced support for the protocol in November 2006, and Ask.com followed in April 2007.

By now you probably know what a sitemap is, but what about the sitemap.xml of a large site with 5 million indexed pages that adds more than 3,000 pages daily? A simple sitemap will not do here; a different approach is required, and that is what I want to lay on the table.

Make it easy for search engines

If your site has no sitemap.xml file listing all the pages to crawl, they will still be indexed as long as the internal link structure is done properly or external links point to them. But why not make it easy for search engines to crawl your site? Quite obvious.

Sitemaps are always beneficial, but I would say they are 'required' in these cases:

  • The content or website is new and you want crawlers to get it all ASAP for a first indexing
  • The amount of content is huge and changes very frequently, as in online newspapers, auction, shopping, classifieds, events sites and so on
  • A radical URL structure change across the whole site, especially when getting rid of subdomains and moving content to directories*

*A subdomain strategy is a waste of relevance in 99% of cases unless you are somebody like ebay.com (140 million pages indexed) or amazon.com (104 million).

Sitemaps have some limits

Although sitemap files can be compressed using gzip to reduce bandwidth consumption, they have a limit of 50,000 URLs and 10 megabytes (measured uncompressed) per sitemap.

You will hardly cross this line unless your site is one of the big ones, but multiple sitemap files are supported: a sitemap index file serves as the entry point for up to 1,000 sitemaps.
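As a rough illustration of the limits above, here is a minimal Python sketch that splits a URL list into files of at most 50,000 URLs, gzips each one and writes a sitemap index pointing at them. The file naming pattern (sitemap_0001.xml.gz), the example.org base URL and the function names are my own assumptions for the example, not something any of the sites analysed here is known to use.

```python
import gzip
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted in the post
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def sitemap_xml(urls):
    """Render one <urlset> document containing only <loc> entries."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def sitemap_index_xml(sitemap_urls, lastmod):
    """Render a <sitemapindex> with a <lastmod> for each sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc><lastmod>{lastmod}</lastmod></sitemap>\n"
        for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


def build(urls, base="http://www.example.org/", lastmod="2009-06-02"):
    """Return {filename: bytes} for every sitemap file plus the index.

    `base` and the file naming pattern are hypothetical.
    """
    files = {}
    for n, part in enumerate(chunk(urls), start=1):
        name = f"sitemap_{n:04d}.xml.gz"
        files[name] = gzip.compress(sitemap_xml(part).encode("utf-8"))
    # The index is added last, so `files` holds only sitemap names here.
    index = sitemap_index_xml([base + name for name in files], lastmod)
    files["sitemap_index.xml"] = index.encode("utf-8")
    return files
```

A real generator would also check the size limit of each file and stream URLs from the database instead of holding them all in memory; this sketch only covers the URL count.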

Let's check how the big guys do it.

Sitemap analysis of large sites


EBAY (Spain)

ebayanuncios.es/ean_sitemap_4_index.xml.gz (sitemap index for 3 sitemaps)
    ebayanuncios.es/ean_sitemap_4_global.xml.gz (categories, 5,770 URLs)
    ebayanuncios.es/ean_sitemap_4_1.xml.gz (content, 40,000 URLs)
    ebayanuncios.es/ean_sitemap_4_2.xml.gz (content, 40,000 URLs)

They use the <lastmod> tag in the sitemap index and <lastmod>, <changefreq> and <priority> in the sitemaps.

AMAZON

Amazon's robots.txt refers to 10 different sitemaps or sitemap indexes, some of them grouped by subject. They do not bother maintaining any tag other than <loc>: no <lastmod>, <changefreq> or <priority>.

TV series
amazon.com/sitemap-manual-tv.xml (500 URLs)

Music Artists
amazon.com/sitemap_artists_index.xml (sitemaps index for 5 sitemaps)
    amazon.com/sitemap_artists_0001.xml.gz (50,000 URLs)
    ...

Books
amazon.com/sitemap_index_2.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_2000.xml.gz (50,000 URLs)
    ...

Sports & Outdoors, Jewelry...
amazon.com/sitemap_backfill_dp_index.xml (sitemaps index for 800 sitemaps)
    amazon.com/sitemap_backfill_dp_0001.xml.gz
    ...

Search results? (don't know, need further investigation)
amazon.com/sitemap_index_1.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_0.xml.gz (50,000 URLs)
    ...

CNN

The focus is on helping search engines find the many new URLs added daily. Here they use only the <lastmod> tag.

cnn.com/sitemap_index.xml (sitemaps index for 36 sitemaps)
    cnn.com/sitemap_specials.xml (85 URLs)
    cnn.com/sitemap_month.xml (1,578 URLs)
    cnn.com/sitemap_week.xml (452 URLs)
    cnn.com/sitemap_topics_set_j_0.xml (3,700 URLs)
    ...

cnn.com/sitemap_news.xml (139 URLs)
News Sitemap-specific tags for online newspapers are in use here.
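If you want to repeat this kind of analysis on other sites, something like the following Python sketch could count the URLs in a sitemap and show which optional tags it uses. It assumes you have already downloaded (and, if needed, gunzipped) the XML; the function name summarize is just a label I picked.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize(xml_text):
    """Return (root kind, tag counts) for a sitemap or sitemap index.

    The root kind is 'urlset' or 'sitemapindex'; tag names are reported
    without their XML namespace prefix.
    """
    root = ET.fromstring(xml_text)
    kind = root.tag.rsplit("}", 1)[-1]
    tags = Counter(
        el.tag.rsplit("}", 1)[-1] for el in root.iter() if el is not root
    )
    return kind, tags
```

For a regular sitemap, tags["loc"] is the number of URLs listed, and a zero count for lastmod, changefreq or priority means the site does not maintain that tag.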

Sitemap strategy

OK, you have to create a sitemap for a large site, so take your time to think about:

  • How your content is organised
    Does the number of categories change a lot?
    Maybe a sitemap for the content structure is required: just category URLs, no content pages.
  • Fresh content creation frequency
    How many new URLs are created per period of time?
    Pick a cutoff to distinguish fresh content from older content. One day? One week? With a one-day cutoff, for example:
    First, create a sitemap just for the URLs created since today's midnight.
    Second, create a larger one containing the URLs from today's midnight backwards.
  • Too much content
    Does the number of URLs exceed the sitemap limits?
    Get around the limits by creating a sitemap index with more sitemaps, subdividing URLs by content type, category, time, or any other logical criterion.
  • Sitemap file creation
    For fresh content URLs, update the sitemaps dynamically on demand; it will consume few server resources.
    For the not-so-fresh URLs in the big sitemap files, a cron job or similar is enough, run as often as your freshness cutoff requires. It can be done in low-traffic hours, for example, so the server won't suffer so much.
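The one-day cutoff described in the list can be sketched in a few lines of Python. This assumes each page comes with a creation timestamp from your CMS as a naive datetime in server time; the (url, created_at) tuple layout and the function name are assumptions for the example.

```python
from datetime import datetime, time


def split_by_midnight(pages, now):
    """Split (url, created_at) pairs into fresh and archive URL lists.

    Fresh means created at or after midnight of the day `now` falls on;
    everything older goes to the archive list.
    """
    midnight = datetime.combine(now.date(), time.min)
    fresh, archive = [], []
    for url, created_at in pages:
        (fresh if created_at >= midnight else archive).append(url)
    return fresh, archive
```

The fresh list is small enough to regenerate on demand, while the archive list is the one a nightly cron job would split into the large sitemap files.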

Some things to take into account

Use the <lastmod> tag, especially in sitemap indexes, to let search engines know how fresh the changes are.

List only available URLs; it is evident but sometimes missed. Rand Fishkin describes several options for dealing with expired content (number 2 is my favourite).

Internal search engine driven sites: I would not create URLs for user searches, as they can be very inconsistent. Just structure/category and content pages.

If the website can be included in Google News, there are specific tags for the sitemap; use them. Warning: news sitemaps must not list URLs older than 72 hours.

Do your sitemap submission thing

Always add the path to robots.txt with a line like this:
Sitemap: http://www.example.org/sitemap.xml
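To double-check that the line is picked up, or to see which sitemaps another site declares, a short Python sketch with the standard library's robots.txt parser may help. It assumes Python 3.8 or later, where RobotFileParser.site_maps() is available, and takes the robots.txt content as a string you have already fetched.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_in_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []
```

An empty list means crawlers reading robots.txt will not learn about your sitemap from it.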

Google
Use Webmaster Tools to submit the sitemap

Yahoo.com http://developer.yahoo.com/search/
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://example.org/sitemap.xml

Live Search / Bing http://www.bing.com/webmaster/
http://www.bing.com/webmaster/ping.aspx?siteMap=http://example.org/sitemap.xml

Ask.com
http://submissions.ask.com/ping?sitemap=http://example.org/sitemap.xml
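When building these ping URLs from a script, the sitemap address is safer passed URL-encoded as the query parameter. The Python sketch below only builds the strings and does not send any request; the endpoints are copied from the list above as they were at the time of writing, so check that each one is still live before relying on it.

```python
from urllib.parse import quote

# Endpoints as listed in this post; they may have changed since.
PING_ENDPOINTS = {
    "yahoo": "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=",
    "bing": "http://www.bing.com/webmaster/ping.aspx?siteMap=",
    "ask": "http://submissions.ask.com/ping?sitemap=",
}


def ping_urls(sitemap_url):
    """Return {engine: ping URL} with the sitemap URL percent-encoded."""
    encoded = quote(sitemap_url, safe="")
    return {name: prefix + encoded for name, prefix in PING_ENDPOINTS.items()}
```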

As usual:
Any comments on this sitemap strategy?
Did I miss something or am I completely wrong?

Jun 02, 2009
Written by:
Filed under: SEO







12 comments
Jun 03, 2009
Posted by:
Radicke #1

I wonder if those large sites ever delete single articles from those sitemap structures or if they keep all old posts in there...

Especially for newspapers, sometimes you need to delete old articles, and searching for them deep in your sitemap index structure can be a bit tedious...

Jun 07, 2009
Posted by:
Ani Lopez #2

Hi David, thanks for coming by and commenting. There are a few options to manage 'expired content', but in a newspaper it is not expired; it becomes part of the library.

If content is too old, there is quite surely no reason to include it in the sitemap.xml files, but it should stay accessible through internal search or similar.

Anyway, I'll try to ask the ones I have some contact with.

Oct 08, 2010
Posted by:
Daan #3

Does anyone know what the maximum is of Sitemap Writer Pro? We are working with Magento on a webshop for 700.000 products. Can we use that tool for this?

Oct 26, 2010
Posted by:
BulkDog #4

I have a website with millions of links. I just have 2 .xml.gz sitemaps with the 50k latest URLs each, updated every 5 hours. If I create about 20 sitemaps, will it affect the indexed pages? For now Google has only indexed about 140k URLs.
Regards

Oct 27, 2010
Posted by:
Ani Lopez #5

Hi BulkDog,
Take into account that making all these URLs available for search engines to index, which is a must, is one thing; how many of these URLs Google decides to index is a different one.

The latter is directly related to the relevance your site has for Google. It was explained by Matt Cutts in one of his GoogleWebmasterHelp videos http://www.youtube.com/user/GoogleWebmasterHelp#grid/user/841CB8F9F31BF5D5

Cheers

Nov 05, 2010
Posted by:
Rakesh9 #6

So do those large websites just delete the expired urls from the existing sitemaps and update them???

Nov 07, 2010
Posted by:
Ani Lopez #7

Hi Rakesh9,
The XML sitemap strategy must reflect what has been decided for expired content, but that is a completely different story and a good topic for a new article.

Cheers

Jun 18, 2011
Posted by:
Mark Carter #8

Hi there ... many thanks for this. I'd also be very interested to see what you have to say on handling expired products and sitemap.xml's. For a large site this can be a considerable problem - especially if there is no automatic way to recreate the sitemap from the database directly.

Oct 23, 2011

Good stuff. Even though it's an 'old post' it's still valuable information!

Jan 19, 2012
Posted by:
Jay #10

Can anyone recommend a good 3rd party sitemap builder that will handle 10 million URLs? Also one that doesn't cost too much; we have been looking but have been unable to find a reliable and cost-effective provider. Thanks!

Jan 19, 2012
Posted by:
Ani Lopez #11

Hi Jay, it must be a huge site if it has 10 million URLs!
When you reach that level a 3rd party is never a solution, your CMS should be the one creating and updating the xml sitemap files you need.

Unfortunately I don't know any software that could crawl 10 million URLs and create the files for you but it could be done by directories to lower the numbers in smaller chunks.

Jan 27, 2012
Posted by:
nobody #12

Guys, how to create multiple sitemaps? any naming pattern?
cheers
