Sitemap strategy for large sites

Google first introduced Sitemaps in June 2005; MSN and Yahoo announced support for the protocol in November 2006, and Ask.com followed in April 2007.

You probably know what a sitemap is by now, but what about the sitemap.xml of a large site with 5 million indexed pages that adds more than 3,000 new pages daily? A single simple file won't do here; a different approach is required, and that is what I want to lay on the table.

Make it easy for search engines

If your site has no sitemap.xml file listing the pages to crawl, those pages will still be indexed as long as the internal link structure is done properly or external links point to them. But why not make it easy for search engines to crawl your site? It seems quite obvious.

Sitemaps are always beneficial, but I would say they are 'required' in these cases:

  • The content or website is new and you want crawlers to pick it all up as soon as possible for a first indexing
  • The amount of content is huge and changes very frequently, as with online newspapers, auction, shopping, classifieds or events sites
  • A radical URL structure change across the whole site, especially when getting rid of subdomains and moving content into directories*

*A subdomain strategy is a waste of relevance in 99% of cases unless you are somebody like ebay.com (140 million pages indexed) or amazon.com (104 million).

Sitemaps have some limits

Sitemap files can be compressed with gzip to reduce bandwidth consumption, but each sitemap is limited to 50,000 URLs and 10 megabytes.

You will hardly overstep this line unless your site is one of the big ones, but multiple sitemap files are supported, with a sitemap index file serving as the entry point for a total of 1,000 sitemaps.
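
For reference, a minimal sitemap index file following the sitemaps.org protocol looks like this (the example.org URLs and dates are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap>
        <loc>http://www.example.org/sitemap_1.xml.gz</loc>
        <lastmod>2009-06-01</lastmod>
      </sitemap>
      <sitemap>
        <loc>http://www.example.org/sitemap_2.xml.gz</loc>
        <lastmod>2009-05-15</lastmod>
      </sitemap>
    </sitemapindex>

Each <loc> points to one of the regular sitemap files; search engines fetch the index first and then each listed sitemap.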

Let's check how the big guys do it.

Large-site sitemap analysis


EBAY (Spain)

ebayanuncios.es/ean_sitemap_4_index.xml.gz (sitemap index for 3 sitemaps)
    ebayanuncios.es/ean_sitemap_4_global.xml.gz (categories, 5,770 URLs)
    ebayanuncios.es/ean_sitemap_4_1.xml.gz (content, 40,000 URLs)
    ebayanuncios.es/ean_sitemap_4_2.xml.gz (content, 40,000 URLs)

They use the <lastmod> tag in the sitemap index and <lastmod>, <changefreq> and <priority> in the sitemaps.
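
For reference, a sitemap entry carrying all three optional tags looks like this (placeholder URL, illustrative values):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.org/some-category/some-listing</loc>
        <lastmod>2009-06-01</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>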

AMAZON

Amazon's robots.txt refers to 10 different sitemaps or sitemap indexes, some grouped by subject. They do not bother maintaining any tag other than <loc>: no <lastmod>, <changefreq> or <priority>.

TV series
amazon.com/sitemap-manual-tv.xml (500 URLs)

Music Artists
amazon.com/sitemap_artists_index.xml (sitemap index for 5 sitemaps)
    amazon.com/sitemap_artists_0001.xml.gz (50,000 URLs)
    ...

Books
amazon.com/sitemap_index_2.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_2000.xml.gz (50,000 URLs)
    ...

Sports & Outdoors, Jewelry...
amazon.com/sitemap_backfill_dp_index.xml (sitemap index for 800 sitemaps)
    amazon.com/sitemap_backfill_dp_0001.xml.gz
    ...

Search results? (not sure, this needs further investigation)
amazon.com/sitemap_index_1.xml (sitemap index for 1,000 sitemaps)
    amazon.com/sitemap_page_0.xml.gz (50,000 URLs)
    ...

CNN

CNN's setup focuses on helping search engines find the many new URLs added daily. Here they use only the <lastmod> tag.

cnn.com/sitemap_index.xml (sitemap index for 36 sitemaps)
    cnn.com/sitemap_specials.xml (85 URLs)
    cnn.com/sitemap_month.xml (1,578 URLs)
    cnn.com/sitemap_week.xml (452 URLs)
    cnn.com/sitemap_topics_set_j_0.xml (3,700 URLs)
    ...

cnn.com/sitemap_news.xml (139 URLs)
The News Sitemap-specific tags for online newspapers are in use here.

Sitemap strategy

OK, you have to create a sitemap for a large site, so take your time and think about:

  • How your content is organised
    Does the number of categories change a lot?
    Maybe a dedicated sitemap for the content structure is required: just category URLs, no content pages.
  • Fresh content creation frequency
    How many new URLs are created per period of time?
    Decide on a cut-off to separate fresh content from older content. One day? One week? With a one-day cut-off, for example:
    first create a sitemap just for the URLs published since today's midnight,
    then a second, much larger one containing the URLs from today's midnight backwards.
  • Too much content
    Does the number of URLs exceed the sitemap limits?
    Get around them by creating a larger sitemap index with more sitemaps, subdividing URLs by content type, category, time, or any other logical criterion.
  • Sitemap file creation
    For fresh content URLs, generate the sitemaps dynamically on demand; it will consume few server resources.
    For the not-so-fresh URLs in the big sitemap files, a cron job or similar is enough, run as often as your fresh/old cut-off requires. It can run during low-traffic hours, for example, so the server won't suffer so much (see the sketch after this list).
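
A minimal generation sketch of the cron-job part, assuming Python and a hypothetical 'pages' table with 'url' and 'updated_at' columns in a SQLite database; the table, file names and example.org domain are placeholders to adapt to your own CMS:

    # Rebuild all sitemap files plus the sitemap index from the database.
    # Hypothetical schema: pages(url TEXT, updated_at TEXT).
    import gzip
    import sqlite3
    from datetime import datetime
    from xml.sax.saxutils import escape

    MAX_URLS = 50000                    # protocol limit per sitemap file
    BASE = "http://www.example.org"     # placeholder domain

    def write_sitemap(path, rows):
        """Write one gzipped sitemap file for a chunk of (url, lastmod) rows."""
        with gzip.open(path, "wt", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url, lastmod in rows:
                f.write("  <url><loc>%s</loc><lastmod>%s</lastmod></url>\n"
                        % (escape(url), lastmod))
            f.write("</urlset>\n")

    conn = sqlite3.connect("site.db")
    rows = conn.execute(
        "SELECT url, date(updated_at) FROM pages ORDER BY updated_at DESC").fetchall()

    # Split into 50,000-URL chunks, one gzipped sitemap per chunk.
    files = []
    for i in range(0, len(rows), MAX_URLS):
        name = "sitemap_%04d.xml.gz" % (i // MAX_URLS + 1)
        write_sitemap(name, rows[i:i + MAX_URLS])
        files.append(name)

    # Sitemap index pointing at every chunk, <lastmod> set to the rebuild date.
    today = datetime.utcnow().strftime("%Y-%m-%d")
    with open("sitemap_index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in files:
            f.write("  <sitemap><loc>%s/%s</loc><lastmod>%s</lastmod></sitemap>\n"
                    % (BASE, name, today))
        f.write("</sitemapindex>\n")

The fresh-content sitemap is the same idea served dynamically: query only the URLs published since the cut-off instead of the whole table, and return the XML on demand.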

Some things to take into account

Use the <lastmod> tag, especially in sitemap indexes, to let search engines know how fresh the changes are.

List only available URLs; this is obvious but sometimes missed. Rand Fishkin lists several options for dealing with expired content (number 2 is my favourite).

Sites driven by an internal search engine: I would not create URLs for user searches, as they can be very inconsistent. Stick to structure/category and content pages.

If the website can be included in Google News, there are specific tags for the sitemap; use them. Warning: do not list URLs older than 72 hours in sitemaps for news sites.
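
The News-specific tags wrap a normal <url> entry inside a urlset that also declares the news namespace. An illustrative example (placeholder publication name and URL; check Google's News sitemap documentation for the exact set of required tags):

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <url>
        <loc>http://www.example.org/business/article55.html</loc>
        <news:news>
          <news:publication>
            <news:name>Example Times</news:name>
            <news:language>en</news:language>
          </news:publication>
          <news:publication_date>2009-06-01</news:publication_date>
          <news:title>Example headline</news:title>
        </news:news>
      </url>
    </urlset>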

Do your sitemap submission thing

Always add the path to your robots.txt with a line like this:
Sitemap: http://www.example.org/sitemap.xml

Google
Use Webmaster Tools to submit the sitemap

Yahoo.com http://developer.yahoo.com/search/
http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://example.org/sitemap.xml

Live Search / Bing http://www.bing.com/webmaster/
http://www.bing.com/webmaster/ping.aspx?siteMap=http://example.org/sitemap.xml

Ask.com
http://submissions.ask.com/ping?sitemap=http://example.org/sitemap.xml
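
If the sitemap index is rebuilt automatically, the pings can be automated too. A minimal Python sketch using only the ping endpoints listed above (the sitemap URL is a placeholder):

    # Notify the search engines that the sitemap index has changed.
    import urllib.parse
    import urllib.request

    SITEMAP = "http://www.example.org/sitemap_index.xml"   # placeholder
    PING_ENDPOINTS = [
        "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=",
        "http://www.bing.com/webmaster/ping.aspx?siteMap=",
        "http://submissions.ask.com/ping?sitemap=",
    ]

    for endpoint in PING_ENDPOINTS:
        try:
            urllib.request.urlopen(endpoint + urllib.parse.quote(SITEMAP, safe=""))
        except Exception as exc:        # a failed ping should not break the rebuild
            print("ping failed:", endpoint, exc)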

As usual:
Any comments on this sitemap strategy?
Did I miss something, or am I completely wrong?

Jun 02, 2009
Written by:
Filed under: SEO

21 comments
Jun 03, 2009
Posted by:
Radicke #1

I wonder if those large sites ever delete single articles from those sitemap structures or if they keep all old posts in there...

Especially for newspapers, sometimes you need to delete old articles, and searching for them deep in your sitemap index structure can be a bit tedious...

Jun 07, 2009
Posted by:
Ani Lopez #2

Hi David, thanks for coming by and commenting. There are a few options for managing 'expired content', but in a newspaper it is not really expired; it becomes part of the library.

If content is very old there is probably no reason to include it in the sitemap.xml files, as long as it remains accessible via internal search or similar.

Anyway, I'll try to ask the people I have some contact with.

Oct 08, 2010
Posted by:
Daan #3

Does anyone know what the maximum of Sitemap Writer Pro is? We are working with Magento on a webshop with 700,000 products. Can we use that tool for this?

Oct 26, 2010
Posted by:
BulkDog #4

I have a website with millions of links. I just have 2 .xml.gz sitemaps with the 50k latest URLs each, updated every 5 hours. If I create about 20 sitemaps, will it affect the number of indexed pages? For now Google has only indexed about 140k URLs.
Regards

Oct 27, 2010
Posted by:
Ani Lopez #5

Hi BulkDog,
Take into account that making all these URLs available to be indexed is one thing, and a must, but how many of these URLs Google decides to index is a different matter.

The latter is directly related to the relevance your site has for Google. Matt Cutts explained it in one of his GoogleWebmasterHelp videos: http://www.youtube.com/user/GoogleWebmasterHelp#grid/user/841CB8F9F31BF5D5

Cheers

Nov 05, 2010
Posted by:
Rakesh9 #6

So do those large websites just delete the expired URLs from the existing sitemaps and update them?

Nov 07, 2010
Posted by:
Ani Lopez #7

Hi Rakesh9,
The XML sitemap strategy must reflect whatever has been decided about expired content, but that is a completely different story and a good topic for a new article.

Cheers

Jun 18, 2011
Posted by:
Mark Carter #8

Hi there ... many thanks for this. I'd also be very interested to see what you have to say about handling expired products and sitemap.xml files. For a large site this can be a considerable problem, especially if there is no automatic way to recreate the sitemap from the database directly.

Oct 23, 2011

Good stuff. Even though it's an 'old post' it's still valuable information!

Jan 19, 2012
Posted by:
Jay #10

Can anyone recommend a good 3rd-party sitemap builder that will handle 10 million URLs? Also one that doesn't cost too much; we have been looking but have been unable to find a reliable and cost-effective provider. Thanks!

Jan 19, 2012
Posted by:
Ani Lopez #11

Hi Jay, it must be a huge site if it has 10 million URLs!
When you reach that level a 3rd-party tool is never the solution; your CMS should be the one creating and updating the XML sitemap files you need.

Unfortunately I don't know of any software that could crawl 10 million URLs and create the files for you, but it could be done directory by directory to break the numbers into smaller chunks.

Jan 27, 2012
Posted by:
nobody #12

Guys, how do you create multiple sitemaps? Is there any naming pattern?
cheers

May 30, 2012

Is there any automated tool to generate sitemaps as soon as our new pages get updated?

May 30, 2012
Posted by:
Ani Lopez #14

#12: you don't really need a pattern to name sitemaps, as long as you refer to them in the sitemap index or in Google/Bing WMT.

#13: here you have a list of sitemap generators to use online, on your desktop, or to install on your server:
http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

Jul 12, 2012
Posted by:
nobody #15

Thank you very much for this article. I have been assigned the task of building the sitemap for a large site where there are potentially millions of URLs, mostly based on search channels and modifiers. I have been able to confirm that I have taken the right approach.

With regard to expired webpages, in my case I am regenerating the sitemap daily, thereby removing outdated URLs. I do not have to crawl the site to produce the sitemap; instead I am able to take each single search result and programmatically generate all the URL combinations it falls under. It is a fairly quick process, considering.

With regard to categories, I break the sitemap up into 4 core categories by priority: all-channels, new-channels, all-results, new-results, grouping a maximum of 50,000 URLs per sitemap.

Thanks again.

Jul 12, 2012
Posted by:
Ani Lopez #16

Hi " ",
Trying to index millions of URLs, mostly based on search channels and modifiers, is a really bad idea.
I've seen sites sinking just following that path.
Don't do it.

Jul 14, 2012
Posted by:
nobody #17

Hi Ani,

In our case it has proven to be an effective method (the company has constructed sitemaps in this way for nearly 100 sites, ever since XML sitemaps were introduced). The type of site / its design on a UX level would probably need to be considered when arguing about the side effects, especially since it's a search-based site.

I possibly exaggerated by saying millions; although there is still that potential, we only index those URLs that are considered a higher priority (at this time). Also note that depending on the search channel or modifier, the surrounding/parent content is usually unique with minimal duplication.

If you have another reason why indexing in such a way would have a negative effect, I would love to know.

Regards.

Jul 14, 2012
Posted by:
nobody #18

Just to update on my comment "the type of site / its design on a UX level", I really meant to say the type of content.

Mar 26, 2013
Posted by:
Deepak HM #19

Hi Ani,

Very informative and rare post on sitemap strategy for large sites.

What is the maximum number of sitemap files we can submit? We have submitted 1k URLs in each sitemap, multiplied by 1k sitemaps. That's around 10 lakh (1 million) URLs, but Google Webmaster Tools is showing only 2 lakh (200,000).

Is there any limit on the number of sitemap files we can upload to Google Webmaster Tools?

If we have 23 million URLs, how should we split them and how many sitemaps can we submit?

Aug 19, 2013
Posted by:
jamez #20

How can we make a sitemap for a site like YouTube?

Mar 09, 2014
Posted by:
Rita #21

Hi Ani,
Even though this is an old post, it is still very helpful and relevant.
I hope you will find time to come back to this post and help me with my question.

We have a content publishing site with 5M pages. First of all, I only have the last year's content in the sitemap; everything else that is not relevant I no longer keep in the sitemap.
However, we also rank for news with the same content (articles), and we have a news sitemap.
The news sitemap contains the last 2 days' articles and it updates automatically.
The news sitemap and the last-year sitemap overlap on the last 2 days' articles; would that be a problem?
Should I avoid overlapping my articles between the articles sitemap and the news sitemap, or should it not be a problem in this case?
Thank you very much!

