Sitemap strategy for large sites
Google first introduced Sitemaps in June 2005; MSN and Yahoo! announced support for the protocol in November 2006, and Ask.com followed in April 2007.
You may all know what a sitemap is by now, but what about the sitemap.xml of a large site with 5 million indexed pages that adds more than 3,000 new pages daily? A simple file is not what you need here; a different approach is required, and that is what I want to lay on the table.
Make it easy for search engines
If your site has no sitemap.xml file listing all the pages to crawl, they will still get indexed if the internal link structure is done properly or external links point to them. But why not make it easy for search engines to crawl your site? Quite obvious.
Sitemaps are always beneficial, but I would say 'required' in these cases:
- The content or website is new and you want crawlers to get it all ASAP for a first indexing
- The amount of content is huge and changes very frequently, as with online newspapers, auction, shopping, classifieds and events sites
- A radical URL structure change across the whole site, especially when getting rid of subdomains and moving content to directories*
*A subdomain strategy is a waste of relevance in 99% of cases unless you are somebody like ebay.com (140 million pages indexed) or amazon.com (104 million).
Sitemaps have some limits
Although sitemap files can be compressed with gzip to reduce bandwidth consumption, each sitemap is limited to 50,000 URLs and 10 megabytes.
You will hardly cross that line unless your site is one of the big ones, but multiple sitemap files are supported, so larger sites can use a sitemap index file serving as an entry point for up to 1,000 sitemaps.
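To give an idea of the format, a sitemap index is just another small XML file pointing to the individual sitemaps, something like this (URLs, file names and dates are placeholders, not taken from any real site):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.example.com/sitemap_1.xml.gz</loc>
    <lastmod>2009-05-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap_2.xml.gz</loc>
    <lastmod>2009-05-02</lastmod>
  </sitemap>
</sitemapindex>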
Let's check how the big guys do it.
Analysis of large sites' sitemaps
ebayanuncios.es/ean_sitemap_4_index.xml.gz (sitemap index for 3 sitemaps)
ebayanuncios.es/ean_sitemap_4_global.xml.gz (categories, 5,770 URLs)
ebayanuncios.es/ean_sitemap_4_1.xml.gz (content, 40,000 URLs)
ebayanuncios.es/ean_sitemap_4_2.xml.gz (content, 40,000 URLs)
They use the <lastmod> tag in the sitemap index, and <lastmod>, <changefreq> and <priority> in the sitemaps.
Amazon's robots.txt refers to 10 different sitemaps or sitemap indexes, some grouped by subject. They do not bother maintaining any tag other than <loc>: no <lastmod>, <changefreq> or <priority>.
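For reference, a content entry inside one of those sitemaps, with the three optional tags, would look something like this (the URL and values are placeholders, not taken from eBay):

<url>
  <loc>http://www.example.com/category/some-listing-123</loc>
  <lastmod>2009-05-01</lastmod>
  <changefreq>daily</changefreq>
  <priority>0.8</priority>
</url>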
amazon.com/sitemap-manual-tv.xml (500 URLs)
amazon.com/sitemap_artists_index.xml (sitemap index for 5 sitemaps)
amazon.com/sitemap_artists_0001.xml.gz (50,000 URLs)
amazon.com/sitemap_index_2.xml (sitemap index for 1,000 sitemaps)
amazon.com/sitemap_page_2000.xml.gz (50,000 URLs)
Sports & Outdoors, Jewelry...
amazon.com/sitemap_backfill_dp_index.xml (sitemap index for 800 sitemaps)
Search results? (not sure, this needs further investigation)
amazon.com/sitemap_index_1.xml (sitemap index for 1,000 sitemaps)
amazon.com/sitemap_page_0.xml.gz (50,000 URLs)
CNN focuses on helping search engines find the many new URLs it adds daily. Here they use only the <lastmod> tag.
cnn.com/sitemap_index.xml (sitemap index for 36 sitemaps)
cnn.com/sitemap_specials.xml (85 URLs)
cnn.com/sitemap_month.xml (1,578 URLs)
cnn.com/sitemap_week.xml (452 URLs)
cnn.com/sitemap_topics_set_j_0.xml (3,700 URLs)
cnn.com/sitemap_news.xml (139 URLs)
News Sitemap-specific tags for online newspapers are in use here.
OK, you have to create a sitemap for a large site, so take your time to think about:
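For reference, a Google News sitemap entry adds a news namespace and a few extra tags, roughly like this (a minimal sketch with placeholder values; check Google's documentation for the current tag set):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://www.example.com/news/some-article.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2009-05-01</news:publication_date>
      <news:title>Example headline</news:title>
    </news:news>
  </url>
</urlset>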
- How your content is organised
Does the number of categories change a lot?
Maybe a sitemap for content structure is required. Just category URLs, no content pages.
- Fresh content creation frequency
How many new URLs are created per period of time?
Determine a cut-off to distinguish fresh content from older content. One day? One week? An example with a one-day limit:
First, create a sitemap just for the URLs added since today's midnight.
Then a second, larger one containing URLs from before today's midnight (see the sketch after this list).
- Too much content
Does the number of URLs exceed the sitemap limits?
Work around them by creating a larger sitemap index with more sitemaps, subdividing URLs by content type, category, time or any other logical criterion.
- Sitemap file creation
For fresh content URLs, generate the sitemap dynamically on demand; it consumes few server resources.
For the not-so-fresh URLs in the big sitemap files, a cron job or similar is enough, run as often as your fresh/old cut-off requires. It can run during low-traffic hours, for example, so the server won't suffer as much.
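Putting the one-day example together, the sitemap index could end up looking something like this: a small 'fresh' sitemap regenerated on demand plus bigger archive sitemaps rebuilt by cron (file names and dates are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- URLs created since today's midnight, regenerated on demand -->
  <sitemap>
    <loc>http://www.example.com/sitemap_fresh.xml</loc>
    <lastmod>2009-05-02</lastmod>
  </sitemap>
  <!-- older URLs, rebuilt by a nightly cron job -->
  <sitemap>
    <loc>http://www.example.com/sitemap_archive_1.xml.gz</loc>
    <lastmod>2009-05-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.example.com/sitemap_archive_2.xml.gz</loc>
    <lastmod>2009-05-01</lastmod>
  </sitemap>
</sitemapindex>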
Some things to take into account
Use the <lastmod> tag, especially in sitemap indexes, to let search engines know how fresh the changes are.
List only available URLs; evident, but sometimes missed. Rand Fishkin gives you several options to deal with expired content (number 2 is my favourite).
Internal-search-driven sites: I would not create URLs for user searches, as they can be very inconsistent. Just structure/category and content pages.
If the website can be included in Google News there are specific tags for the sitemap; use them. Warning: do not list URLs older than 72 hours in sitemaps for news sites.
Do your sitemap submission thing
Always add the path to robots.txt with a line like this:
Use Webmaster Tools to submit the sitemap
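Sitemap: http://www.example.com/sitemap_index.xml
(the domain and file name above are just placeholders; use the full URL of your own sitemap or sitemap index)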
Live Search / Bing http://www.bing.com/webmaster/
Any comments on this sitemap strategy?
Did I miss something, or am I completely wrong?