Forum Moderators: goodroi

Sitemaps for large websites and SEO - add everything or just some?

Sitemaps for large websites. What is the optimum solution?

         

jmccormac

1:24 pm on Dec 5, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some large websites have high numbers (millions or more) of webpages. This can create a problem, with the sitemap files growing every time a new page is added to the site. Some of the pages will be added once and probably never updated again. Each sitemap file is limited to 50K URLs (and 50MB uncompressed) under the sitemaps protocol. As a site grows, the number of sitemap files increases, and search engines like Google continually downloading those files can take up a considerable amount of a site's bandwidth. There is also an SEO aspect with "stale" or evergreen content.
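For reference, the mechanics of the split are simple enough: chunk the URL list into 50,000-entry files and point a sitemap index at them. A minimal sketch in Python (the site root and URL source are placeholders, not this site's setup):

import gzip
from xml.sax.saxutils import escape

SITEMAP_LIMIT = 50_000            # protocol cap per sitemap file
BASE = "https://www.example.com"  # hypothetical site root

def write_sitemaps(urls):
    """Write sitemap-NNNN.xml.gz chunks plus a sitemap-index.xml."""
    names = []
    for i in range(0, len(urls), SITEMAP_LIMIT):
        name = f"sitemap-{i // SITEMAP_LIMIT + 1:04d}.xml.gz"
        names.append(name)
        with gzip.open(name, "wt", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for loc in urls[i:i + SITEMAP_LIMIT]:
                f.write(f"  <url><loc>{escape(loc)}</loc></url>\n")
            f.write('</urlset>\n')
    # The index itself is capped at 50,000 <sitemap> entries.
    with open("sitemap-index.xml", "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in names:
            f.write(f"  <sitemap><loc>{BASE}/{name}</loc></sitemap>\n")
        f.write('</sitemapindex>\n')

Gzipped chunks keep the bandwidth cost down, which matters once crawlers start re-fetching dozens of these files. The splitting itself is mechanical; the harder questions are about what belongs in the files at all.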

Are there any recommended strategies for dealing with sitemaps for large sites?

For SEO purposes, do current pages matter more than historical webpages? Should historical webpages (ones that haven't been updated for some time) be pruned from sitemaps, or should old sitemap files be dropped entirely?

Can large numbers of sitemaps with historical URLs work against a site's SEO and ranking?

Regards...jmcc

lucy24

5:52 pm on Dec 5, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Stock question: What information is in the sitemap that is not deducible from ordinary spidering?

<mild digression>
I’ve seen sites where, after visiting some interior page found via a search engine, I left the page, explored the site ... and then was utterly unable to find my way back to the original page. If the explanation lies in a sitemap, then the site could be said to be feeding inaccurate information to the search engine, concealing the fact that its internal navigation isn't what it should be.
</mild digression>

not2easy

6:07 pm on Dec 5, 2025 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Google spells out what and how here: [developers.google.com...]

But I can't say it is necessarily beneficial to the site to offer that information unless navigation isn't doing the job, particularly when much of the content is legacy (not updated) content.

There are sites that offer evergreen information that has value though, so it may help to have a guide that is kept current. In other words, it depends.

jmccormac

6:40 pm on Dec 5, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Lucy24 The hosting history of about 850 million domain names since 2000, plus the domain name transaction (new/deleted/transfers) stats for hosters (that's another 50M or so). The transaction stats are historical. Of the domain name data, only about 271M are current. Most of the transaction pages can be reached from the navigation; Applebot took about 1M pages in November using this method. I've been working on a more streamlined version of the pages for more recent transactions but haven't finished that yet. The navigation is extremely simplified (year/hoster/transaction pages, with links to the other years in which the hoster is active). The domain data is a single query page. There are also some survey pages and other stats pages.

@not2easy The stats are historical but evergreen in that they provide a snapshot of a hoster's performance. One of the sets of pages that I have held off publishing is the official registry/registrar statistics for ICANN registrars from 2001 to the present. It took a lot of work to reassemble them, as the data quality and integrity were poor (reports published as PDF copies of Excel spreadsheets and other exotic interpretations of data rather than CSV files, which only came into common use in 2014). Generating the sitemaps isn't the problem. Thanks, and I will read through those links tonight. A directory style with page links might be a more efficient way of presenting some of the data and might cut down on the number of URLs that would be included in the sitemaps.

Regards...jmcc

Kendo

11:17 pm on Dec 5, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A lot of what I read about indexing is absolute nonsense.

For example, while websites endeavour to get pages indexed, I have found that merely visiting a web page using Chrome will get it indexed. That is something that we learned not to do when testing pre-release information.

And on the other end of that, while we get hundreds of emails about why certain pages were not indexed, we still have pages in their index that we removed from our website a decade ago.

As for the hype about metrics... ROFL

tangor

5:37 am on Dec 6, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Never been convinced that site maps are more than make-work pie-in-the-sky. Personally never used them. When CLIENTS insisted on site map(s), I did it by the rules and rarely (if ever!) saw any benefit to the CLIENT in engagement with search engines.

What has ALWAYS made sense was accurate site navigation that reveals the content, from top to bottom. If that is in place a site map is superfluous.

jmccormac

12:47 pm on Dec 7, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Tangor The counter-argument to that with large sites is that a site map helps search engines crawl important pages directly rather than going through the discovery process with navigation and crawling everything they find. It may seem insignificant on small websites. On large websites, direct is better than having a search engine traverse a navigation path and crawl pages as it finds them. It also allows the site owner to prioritise pages that are important. The site served about 3.7M pages in November. Most of those were crawls by Google, Bing and Applebot. Applebot was the only one using discovery.

With historical data, the number of queries that a SE would have to make increases, and they can add up. Direct queries would be less expensive in database terms, especially if the navigation is generated from the database and low-priority pages are excluded from crawling.
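As a minimal sketch of that idea (assuming a hypothetical pages table with url, priority and lastmod columns; not this site's actual schema), the sitemap can be built straight from the database, keeping only pages above a priority threshold:

import sqlite3
from xml.sax.saxutils import escape

def priority_sitemap(db_path, min_priority=0.5, limit=50_000):
    """Emit one sitemap of the highest-priority pages, newest first."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT url, lastmod FROM pages "
        "WHERE priority >= ? ORDER BY priority DESC, lastmod DESC LIMIT ?",
        (min_priority, limit),
    )
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, lastmod in rows:
        lines.append(f"  <url><loc>{escape(url)}</loc>"
                     f"<lastmod>{lastmod}</lastmod></url>")
    lines.append("</urlset>")
    con.close()
    return "\n".join(lines)

One SELECT per regeneration is far cheaper than a crawler walking the navigation and triggering a query per page.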

Regards...jmcc

lucy24

6:22 pm on Dec 7, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Shouldn't important pages be reachable within one or two steps of the front page anyway?

jmccormac

7:22 pm on Dec 7, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ideally. Sometimes the architecture of a site develops by accretion, with pages added because it seemed like a good idea at the time. Most pages are within about two steps, but users can also move horizontally as well as vertically. One of the ideas that I picked up from a book on Information Architecture a while ago was that navigation is as much about streamlining and limiting user options as it is about enabling them to find what they want easily. Trying to guess which pages are important to a user is the difficult part, and it has to be weighed against the queries involved in generating the pages.

Regards...jmcc

tangor

10:43 pm on Dec 8, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The site served about 3.7M pages in November.


Served, or does that mean the SITE had 3.7m unique URL pages each hit ONCE? (Meaning that's at least 74 50K-URL sitemaps.) Even with automation, that's a significant amount of work and ... also a distraction for g (and others) to do the sitemap first before churning through all they would have done in the first place---and doing the same thing up to their crawl budget per site.

On REALLY large sites (to me, that is over 500K UNIQUE URLs) a sitemap is useful for the more obscure pages or the LATEST and GREATEST. A sitemap for WHAT'S NEW makes sense ... and the What's New sitemap is webmaster MODERATED on a bi-monthly basis to keep the NEW listed and drop the old "new" to keep things mean and lean. G is more likely to see that (a single sitemap) as useful instead of 74 x 50,000...
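A minimal sketch of that kind of What's New file (Python; the page feed and the 60-day window are assumptions, roughly matching a bi-monthly prune):

from datetime import datetime, timedelta, timezone
from xml.sax.saxutils import escape

def whats_new_sitemap(pages, max_age_days=60):
    """pages: iterable of (url, published) tuples with aware datetimes."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    fresh = sorted((p for p in pages if p[1] >= cutoff),
                   key=lambda p: p[1], reverse=True)
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, published in fresh[:50_000]:
        lines.append(f"  <url><loc>{escape(url)}</loc>"
                     f"<lastmod>{published.date().isoformat()}</lastmod></url>")
    lines.append("</urlset>")
    return "\n".join(lines)

Regenerated on a schedule, the old "new" falls off automatically and the file stays small.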

jmccormac

1:10 am on Dec 9, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Would have to check the logs for the distinct count. That 3.7M was the served-pages number. On the number of unique URLs, the domain name stuff is just over 900M in the last few months. That's the hosting history of domain names since 2000. Only about 271M are currently active.

The stats/transactions pages would add more to that in terms of unique URLs. I was looking at a way of doing a more compressed sitemap (directory-style pages rather than individual URLs). This is because there are about 3.6M new registrations in one gTLD every month. Some of them already exist (reregistrations), so that figure is slightly lower. Then there are the deleted domain names (2.8M). A smaller sitemap with directory pages might be the best approach for a latest/current sitemap. Applebot cannot access the sitemaps, so it has to discover pages.
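Rough arithmetic on that, using the figures above (3.6M new plus 2.8M deleted per month) and an assumed 500 links per directory page:

import math

monthly_urls = 3_600_000 + 2_800_000  # new + deleted, per the figures above
links_per_page = 500                  # assumed directory page size

raw_files = math.ceil(monthly_urls / 50_000)          # 128 sitemap files
dir_pages = math.ceil(monthly_urls / links_per_page)  # 12,800 directory pages
dir_files = math.ceil(dir_pages / 50_000)             # 1 sitemap file

So listing directory pages instead of individual domain URLs collapses 128 monthly sitemap files into one, at the cost of an extra hop for the crawler.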

Regards...jmcc

tangor

9:37 pm on Dec 12, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmm... just taking the base number of 900M, that's 18,000 NO CONTENT URL LISTS (900M / 50K per file). Not sure any search engine will look at that as a useful expenditure of crawl budget...

On the other hand, given the example data, large directory pages (by region, for example) could be very useful for USERS, but once again might fall foul of g. Somewhere in my fuzzy memory I have an old adage that pages with more than 100 links are frowned upon by search engines.

Is the site rendered static or dynamic?

jmccormac

7:42 pm on Dec 15, 2025 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Dynamic on an Athlon X2 server box. :) Serious optimisation, and I learned a lot from the posts here from Markus Frind (I think). Not all of those URLs are worth including in the sitemaps because many domain names are one-hit wonders that are only registered for a year and then never reregistered. A popularity ranking was used for some of the sitemaps, with only domain names registered across TLDs being included. The launch of the new gTLDs in 2012 mushroomed the gTLDs from about 16 to over 1,200. It was possible to extend the algorithm, but most of the registrations in the early phase of the launch of these gTLDs were speculative and many were targeting the popular names.

The URLs have the hosting history of each domain name (which hoster provided DNS for them and whether the hoster was a PPC parker, a sales site, a brand protection hoster or just an ordinary hoster). For someone researching a domain name, it is possible to see if it was registered previously and had any potentially iffy history.

Google tried to kill off web directories about ten years ago. One of the things I am working on is the breakdown of the gTLDs by country and by web hosting provider. I have it at 99.36% resolution for all gTLD websites. These are the stats for gTLDs in some countries for December:

Region - Country - cc - Providers - Websites - Identified - Resolution - Unidentified
AP - Australia - AU - 1,149 - 1,837,220 - 1,837,105 - 99.99% - 115
NA - Canada - CA - 301 - 6,142,334 - 6,106,987 - 99.42% - 35,347
EUR - Germany - DE - 4,886 - 17,601,604 - 17,512,947 - 99.50% - 88,657
EUR - Ireland - IE - 328 - 408,134 - 408,134 - 100.00% - 0
EUR - United Kingdom - UK - 2,219 - 2,013,181 - 1,950,560 - 96.89% - 62,621
NA - United States - US - 2,280 - 130,230,631 - 129,875,598 - 99.73% - 355,033

It would be possible to build a directory of hosting providers from that. It would be different to the domain name data, though it could be made searchable by provider.

(had to re-edit as the WW software ignores tabs)


The data is updated monthly. There are about 1M hosters (DNS providers), but around 600K can be excluded as they only host a single domain name. (Auto-configuration on large registrars creates DNS, MX and website records automatically for some new registrations.) There are millions of transactions (New/Deleted/Transfers) each month, and that current data might be the most interesting for users and SEs. The most interesting for users is actually the deletions, and that directory idea is probably doable for both users and SEs and would cut the size of the current sitemaps considerably.

It really needs a new sitemap strategy to optimise things and focus on quality rather than quantity.

Regards...jmcc

shawnb61

7:18 pm on Apr 8, 2026 (gmt 0)



I view sitemaps & SEO here maybe a little differently... I went without sitemaps for a very long time, & the crawlers had to read a lot of the site to figure out what changed.

I sort my sitemaps in descending order by date. Crawlers (mainly google & bing) don't need to read my whole site now, just look at the last few entries.

So, my content is out there far more quickly, within a couple hours (the SEO aspect), & they don't have to continuously crawl my site all the time. Full site recrawls by googlebot & bingbot are very rare now. They used to do it somewhat regularly.

My site is not huge, but I think the same would apply to massive sites. Just help 'em find the new stuff & get it out there quicker...
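A minimal sketch of that ordering (Python; the record source is a placeholder). Unlike a pruned What's New file, everything stays listed, just newest first:

from xml.sax.saxutils import escape

def date_sorted_sitemap(records):
    """records: iterable of (url, iso_date) pairs; ISO dates sort lexically."""
    entries = sorted(records, key=lambda r: r[1], reverse=True)[:50_000]
    body = "".join(
        f"  <url><loc>{escape(u)}</loc><lastmod>{d}</lastmod></url>\n"
        for u, d in entries
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{body}</urlset>\n")

A crawler that trusts the lastmod values can stop reading as soon as it reaches dates it has already seen.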

jmccormac

6:01 pm on Apr 13, 2026 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If they are only reading the top few entries, that seems to be a bit of a development. It does seem like they are adopting a "fresher is better" approach. It might be an interesting experiment to keep a "fresh" sitemap file with continually changing new URLs.

Regards...jmcc