Coding Horror
programming and human factors
by Jeff Atwood

October 12, 2008

The Importance of Sitemaps

So I've been busy with this Stack Overflow thing over the last two weeks. By way of apology, I'll share a little statistic you might find interesting: the percentage of traffic from search engines at stackoverflow.com.

Date          Milestone                                 Search engine traffic
Sept 16th     one day after public launch               10%
October 11th  less than one month after public launch   50%

I try to be politically correct in discussing web search, avoiding the g-word whenever possible, desperately attempting to preserve the illusion that web search is actually a competitive market. But it's becoming a transparent and cruel joke at this point. When we say "web search" we mean one thing, and one thing only: Google. Rich Skrenta explains:

I'm not a professional analyst, and my approach here is pretty back-of-the-napkin. Still, it confirms what those of us in the search industry have known for a long time.

The New York Times, for instance, gets nearly 6 times as much traffic from Google as it does from Yahoo. Tripadvisor gets 8 times as much traffic from Google vs. Yahoo.

Even Yahoo's own sites are no different. While it receives a greater fraction of Yahoo search traffic than average, Yahoo's own flickr service gets 2.4 times as much traffic from Google as it does from Yahoo.

My favorite example: According to Hitwise, [ex] Yahoo blogger Jeremy Zawodny gets 92% of his inbound search traffic from Google, and only 2.7% from Yahoo.

That was written almost two years ago. Guess which way those numbers have gone since then?

Google generally does a great job, so they deserve their success wholeheartedly, but I have to tell you: Google's current position as the start page for the internet kind of scares the crap out of me, in a way that Microsoft's dominance over the desktop PC never did. I mean, monopoly power over a desktop PC is one thing -- but the internet is the whole of human knowledge, or something rapidly approaching that. Do we really trust one company to be a benevolent monopoly over.. well, everything?

But I digress. Our public website isn't even a month old, and Google is already half our traffic. I'm perfectly happy to feed Google the kind of quality posts (well, mostly) fellow programmers are creating on Stack Overflow. The traffic graph provided by Analytics is amusingly predictable, as well.

stackoverflow.com traffic graph, sep. 16 - oct. 11

Giant peak of initial interest, followed by the inevitable trough of disillusionment, and then the growing weekly humpback pattern of a site that actually (shock and horror) appears to be useful to some people. Go figure. Guess they call it crackoverflow for a reason.

We knew from the outset that Google would be a big part of our traffic, and I wanted us to rank highly in Google for one very selfish reason -- writing search code is hard. It's far easier to outsource the burden of search to Google and their legions of server farms than it is for our tiny development team to do it on our one itty-bitty server. At least not well.

I'm constantly looking up my own stuff via Google searches, and I guess I've gotten spoiled. I expect to type in a few relatively unique words from the title and have whatever web page I know is there appear instantly in front of me. For the first two weeks, this was definitely not happening reliably for Stack Overflow questions. I'd type in the exact title of a question and get nothing. Sometimes I'd even get copies of our content from evil RSS scraper sites that plug in their own ads of questionable provenance, which was downright depressing. Other times, I'd enter a question title and get a perfect match. Why was old reliable Google letting me down? Our site is simple, designed from the outset to be easy for search engines to crawl. What gives?

What I didn't understand was the importance of a little file called sitemap.xml.

On a Q&A site like Stack Overflow, only the most recent questions are visible on the homepage. The URL to get to the entire list of questions looks like this:

http://stackoverflow.com/questions
http://stackoverflow.com/questions?page=2
http://stackoverflow.com/questions?page=3
..
http://stackoverflow.com/questions?page=931

Not particularly complicated. I naively thought Google would have no problem crawling all the questions in this format. But after two weeks, it wasn't happening. My teammate, Geoff, clued me in to Google's webmaster help page on sitemaps:

Sitemaps are particularly helpful if:

  • Your site has dynamic content.
  • Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
  • Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
  • Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.

I guess I was spoiled by my previous experience with blogs, which are almost incestuously hyperlinked, where everything ever posted has a permanent and static hyperlink attached to it, with simple monthly and yearly archive pages. With more dynamic websites, this isn't necessarily the case. The pagination links on Stack Overflow were apparently enough to prevent full indexing.

Enter sitemap.xml. The file itself is really quite simple; it's basically a non-spammy, non-shady way to have a "page" full of links that you feed to search engines. A way that is officially supported and endorsed by all the major web search engines. An individual record looks something like this:

<url>
<loc>http://stackoverflow.com/questions/24109/c-ide-for-linux</loc>
<lastmod>2008-10-11</lastmod>
<changefreq>daily</changefreq>
<priority>0.6</priority>
</url>

The above element is repeated for each one of the ~27,000 questions on Stack Overflow at the moment. Most search engines assume the file is at the root of your site, but you can inform them of an alternate location through robots.txt:

User-Agent: *
Allow: /
Sitemap: http://stackoverflow.com/sitemap.xml
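
As a rough illustration of how a crawler might pick up that Sitemap line, here is a minimal Python sketch using the standard library's urllib.robotparser (the site_maps() method needs Python 3.8 or later). The robots.txt text is a stand-in rather than Stack Overflow's actual file, it assumes the Sitemap line carries a full URL as the sitemaps.org protocol describes, and sitemap_urls is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Stand-in robots.txt content (an assumption for this sketch, not the
# real file served by the site).
ROBOTS_TXT = """\
User-Agent: *
Allow: /
Sitemap: http://stackoverflow.com/sitemap.xml
"""


def sitemap_urls(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    for url in sitemap_urls(ROBOTS_TXT):
        print(url)
```

A robots.txt with no Sitemap line simply yields an empty list here, at which point a crawler would presumably fall back to probing the site root.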

There are also limits on size. The sitemap.xml file cannot exceed 10 megabytes in size, with no more than 50,000 URLs per file. But you can have multiple sitemaps in a sitemap index file, too. If you have millions of URLs, you can see where this starts to get hairy fast.
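
To make the splitting concrete, here is a minimal Python sketch that chunks a list of URLs into files of at most 50,000 entries and writes a sitemap index pointing at them. It is not how Stack Overflow actually generates its sitemap. The file names (sitemap-1.xml, sitemap_index.xml), the function names, and the sample URLs are all assumptions, and the sketch does not check the 10 megabyte limit.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit quoted above


def build_urlset(urls, lastmod):
    """Serialize one sitemap file holding a <url> record per URL."""
    root = ET.Element("urlset", xmlns=NS)
    for loc in urls:
        url = ET.SubElement(root, "url")
        # ElementTree escapes characters such as & in the URL text.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_locs, lastmod):
    """Serialize a sitemap index that lists the individual sitemap files."""
    root = ET.Element("sitemapindex", xmlns=NS)
    for loc in sitemap_locs:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = loc
        ET.SubElement(entry, "lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


def split_sitemaps(urls, base, lastmod, max_urls=MAX_URLS):
    """Return {file name: XML text} for the chunk files plus the index."""
    files = {}
    locs = []
    for n, start in enumerate(range(0, len(urls), max_urls), start=1):
        name = f"sitemap-{n}.xml"  # hypothetical naming scheme
        files[name] = build_urlset(urls[start:start + max_urls], lastmod)
        locs.append(f"{base}/{name}")
    files["sitemap_index.xml"] = build_index(locs, lastmod)
    return files


if __name__ == "__main__":
    # Made-up question URLs, purely for illustration.
    urls = [f"http://stackoverflow.com/questions/{i}" for i in range(1, 120_001)]
    files = split_sitemaps(urls, "http://stackoverflow.com", "2008-10-11")
    print(sorted(files))
```

With 120,000 URLs this produces three chunk files and one index, so the search engine only needs to be told about the index.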

I'm a little aggravated that we have to set up this special file for the Googlebot to do its job properly; it seems to me that web crawlers should be able to spider down our simple paging URL scheme without me giving them an explicit assist.

The good news is that since we set up our sitemap.xml, every question on Stack Overflow is eminently findable. But when 50% of your traffic comes from one source, perhaps it's best not to ask these kinds of questions.

pixelated google overlords

Just smile and nod and follow the rules like everyone else. I, for one, welcome our pixelated Google overlords!


Posted by Jeff Atwood

Comments

I thought sitemaps would be one of the first things a webmaster would build when launching a website (whether dynamic or not).

Jaryl on October 13, 2008 07:05 AM

I certainly never needed a sitemap on codinghorror.com.

Jeff Atwood on October 13, 2008 07:07 AM


'There are also limits on size. The sitemaps.xml file cannot exceed 10 megabytes in size, with no more than 50,000 URLs per file. But you can have multiple sitemaps in a sitemap index file, too. If you have millions of URLs, you can see where this starts to get hairy fast.'

I can see how these constraints will lead to some difficult-to-maintain hacks for the sitemap file for stackoverflow. There has got to be a simpler way for the Googlebot to work correctly. Hairy indeed.

o.s. on October 13, 2008 07:12 AM

SO is nice, but CodingHorror is still my crack of choice. Welcome back.

Charles on October 13, 2008 07:13 AM

It doesn't sound very scalable - that file must be a real hotspot for a site with the amount of activity that Stack Overflow gets.

Also, how are you determining the changefreq and priority for individual questions?

John Topley on October 13, 2008 07:13 AM

Scalability is (should be) a non-issue for sitemap.xml. The purpose of this file for a large site isn't to list hundreds of thousands of unique URLs at once, but rather to allow spiders to discover these urls ONE TIME. Once the initial discovery has happened, Google should (for a high-traffic, widely-linked-to site) continue to spider those URLs, which in turn link to neighbor URLs, and so forth. In this way you can spider an entire site of many 10,000s of URLs via a few thousand URLs in the sitemap.

www.codingthewheel.com on October 13, 2008 07:20 AM


'I certainly never needed a sitemap on codinghorror.com.'
Jeff I think in the case of Coding Horror there were plenty of trackbacks and other blogs that linked to your posts making it easier for the somewhat dimwitted Googlebot to find your posts.

o.s. on October 13, 2008 07:24 AM

Good post, interesting!

On a side note, if I click the www.codingwheel.com author name in the comment above in Firefox 3, I get a "content encoding error, page cannot be displayed" message. Just a heads up =p

Tom J Nowell on October 13, 2008 07:31 AM

Welcome back!

Eduardo Diaz on October 13, 2008 07:35 AM

You don't *need* the sitemap -- you can wait till Googlebot gets around to indexing your site. Apparently, that wasn't good enough for you. Don't blame your impatience on the poor bot ;)

x on October 13, 2008 07:39 AM

You may be drawing causality from coincidence on the sitemap.

The Google algorithm usually displays new sites high in the rankings immediately. Then, "sandboxes" them for a few days/weeks, until they gain PageRank. Finally, they pop back to an accurate position.

During that sandboxed period, it's normal to search for unique terms and find other (not sandboxed) sites, yet not your own.

I've seen that pattern play out with every new site I launch, independent of SEO efforts (including sitemaps).

Dave Ward on October 13, 2008 07:43 AM

That's very interesting. I'd heard vaguely of the idea of sitemaps but had no idea it could make such a huge difference.

I do find it vaguely disturbing that my first instinct after reading this was to find and click the "Upvote" button.

Mark Biek on October 13, 2008 07:47 AM

At last. Something I already knew that Jeff didn't!

jake on October 13, 2008 07:55 AM

hi jeff,

wouldn't links like (naive example, i know)

http://stackoverflow.com/questions/page/2/sort/hot

instead of

http://stackoverflow.com/questions?page=2&sort=hot

also "convince" google to follow all your links?

i think google is not happy with the "dynamic" parts of the url e.g. "?" or "&" ....

Marcel Sauer on October 13, 2008 08:04 AM

Very interesting. I've often wondered about this kind of thing, but never really did any investigation (as I'm not a webmaster)... I'm hoping my blog on blogspot (being owned by google) will be indexed properly... But I'll definitely keep this in mind with future websites.

http://www.samalamadingdong.com

sam on October 13, 2008 08:08 AM

I tend to read stackoverflow from the google.com/ig page of feeds. You think that impacts your numbers or not? Should that traffic be attributed to Google?

I personally don't think so.

Dan on October 13, 2008 08:09 AM

Welcome Back

Ahmed on October 13, 2008 08:16 AM

Jeff, you might want to check out this StackOverflow question:

http://stackoverflow.com/questions/72394/what-should-a-developer-know-before-building-a-public-web-site

Joel Coehoorn on October 13, 2008 08:33 AM

So Microsoft doing the same thing is ok but Google taking over the world is... well at the end of the post you seem ok with that too?

Uh... ok. I think a little competition would be a good thing in this case.

Cecil on October 13, 2008 08:35 AM

I regularly find that most of the traffic to one of my static sites comes from places like webcrawler.com, shopping.com, dealtime.com and aol.com, with google often ranking 5th or 6th on my new traffic statistics.

Strangely, the site has absolutely nothing to do with any saleable or commercial product. Go figure.

Mark on October 13, 2008 08:38 AM

Use http://stackoverflow.com/questions/page/3

Nicolas on October 13, 2008 09:27 AM

I adore self-referential posts as much as the next guy, but did you really have to link to stackoverflow twice (four times if you count crackoverflow, since it links there, too, and your post relating to the launch of stackoverflow)?

Stephen on October 13, 2008 09:33 AM

Why don't you just have your sitemap link to the pages of the lists of questions? Would the bot not then see all the questions, but with far fewer entries in the sitemap?

Justin on October 13, 2008 09:41 AM

welcome back jeff. i'm horrored again.

Jin on October 13, 2008 09:44 AM

Google's not the real start page for the Internet -- there can only be one Opening Page:
http://www.openingpage.com/

Neil (SM) on October 13, 2008 10:01 AM

> Google's current position as the start page for the internet kind of scares the crap out of me

I know what you mean. But we do have Wikipedia too. I think that’s mainstream enough to offer a reasonable alternative if Google goes evil.

Of course, people would have to notice Google going evil.

Paul D. Waite on October 13, 2008 10:16 AM

> Google's current position as the start page for the internet kind of scares the crap out of me, in a way that Microsoft's dominance over the desktop PC never did.

Let's say that Microsoft did something utterly, absolutely, unequivocally evil--fed puppies into wood chippers while chanting Satanic prayers, say. What would it take for their users to move elsewhere? You'd need to install a new operating system. Do you use a non-web email client? It doesn't run on your new OS. Your games? None of them run. Word and Excel? Gone. Your music collection? Hope it doesn't use Windows-only DRM. And so on.

Now consider Google, instead. What does it take to stop using Google search? Nothing. You just...stop. Spend thirty seconds changing your homepage, maybe, and changing the default search engine in your browser.

Microsoft's monopoly position has *leverage*. You can choose not to use their products, but you can only do so by cutting yourself off from everyone who views the world through Windows-colored glasses.

Google, in contrast, has a fragile monopoly. I can choose to stop using Google right now, without cutting myself off from anything.

And that is why it's absurd to compare Microsoft's several monopolies, which they have repeatedly leveraged to dominate additional markets, with Google's transient and fragile position as the de facto default search engine of the net.

Damien Neil on October 13, 2008 10:34 AM

[not directly related to post topic]

I like the pixelated images. They look pretty detailed and accurate.

I wonder if there's software that outputs similar effects using a real photo.

Abdu on October 13, 2008 10:35 AM

Isn't the weekly cycle simply showing less activity at weekends?

Y on October 13, 2008 11:12 AM

Isn't the weekly cycle just showing less activity at weekends?

Y on October 13, 2008 11:13 AM

The reason it doesn't follow your archived posts (up to 931) is because it tries to find repetitive cycles and exit on them (which I think is true whether using querystring or just in the url). It finds a pattern of repeating numbers, goes to maybe 10 or so, and then breaks the loop.

I don't recall where I read it, but just imagine if Google followed every calendar mini-app on the Internet forever: it would never stop.

Mark on October 13, 2008 11:16 AM

How about creating:
http://stackoverflow.com/questions?page=1&sort=oldest
In which page 1 has the oldest posts. That way, page 1-2k never change all that much (votes etc change, but question URLs don't). Then just put that URL till the newest page in your sitemap, you'll scale a lot better.

wds on October 13, 2008 11:28 AM

Er, did I just expose a bug with that link or something? It seems to be showing questions sorted by vote, but claims it's by hotness.

wds on October 13, 2008 11:30 AM

Hey Now Jeff,
Great points!
Coding Horror Fan,
Catto

Catto on October 13, 2008 11:53 AM

Thanks for this review really good info

the news empire on October 13, 2008 11:56 AM

Those pixeled pictures look like Transport Tycoon managers.

Hoffmann on October 13, 2008 12:02 PM

@Damien Neil:

First off, the Google login also ties people's emails, browsing history, web browser (in the case of Chrome), etc., together :). Many sites have search functions "powered by Google", and of course, their ads are everywhere and fund sites all the time which are much harder to avoid.

But the problem is that it's not really about you changing search engines. The search near-monopoly might be leveraged in the advertising field. In the same way as it's not really about Microsoft cleaving to the x86 (and x64 and Itanium) architectures, it's about the other end where the desktop OS might be leveraged in other software markets.

In neither case am I convinced that anything is going on right now, but you have to be vigilant.

The other potentially scary things about Google are the information they can gather on you, which you really can't stop (only cut off the flow of new information, and even then with tracking pixels and so forth you can't truly cut them off), and the risk of censorship, which Google has a fairly decent record on, China notwithstanding, and is more easily avoided by switching search engines...if you know that the censorship is going on (it doesn't even have to be intentional censorship -- you'll have a hard time displacing wikipedia even if you make a better product in large part because Google will likely rank wikipedia #1 for just about every possible search for a while to come).

Ens on October 13, 2008 12:31 PM

sitemap.xml is not just a Google thing. Every major search engine understands it.

Sitemapper on October 13, 2008 12:37 PM

Hey Jeff, do you still know of any queries where Google isn't doing well in terms of returning your pages when you searched for the title of an article on stackoverflow.com? I'd be very interested to hear of concrete example queries that I could convey back to the crawl/indexing/ranking folks over at Google. We always want to improve or else people will go to another search engine.

Matt Cutts on October 13, 2008 12:38 PM

I can't tell enough from what you've written if it really was a sitemaps file solving the problem or other issues that can be common to a new site launching.

Regardless, glad it was fixed. But I have to completely disagree with your idea that you somehow shouldn't need to do something "special" for Google to do its job properly...

First, Sitemaps is a common standard supported by Microsoft and Yahoo, as well.

Second, people do all types of things in coding pages to accommodate those using particular browsers, particular plug-ins and so on. Search engines are effectively the most common browser out there. Taking a minor amount of effort to ensure your site renders in them properly can deliver, as you have found, a huge amount of traffic gain. So if you're having to consider them a bit, that's just the routine of good web development, not something "special" in my book.

Third, by and large, you don't have to do stuff to be crawled. The vast majority of web sites don't use sitemaps and get indexed just fine.

Danny Sullivan on October 13, 2008 12:45 PM

Seems like the time to develop a sitemap.xml (and the strategy for maintaining it as it grows) is a heck of a lot cheaper and more cost effective than creating the search engine. Since you were already wanting to just leverage Google for that feature, the "cost" associated with sitemap.xml doesn't seem all that bad.

So, rather than being "...a little aggravated that we have to set up this special file..." you should just go back to being happy that the problem of search was solved so easily for you.

Be an optimist.

Bill on October 13, 2008 01:09 PM

Yay Jeff posted!

Pat H on October 13, 2008 01:47 PM

Thanks for this gem of insight. One to add in to my new site rebuild.

Keep that coding crack coming!

Jason Snelders on October 13, 2008 02:33 PM

@Matt Cutts:
For instance, "n-ary trees in c" and every other variant I've tried thus far fails to return http://stackoverflow.com/questions/189855/n-ary-trees-in-c#189900 . I wouldn't feel too bad about it though; stackoverflow's own search can't seem to find it either ;-)

Great work on your blog, by the way (you too, Jeff)!

Matt Johnson on October 13, 2008 04:05 PM

Sorry, the link should be http://stackoverflow.com/questions/189855/n-ary-trees-in-c (I don't expect Google to actually link to my post directly ;-)

Meta-comment: The fact that I don't have to create an account is great, but the fact that you can't edit posts makes for a lot of comment chaff (like this). Is there any way one could allow post editing based on e.g. possession of a cookie?

Matt Johnson on October 13, 2008 04:10 PM

Hi Jeff,
What do you think about the relevance or non-relevance of using meta tags in webpages ?

Has it gone out of vogue or is it still useful?

Uma on October 13, 2008 05:30 PM

> Google's current position as the start page for the internet kind of scares the crap out of me, in a way that Microsoft's dominance over the desktop PC never did.

Wow, I thought I was the only one. I feel exactly the same here. Google scares me a lot more than MS has ever done. It's probably a great place to work but when I read Steve Yegge's blog I can't help thinking it's more like a cult than a company, but maybe that's just me...

Andreas on October 13, 2008 11:40 PM

Why not reserve the page ID's, so that page 1 always lists the first 10 questions ever posted, page 2 always lists questions 10 to 20, etc, etc.

http://stackoverflow.com/questions?page=2

And when the page number is not specified, default to the last page (e.g. 9182).

That way the page content does not change every time the search engine indexes it.

Craig Francis on October 14, 2008 01:23 AM

@Matt Johnson:
Of course now that you posted that link to codinghorror, that article is the top result: http://www.google.com/search?q=N-ary+trees+in+C
;)

Qvasi on October 14, 2008 05:02 AM

@Jeff:
I agree with Damien Neil. I'll have to admit that, as a Microsoft guy, you don't surprise me by saying that Google's dominance scares you more than Microsoft, but let it be heard that changing your internet homepage is A LOT easier, quicker, cheaper, and less hassling than changing your desktop Operating System.

Say it takes you ten seconds to change your homepage, and three hours to overhaul your PC. Mathematically, that makes getting rid of Microsoft 1,080 times as time-consuming as getting rid of Google.

Google shuts its doors this afternoon, millions of people will start using Yahoo, Ask, or MSN instantly. Microsoft does the same, and millions of people have no idea what to do when something goes wrong.

Market dominance always leads to complications, but as dependent as we choose to make ourselves on the web, it's still a very volatile place.

Chris on October 14, 2008 06:17 AM

crackOverFlow.com has needles in its logo. But crack is smoked not injected.

ogem on October 14, 2008 06:28 AM

Hmmm...would it be possible I wonder to do something crazy with Mod_Rewrite to generate this on the fly? If you use a database-backed site, you could write a script which dynamically kicks out the current state of the site into sitemap format and then rewrite requests for sitemap.xml to your script.

The bot requests sitemap.xml and receives an up-to-the-second sitemap. If you've got a big site, I'm sure it wouldn't be too hard to feed the bot a bunch of dynamically generated 50k-per-file sitemaps.

This would certainly get past the maintainability problem, but would it work?

Dave on October 14, 2008 06:30 AM

I guess the need for sitemaps is simple: if a webcrawler starts "going wild" it may end in links like, I dunno http://site.com/questions?page=12345

And a human can tell there's nothing on page 12345 BUT A BOT CAN'T

James on October 14, 2008 06:54 AM

Hehe I tried to see how you had it implemented now (if at all), but your robots.txt points to sitemap.xml, which doesn't exist (yet?).
Now I'm out to try and find a sitemap file on other websites, see how they have it covered. I'm really interested what would be the best way to get your whole site (eg all the questions on SO) indexed within the limits of the sitemap file.

Bucket on October 14, 2008 07:07 AM

-Addition-
Google: Lists a LOT of URL's, not really useful
Wikipedia: no Sitemap file
MSDN Social: Now this is interesting, not only do they specify multiple sitemap files inside their robots.txt (probably one for each category), they have crawler specific URL's in them, probably each generating a list of posts (http://social.msdn.microsoft.com/robots.txt).
Tweakers.net (Dutch tech site): lists multiple Sitemap files in their sitemap file, each pointing to a range of ID's (http://tweakers.net/sitemap_index.xml).

This is all very interesting!

Bucket on October 14, 2008 07:11 AM


Thanks Jeff, this sitemap thing is new to me, too !


>> Wikipedia: no Sitemap file
Maybe this is a stupid question but how did you find that ?

Mediocre-Ninja.blogspot.com on October 14, 2008 07:44 AM

@Qvasi: Ahh, I should have remembered the laws of internet quantum mechanics.. if you complain about something not being found by Google on a popular website, it will be found ;-)

Matt Johnson on October 14, 2008 08:08 AM

I have iGoogle as my start page of browser. It has nice layout for stuff and links. Plus I use Google as a search engine. I have 2 google search bars in my browser too. They remember my last searched words and they suggest words. So when I write "c" to the search box, it suggests at the top "Coding Horror". So many times I don't have to write but one letter and I get what I need without Google.com page even.

Google remembers posts that I have posted to a forum but a moderator deleted for some reason. I had a valid post, but it was deleted along with some other people's posts. I typed a couple of words into Google and I got my post back.

Silvercode on October 14, 2008 09:46 AM

Two things:

1. The dominance of google is a little uncomfortable, but will become less so if they ever provide us developers a way to force an include of symbols we need in a search. Quotes don't do anything when you're looking for information on anything that has a $ in it, for example. If you want to look up something for, say, the "$get" asp.net ajax shortcut, you'll be sadly disappointed on how impossible it is to sort through all the results that google searching for "get" will give you. :P

2. You must be doing something really right; I was googling around recently for clarification on an answer I was writing to an SO question that had just been posted, and the very first result on google was... the question I was answering on SO. Which somehow got indexed within about 5 minutes of it being posted!

Grank on October 14, 2008 09:48 AM

Working on a small crawler for a few years I have at least concluded for myself that the idea of using sitemaps is completely reasonable. One problem that has come up is exactly what Mark above stated about crawling calendar widgets on sites.

Another issue focused on stores encoding the breadcrumb path in the URL with their unique ID, combined with the bug(?) of being able to essentially cycle through them infinitely, producing a URL such as /1/2/3/1/2/3/2/4/3/product.asp?product_id=987 and having a breadcrumb of Baseball > Jerseys > Mets > Baseball > Jerseys > Mets > Jerseys > Away Jerseys > Mets.

I think Google (and by Google, I mean the entire Search Engine Conglomerate) would be a lot more likely to believe your claim of, "I really do have 1,000 pages each linking to interesting questions on my site!" if it weren't for simple bugs (or were smarter about making it explicit to not follow links on a calendar for example) that other websites have in their linking.

Dan on October 14, 2008 11:05 AM

Jeff,

I think you are uniquely qualified to address a post by the Google Webmaster blog eloquently. I only wish I had the knowledge and the audience to truly address their claim that dynamic URLs are properly indexed. The evidence in this post seems to say otherwise, however.

http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html

As to the content of this post:

I did a redevelopment of http://sixteencolors.net almost a year ago. I switched to "static" (URL Rewriting) URLs and created sitemap XML files. My hits from search engines almost immediately increased 50-100x the previous numbers. I previously received 1-5 visits from search engines, and the number now floats around 150-200. I didn't really advertise anymore (I was linked from a number of obscure blogs) and the number has remained steady since that time. My belief is that a mixture of the sitemaps, static urls, and switching to semantic HTML has been the reason for the increased number of hits.

Doug Moore on October 14, 2008 11:33 AM

Great post, Jeff! Glad to see you are back!

I didn't see anyone had posted this, but a great place to learn about sitemap protocol is here:

http://www.sitemaps.org/

Jack on October 14, 2008 11:54 AM

Google has no real leverage to preserve their monopoly. The minute they start under-performing, they're vulnerable to competition.

The people who have any reason to be scared are those who have all their email, calendars, and personal data stored in Google's cloud.

Matias Nino on October 14, 2008 12:31 PM

What a joke.

spoiled_blogger on October 14, 2008 01:23 PM

I have hundreds of interior categories on one of my sites that I wish would show up in search engines. I added a sitemap.xml, but that did not help. I kind of think you need a few (dozen?) outside links into your underlying categories for the search engines to take them seriously.

Very interesting stats on the power of Google, btw, thank you.

Ted Murphy on October 14, 2008 08:35 PM

Hi Matt

> Hey Jeff, do you still know of any queries where Google isn't doing well in terms of returning your pages when you searched for the title of an article on stackoverflow.com?

Now that sitemap.xml is in play, everything is working exactly as I would expect it to. We've done some pruning in robots.txt to remove duplicates, but that's about it.

> I'd be very interested to hear of concrete example queries that I could convey back to the crawl/indexing/ranking folks over at Google

The only suggestion I have is the one in the article -- it's a bit disappointing that googlebot couldn't seem to crawl all our questions, as they are directly hyperlinked from each page:

http://stackoverflow.com/questions
http://stackoverflow.com/questions?page=2
http://stackoverflow.com/questions?page=3

However, now that I've implemented sitemap.xml, I'm starting to come around to the concept. It's probably more efficient to feed search engines a sitemap.xml file containing links to each question, than it is for us to serve up full pages of markup, javascript, etc to every single search engine out there!

Jeff Atwood on October 15, 2008 12:38 AM

Jeff,

You might try using <LINK rel="Next|Prev" href="nextquestionURL.html"> in the <HEAD> of each StackOverflow question. Lots of blog templates use it. The W3C spec is here: http://www.w3.org/TR/REC-html40/struct/links.html#h-12.3

Also, I agree with other posters here: Google prefers stackoverflow.com/page/2/ to stackoverflow.com/questions?page=2 because Google wants to index stable pages, not the results of dynamic queries.

OT for WordPress users: there's a good WordPress plugin to generate sitemap.xml. Search for "wordpress sitemap generator", it's the top result.

Nathan Bowers on October 15, 2008 02:18 AM

> There are also limits on size. The sitemaps.xml file cannot exceed 10 megabytes in size, with no more than 50,000 URLs per file. But you can have multiple sitemaps in a sitemap index file, too. If you have millions of URLs, you can see where this starts to get hairy fast.

You can have several sitemaps for your site and even several sitemap indexes, so there are no scalability problems.

Also consider submitting your sitemap.xml through Google Webmaster Tools -- it will allow you to see useful statistics/warnings/errors (for example, if some pages in your sitemap.xml can't be accessed by Google).
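To make the splitting concrete, here is a minimal Python sketch that breaks a URL list into files of at most 50,000 entries and builds a sitemap index pointing at them. The file names (`sitemap-1.xml` and so on), the `http://example.com/` base, and all function names are assumptions for illustration. The sketch enforces only the URL count, not the 10 megabyte size limit quoted above.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit quoted above
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield successive lists of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return the XML text of one sitemap file listing `urls`."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )


def build_sitemap_index(sitemap_urls):
    """Return the XML text of a sitemap index pointing at each sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )


def build_all(urls, base="http://example.com/"):
    """Split `urls` into sitemap files; return (index_xml, {filename: xml})."""
    files = {}
    for n, part in enumerate(chunk(urls), start=1):
        files[f"sitemap-{n}.xml"] = build_sitemap(part)
    index = build_sitemap_index([base + name for name in files])
    return index, files
```

With 120,000 URLs this produces three sitemap files plus one index, so the per-file limit is a bookkeeping detail rather than a hard ceiling on site size.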

Roman on October 15, 2008 02:45 AM

For those wanting to know where you find out about sitemaps, take a look at Evil Google's webmaster tools and read the sitemaps section.

Here's a direct link to their info: http://www.google.com/support/webmasters/bin/answer.py?answer=40318&hl=en

nic on October 15, 2008 02:48 AM

"I'm a little aggravated that we have to set up this special file for the Googlebot to do its job properly; it seems to me that web crawlers should be able to spider down our simple paging URL scheme without me giving them an explicit assist."

Interesting point, tying back to Google's current dominance of the search market. You'd expect spiders to go out of their way to index websites, and in the case of a startup launching something new and eager to get more information and more usage for example, they'd try to do just that. But Google being the biggest show in town, gets to reverse the rules. If you want your website properly indexed, you're expected to play by the rules - if you don't, it's your loss, not theirs.

That aside, I concur that the minute that Google violates their "Don't be evil" motto (if it actually happens, and I hope it doesn't) and crosses one line or other, it won't be very hard for disgruntled internet users to switch search engines.

Yannis on October 15, 2008 07:20 AM

"Google has no real leverage to preserve their monopoly. The minute they start under-performing, they're vulnerable to competition."

Well, when your company has become the default VERB for the task you perform, that may not be as true as you think, especially when that task is *INFORMATION DELIVERY*. After all, suppose a search engine came along that did everything Google did, and added a few brilliant features that we could really use. How would we find out about this wondrous product? Well, we'd... google... for...
Ah.

It'd be like trusting a financial advisor to tell you that another company would be a safer bet than his own. It'd be nice if he did, but would you put money on it?

Tom Clarke on October 15, 2008 07:32 AM

My work has now blocked access to stackoverflow

So sad!!!

TehOne on October 15, 2008 10:17 AM

@Mediocre-Ninja.blogspot.com on October 14, 2008 07:44 AM:

>>> Wikipedia: no Sitemap file
>Maybe this is a stupid question but how did you find that ?

On the wikipedia site the robots.txt has no reference and sitemap.xml doesn't exist.
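For anyone who wants to repeat that check on another site, here is a small Python sketch that scans the text of a robots.txt file for `Sitemap:` lines. The function name is made up, and fetching the file is left out so the example stays self-contained; it only parses text you have already downloaded.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return the URLs named by `Sitemap:` lines in robots.txt text."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, sep, value = line.partition(":")
        # the field name is matched case-insensitively
        if sep and field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found
```

An empty list means robots.txt does not advertise a sitemap. A sitemap.xml could still exist at the site root, so requesting that URL directly is the other half of the check.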

Bucket on October 15, 2008 11:27 AM

@TehOne
If your employer is blocking your access to informative websites that will help you to do your job then it's time to find new employers.

o.s. on October 15, 2008 01:13 PM

> It'd be like trusting a financial advisor to tell you that another company would be a safer bet than his own. It'd be nice if he did, but would you put money on it?

http://whimsley.typepad.com/whimsley/2008/03/mr-googles-guid.html

Jeff Atwood on October 15, 2008 01:18 PM

> My work has now blocked access to stackoverflow

Does your work use blocking software?

We had a problem early on with Websense. I followed up with Websense on September 10th, and stackoverflow.com is currently classified "information technology". Just as an example.

Jeff Atwood on October 15, 2008 01:21 PM

"By way of apology, I'll share a little statistic you might find interesting: the percentage of traffic from search engines at stackoverflow.com."

This is not much of an apology. You had 1 minute to post a "I am very busy right now with SO", but you did not. This implies a certain disdain...

Pardeep on October 15, 2008 01:51 PM

> After all, suppose a search engine came along that did everything Google did, and added a few brilliant features that we could really use. How would we find out about this wondrous product? Well, we'd... google... for...

How do you think we all learned about Google? Not from searching for "search engines" at whatever we each used prior to Google, I suspect.

Cuil launched not long ago, and immediately picked up a number of mentions in newspapers, blogs, forums, chat systems, watercooler conversations, and so forth. I heard about it. Odds are reasonably good that you did as well. If you didn't--well, you just did now.

Of course, Cuil turned out to be not very good--it collapsed under the load of its own launch buzz and produced some hilariously bad results to various search terms. Nobody is talking about them any more. So it goes. (They've improved since then, but they'll need to do something *really* special to overcome that bad first impression.)

The point is, Google's eventual replacement doesn't need to rely on Google to find users. The world is filled with ways of propagating information.

Damien Neil on October 15, 2008 03:17 PM

Jeff,

Googlebot CAN crawl all of your paginated pages. You need to use this format for your URLs:

http://stackoverflow.com/questions
http://stackoverflow.com/questions/page/2
http://stackoverflow.com/questions/page/3

It's pretty well known in SEO circles that Google never follows a link with GET arguments because it is dangerous for it to do so - following these links can cause unwanted script execution because many programmers do not follow HTTP standards and implement destructive queries using scripts that are accessed with GET.

For instance, you may have a poorly implemented command like

http://stackoverflow.com/admin.php?cmd=deleteEverything

If Google followed this kind of link it would cause obvious problems, so they don't.

The solution to this is to use a URL formatted query string. This is pretty simple using mod_rewrite, Django, Drupal, or any of a number of other frameworks.

If you make this simple change to how your article list pages are accessed, the Google crawler will suddenly see your entire site.
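The rewrite itself is just a mapping from one URL shape to the other. A minimal Python sketch, with the pattern and the function name `rewrite` invented for illustration (a real site would do this in mod_rewrite or its framework's router):

```python
import re

# Matches /questions/page/2 and /questions/page/2/
PAGE_PATH = re.compile(r"^/questions/page/(\d+)/?$")


def rewrite(path):
    """Map /questions/page/N to /questions?page=N; leave other paths alone."""
    match = PAGE_PATH.match(path)
    if match:
        return f"/questions?page={match.group(1)}"
    return path
```

This only shows the URL translation. Whether the path-style form changes how a crawler treats the pages is the claim being made here, not something the sketch demonstrates.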

Isaac Raway on October 15, 2008 07:14 PM

Oh my god! There's only one search engine and all the others have gone away! So if google starts giving unhelpful and useless search results, we'll have nowhere to turn to!

We're doomed!

Either that or we'll go back to yahoo when yahoo is slightly better than google again.

And the good news is that if the sum of all human knowledge amounts to YouTube and MySpace, it would merely be evidence that being able to access the sum of all human knowledge just isn't that important anyway.

Paranoid on October 15, 2008 09:37 PM

I'm not at all worried about google's dominance of the "web's homepage". If google were to disappear today, I doubt it would take very long at all for people to adapt - it's not like it's hard to use another search engine, there's virtually no cost to switching.

Much more problematic would be the disruption to all those sites that depend on adsense for a large portion of their income. If google has become indispensable, it's not for their search page, but for their services (mostly ads, but gmail and some other apps also spring to mind). It's much less easy to switch services, and their relative advantage in that field is also far greater.

On the topic of poor indexing performance: Given the occurrence of infinite sets of URLs (a bad paging control would suffice, as would the breadcrumb issue described by Dan) and broken urls with damaging side effects, and the necessity to actually pay for the servers used to store the indexes, it shouldn't be that surprising that the crawler is at least somewhat conservative on new sites - it's literally impossible to be perfect, since a naive crawler might damage the site and/or get lost in an infinite loop.
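To illustrate what "conservative" can mean in practice, here is a toy Python sketch of a breadth-first crawl that gives up after a page budget or a depth limit. It is not how Googlebot works; the limits and the `get_links` callback are assumptions made for the example.

```python
from collections import deque


def crawl(start, get_links, max_pages=1000, max_depth=5):
    """Breadth-first crawl that stops at a page budget and a depth limit.

    `get_links(url)` is a caller-supplied function returning the URLs linked
    from `url`; here it stands in for fetching and parsing a page.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links any deeper
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

A paging control that always offers a "next" link is an infinite URL set, and the two limits are what make the crawl terminate. The cost is that deep pages on a legitimate site get skipped too, which is the trade-off described above.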

Incidentally, contrary to what Isaac states, google does crawl urls with a query string (of course, if urls change all the time you'll have problems).

Eamon Nerbonne on October 16, 2008 01:54 AM

Isaac:

You are not correct. Here is an example of 50,000 pages on google that are indexed with a get-parameter:

http://www.google.com/search?q="Line+1%3A+Incorrect+syntax+near+"+filetype%3Aasp

Espo on October 16, 2008 04:54 AM

Great article on sitemaps. I have recently added a sitemap for my site and while I have over 10000 links, Google actually took about 200 pages out of their index leaving me with just over 1000 indexed pages. The website has been around for over a year now.

What would cause this? We even advertised with Adwords a time or two.

Thanks.

Jordan on October 16, 2008 02:55 PM

Jeff, on a semi-related note, if you added the title of the post to your URL on this blog (e.g. http://www.codinghorror.com/blog/archives/the-importance-of-sitemaps.html instead of http://www.codinghorror.com/blog/archives/001174.html), you would probably get a lot more search traffic. Google counts text in the URL very strongly.

Bob Hiler on October 16, 2008 05:49 PM

Well, well, well. Apparently I need to do more fact checking before I post well-known myths.

According to this article: <a href="http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html">http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html</a>; the problem may actually be the opposite of what I was saying: links that appear to be static, such as "<a href="http://stackoverflow.com/questions/24109/c-ide-for-linux">http://stackoverflow.com/questions/24109/c-ide-for-linux</a>", which is in fact a dynamic page with frequently changed content, may actually hurt their chances of being re-indexed. From the article:

"One recommendation is to avoid reformatting a dynamic URL to make it look static. It's always advisable to use static content with static URLs as much as possible, but in cases where you decide to use dynamic content, you should give us the possibility to analyze your URL structure and not remove information by hiding parameters and making them look static."

Using GET parameters instead of the 'permalink' slug style actually causes Google to automatically assume the page is more dynamic.

It looks like there may also be an issue with links like this, which represent the answers to questions: "<a href="http://stackoverflow.com/questions/24109/c-ide-for-linux#24119">http://stackoverflow.com/questions/24109/c-ide-for-linux#24119</a>"

It would seem that to Google, this appears to be a static marker into a static file. Therefore they may, according to this article, assume that the file hasn't changed. If anyone links to a URL like this one, Google will see it and decide that it does NOT need to look at the page again, because it "isn't" dynamic content. Of course if they parse out anchor tags for every indexed file and see in their database that #24119 wasn't there when they indexed it last, they might decide to take another look at the file. However, this behavior is unlikely as it goes against their recommendation to NOT make dynamic URLs appear to be static ones.
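One thing that can be said with certainty is that the part after the # is a fragment: a client does not send it to the server, so both URLs fetch the same page. How Google treats fragments when deciding what to re-crawl is a guess on my part. A small Python sketch of the split, using the standard library's `urldefrag` (the wrapper name `page_and_fragment` is made up):

```python
from urllib.parse import urldefrag


def page_and_fragment(url):
    """Split a URL into the part a client requests and its #fragment."""
    page, fragment = urldefrag(url)
    return page, fragment
```

For the answer link above this gives the question URL and the fragment "24119", so any tool that normalizes URLs this way would see every answer link as the same page as its question.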

I really doubt that a sitemap with ~27,000 entries is the intended usage pattern, and suspect strongly that there are gains to be made by changing the way that things are accessed. Google says so, and anyway a sitemap that large certainly isn't very _elegant_.

Isaac Raway on October 16, 2008 08:32 PM

The failure page for the "Enter the word" thing caused my links to be doubled up. Sorry about that. Might want to fix that bug too...

Isaac Raway on October 16, 2008 08:32 PM

Jeff,
I don't understand what you and some of the others have against google. The best I can make out is you must be following the old proverb "better to deal with the devil you know (Microsoft) than the devil you don't know (Google)". In my opinion that proverb does not apply in this case because... google is not a devil. I don't think anybody can seriously question that Microsoft has performed questionable acts. And they were found to be a monopoly which was being used abusively.

Can someone please give *1* example of where google has screwed any segment of the population? Did google tell you the only way to put SO on the web was to go through them? Did they *tell* you what software to use? Did they mandate your operating system? Development tools? What exactly have they done to screw you (Jeff) or anyone?

What I know is this:
1) I've been using them since 2000-2001.
2) They're good at searches for me. Very good. I'm a satisfied customer.
3) I like the e-mail services. I can use pop and I do so.
4) They don't spam me.
5) They don't try to tell me what OS to use, or what browser.
6) They don't install spyware on my box.
7) They don't charge me to do searches.

What exactly is there not to like? I can see why Microsoft is nervous -- they're not in the search game and will never be at this rate -- which means they don't get any advertising $$. And they're always afraid that someone like Google might push out an OS that makes Microsoft irrelevant for everything except maybe game consoles.

But what exactly have they done to you Jeff that makes you nervous?
You've worked with Microsoft products for almost 2 decades now. You know their business history and practices.

Microsoft doesn't make you nervous, but google does?!?!

Here's another perspective that might help you see who you should really be worried about: if tonight you suddenly decided to replace the whole Microsoft toolchain with, say, LAMP (Linux, Apache, MySQL, Perl/Python/PHP) and left the web content exactly the same, would Google call you all pissed off and screw you over? Would they even care? Google does not care how *you* get the content on *your* website, just that it's there and it's crawlable. Your content and your business are your own.

Johnny on October 19, 2008 02:33 PM

Hey, you can get a sitemap generated for you for free at http://www.bitbotapp.com
-h

Henry on November 10, 2008 01:50 PM

Content (c) 2008 Jeff Atwood. Logo image used with permission of the author. (c) 1993 Steven C. McConnell. All Rights Reserved.