Posted by Melanie Phung on Tuesday, April 11, 2006 at 11:16 am
I asked around regarding my observation that Google is displaying pages in results even if they use the robots noindex meta-tag, and someone pointed me toward Matt Cutts’ March 17 blog post titled Googlebot Keep Out:
You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page. I’ve fielded questions about Nissan, Metallica, and the Library of Congress where someone believed that Google had crawled a page when in fact it hadn’t; a robots.txt forbade us from crawling, but Google was able to show enough information that someone assumed the page had been crawled. Happily, most major websites (including all the ones I’ve mentioned so far) let Google into more of their pages these days.
That makes great sense in theory, but what Google is telling users is that it thinks, it’s guessing, that this page which it hasn’t even looked at is very relevant… and not just very relevant but more relevant than all the other pages it has actually indexed. It’s one thing if they dig down deep on searches that don’t yield very many results, but to list these types of pages on the first page of results on searches that have tens of thousands (or more) results is just odd.
Never mind that one would think a “noindex” robots meta tag means the search engine wouldn’t index the URL (not just that it wouldn’t index the page’s content). Okay, so the page will still show up in the index. And while Googlebot didn’t technically crawl the page, it will go ahead and return it in results based on… on…? keywords in the URL? What?
I’m not sure what the take-away is here (because this makes no sense); except that if you have a webpage that you don’t want users to find, don’t rely on robots exclusions to keep your page from showing up in the results (well, actually, don’t post anything you wouldn’t want people to find on the Internet).
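For anyone who wants to see the two kinds of robots exclusion side by side, here is a rough Python sketch using only the standard library. It is an illustration, not a description of how Googlebot actually works. The robots.txt rules, the page HTML, the `ExampleBot` user agent, the example.com URLs, and the helper names `may_crawl` and `meta_directives` are all made up for this example. What it shows is that a robots.txt rule answers "may this URL be fetched?", while a noindex directive lives inside the page itself, so a crawler can only read it if it fetches the page.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical page that carries a robots meta tag (also an assumption).
PAGE_HTML = """\
<html><head>
<meta name="robots" content="noindex, nofollow">
<title>Private page</title>
</head><body>Hidden</body></html>
"""


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def may_crawl(robots_txt, user_agent, url):
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def meta_directives(html):
    """Return the set of robots meta directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    url = "http://example.com/private/page.html"
    if may_crawl(ROBOTS_TXT, "ExampleBot", url):
        # Only a crawler that fetches the page can read its meta tag.
        print("fetched; directives:", sorted(meta_directives(PAGE_HTML)))
    else:
        # The page is never fetched, so any noindex tag in it goes unseen.
        print("blocked by robots.txt; meta tag never read")
```

If a page is both disallowed in robots.txt and tagged noindex, a crawler that obeys robots.txt would never get as far as reading the tag. Whether that is what happened with the pages I noticed is only a guess on my part.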
Updated June 19: Looks like Google is indeed indexing pages that are tagged “no follow.” See this recent Webmaster World discussion.