All About Content

Forcing Googlebot to Observe the Sabbath

Posted by Melanie Phung on Friday, November 10, 2006 at 1:18 pm

Just when it starts to seem like I haven’t learned anything interesting in a while, I come across a thread called “Cloaking for Religious Reasons.”

Is there ever a good reason to engage in cloaking for the purpose of fooling Google? Even if God insists?

Turns out no: the problem being discussed is better solved a different way. (The problem: having to take an e-commerce site down in observance of the Sabbath while keeping search engine spiders from replacing the entire site in their indexes with the store’s “we’re currently closed” page. The solution: returning 503 “Service Unavailable” errors while the store is closed.)

Now I finally understand why B&H Photo wouldn’t let me place orders on their site at various times in the past. Turns out it wasn’t random… it was Saturday!
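
For anyone wondering what “returning 503 errors” would actually look like, here’s a minimal sketch (my own illustration, not something from the thread): a tiny Python WSGI app that answers every request with a 503 and a Retry-After header during a hypothetical all-day-Saturday closure, so spiders keep the real pages in their indexes and simply come back later.

    # Minimal sketch (illustrative only): answer with "503 Service Unavailable"
    # during a hypothetical all-day-Saturday closure so crawlers retry later
    # instead of replacing indexed pages with a "we're closed" page.
    from datetime import datetime
    from wsgiref.simple_server import make_server

    def store(environ, start_response):
        if datetime.now().weekday() == 5:  # 5 == Saturday
            start_response("503 Service Unavailable",
                           [("Content-Type", "text/plain"),
                            ("Retry-After", "86400")])  # hint: retry in a day
            return [b"The store is closed for the Sabbath. Please come back later."]
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Welcome to the store."]

    if __name__ == "__main__":
        make_server("", 8000, store).serve_forever()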


Comments (1)

Category: Cloaking,Google,Googlebot

More on Del.icio.us and ‘noindex’

Posted by Melanie Phung on Thursday, August 3, 2006 at 10:25 pm

As regular readers may know, I’m endlessly fascinated by how del.icio.us pages end up ranking well in search results, considering each page carries robots noindex and noarchive instructions. About two weeks ago, I noticed that the snippet for the result (in Google) had changed. Whereas before it displayed only the URL, it was now also displaying text from within the page. (Compare this to what the same search result looked like earlier.)

Does this mean Google was not ranking the page based only on a “guess” about its relevance derived from the combination of domain and URL? Consider: it had to actually crawl the del.icio.us page to display this snippet. It’s reasonable to assume that if it’s displaying the snippet text, it’s also reading and storing it somehow.

[So if "noindex,nocache,nofollow" together don't mean "don't crawl"... is there a robots tag (not including a robots.txt file) that would instruct a spider not to read the content at all?]
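
As best I can tell, the answer is no: an on-page robots tag can only be honored after the page has been fetched and parsed. Here’s a quick standard-library Python sketch of a generic polite crawler (my illustration, not Google’s actual pipeline) that shows why the content is necessarily read before “noindex” can take effect:

    # Illustrative sketch: a crawler discovers a robots meta tag only by
    # fetching and parsing the page, so "noindex" cannot stop the content
    # from being read; it can only stop it from being indexed or shown.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class RobotsMetaParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.directives = set()

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "robots":
                self.directives |= {d.strip().lower()
                                    for d in attrs.get("content", "").split(",")}

    # Hypothetical URL; any page with a robots meta tag works the same way.
    html = urlopen("https://example.com/").read().decode("utf-8", "replace")
    parser = RobotsMetaParser()
    parser.feed(html)                   # the content has already been read...
    if "noindex" in parser.directives:  # ...before the directive can be honored
        print("Don't index this page (but we had to read it to learn that).")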

Social bookmarking sites like del.icio.us add the noindex robots instruction to discourage SEOs from gaming the site. The idea is that no one would bother posting links that aren’t bookmark-worthy solely for the “link juice,” because those links are not supposed to “count” (for link weight, that is; they obviously still count for traffic). But I also thought noindex and nocache were supposed to prevent Google and other search engines from displaying snippets from the page, and that assumption has now been proved wrong.

If the del.icio.us page for a specific tag (your company’s name, for example) has PageRank value, ranks well for that keyword in a Google search, and lists your site at the top of the page, then in light of this snippet being displayed it becomes harder to believe there is no inbound-link (IBL) value in making sure your site is frequently del.icio.us’d.

Comments (4)

Category: Googlebot,Social Media

You Can Hide from Googlebot…

Posted by Melanie Phung on Sunday, April 30, 2006 at 12:24 am

… but you can’t hide from Google’s bots. Google has confirmed that it’s using multiple spiders to feed crawl results into Bigdaddy. In particular, the AdSense Mediapartners bot (a.k.a. mediabot) is caching pages for the natural search index.

Jenstar points out:

It could definitely be used as a tool to detect when content is being cloaked for either the Google or AdSense bot, particularly since the mediapartners bot has been indexing pages since at least the beginning of February.

Who knows how many differently named spiders Google has doing recon like this. It would definitely make the old black-hat IP cloaking trick a little trickier.
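
To make Jenstar’s point concrete, here’s a rough sketch (my own illustration, not Google’s actual mechanism) of how comparing what a site serves to two differently named bots could flag user-agent cloaking. Note that it only catches user-agent tricks; IP-based cloaking requires probes from unexpected IP ranges, which is exactly why crawls from Google’s other bots make that trick riskier:

    # Rough sketch (illustrative only): fetch the same URL under two of
    # Google's crawler User-Agent strings and compare the responses; a
    # mismatch is a strong hint that the page is cloaking by user agent.
    from urllib.request import Request, urlopen

    URL = "https://example.com/"  # hypothetical page under suspicion
    AGENTS = {
        "googlebot": "Googlebot/2.1 (+http://www.google.com/bot.html)",
        "mediabot": "Mediapartners-Google",
    }

    bodies = {}
    for name, user_agent in AGENTS.items():
        request = Request(URL, headers={"User-Agent": user_agent})
        bodies[name] = urlopen(request).read()

    if bodies["googlebot"] != bodies["mediabot"]:
        print("Different content served to each bot: possible cloaking.")
    else:
        print("Same content served to both bots.")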

More info on the crawl caching proxy on Matt Cutts’s site.

Comments Off

Category: Cloaking,Googlebot

Google’s Interpretation of ‘noindex’

Posted by Melanie Phung on Tuesday, April 11, 2006 at 11:16 am

I asked around regarding my observation that Google displays pages in results even if they use the robots noindex meta tag, and someone pointed me toward Matt Cutts’s March 17 blog post titled Googlebot Keep Out:

You might wonder why Google will sometimes return an uncrawled url reference, even if Googlebot was forbidden from crawling that url by a robots.txt file. There’s a pretty good reason for that: back when I started at Google in 2000, several useful websites (eBay, the New York Times, the California DMV) had robots.txt files that forbade any page fetches whatsoever. Now I ask you, what are we supposed to return as a search result when someone does the query [california dmv]? We’d look pretty sad if we didn’t return www.dmv.ca.gov as the first result. But remember: we weren’t allowed to fetch pages from www.dmv.ca.gov at that point. The solution was to show the uncrawled link when we had a high level of confidence that it was the correct link. Sometimes we could even pull a description from the Open Directory Project, so that we could give a lot of info to users even without fetching the page. I’ve fielded questions about Nissan, Metallica, and the Library of Congress where someone believed that Google had crawled a page when in fact it hadn’t; a robots.txt forbade us from crawling, but Google was able to show enough information that someone assumed the page had been crawled. Happily, most major websites (including all the ones I’ve mentioned so far) let Google into more of their pages these days.

That makes great sense in theory, but what Google is telling users is that it thinks (it’s guessing, really) that a page it hasn’t even looked at is very relevant… and not just very relevant, but more relevant than all the other pages it has actually indexed. It’s one thing if they dig down deep on searches that don’t yield very many results, but to list these types of pages on the first page of results for searches with tens of thousands (or more) of results is just odd.

Never mind that one would think a “noindex” robots meta tag means the search engine won’t index the URL (not just that it won’t index the page’s content). Okay, so the page will still show up in the index. And while Googlebot didn’t technically crawl the page, Google will go ahead and return it in results based on… on what? Keywords in the URL?
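
The mechanics, at least, are straightforward even if the ranking decision isn’t: robots.txt is checked before any fetch, so a disallowed page is never read, and Google is left with only the URL itself plus off-page signals like anchor text. A small standard-library sketch of that pre-fetch check (my illustration):

    # Sketch of the pre-fetch robots.txt check: a disallowed URL is never
    # read, so the engine knows only the URL reference and off-page signals.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    # Hypothetical robots.txt forbidding all fetches, like the old DMV site:
    robots.parse(["User-agent: *", "Disallow: /"])

    url = "https://www.dmv.ca.gov/"
    if not robots.can_fetch("Googlebot", url):
        # No crawl happens; the uncrawled URL can still be returned in
        # results if links and anchor text make it look like the answer.
        print("Fetch forbidden: only the bare URL reference is available.")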

I’m not sure what the takeaway is here (because this makes no sense), except that if you have a webpage you don’t want users to find, you shouldn’t rely on robots exclusions to keep it from showing up in the results. (Well, actually: don’t post anything on the Internet that you wouldn’t want people to find.)

Updated June 19: Looks like Google is indeed indexing pages that are tagged “nofollow.” See this recent Webmaster World discussion.

Comments Off

Category: Google,Googlebot

Google Not Honoring ‘noindex’?

Posted by Melanie Phung on Monday, April 10, 2006 at 8:25 pm

Can anyone tell me what’s wrong with this picture?

It’s what appears on the first page of Google results today if you do a search on the term “technorati.” It’s a link to the del.icio.us page of items tagged “technorati.” But del.icio.us pages all use <meta name="robots" content="noarchive,nofollow,noindex">. In other words, the robots instructions on the page tell the search engines not to index the page! Noindex means it shouldn’t show up in search results.

What gives? Has Google started ignoring noindex?

Two thoughts: 1) Get ready for some aggressive del.icio.us tag spamming, and 2) how do we avoid getting in trouble for duplicate content if we can’t keep Google from indexing dupe pages using the standard robots exclusion?

Update: This question was answered in my subsequent post, Google’s Interpretation of ‘noindex’.

Updated June 19: Looks like Google is indeed indexing pages that are tagged “nofollow.” See this recent Webmaster World discussion.

Comments Off

Category: Google,Googlebot