Google Blog Crawl … Yeah Right

August 6th, 2007 by Michael Gray in Google



I’m sure there’s a perfectly good reason why Google puts data from AdSense crawls into the index right away, but not data from blog ping crawls. I mean, the reason they do it with AdSense is to “reduce bandwidth”, so not applying the same logic to blog crawls is completely logical, isn’t it?

Really, just think how much bandwidth they could save themselves and you by making that change … provided, of course, that “saving bandwidth” was really the ultimate goal and not a nice bit of window dressing to distract you from a different priority …

Of course someone will point out that blog search draws on a different data set, but if they are smart enough to incorporate AdSense crawl data, they are certainly smart enough to incorporate blog crawl data, especially with all the “bandwidth savings” they would realize.



4 Responses to “Google Blog Crawl … Yeah Right”

  1. Barry Welford Says:

    Good point, Michael. You mention that BlogSearch works with a different data set, which is true. However, given your and others’ concerns about WordPress and feeds creating duplicate content that leads to banishment to the Supplemental Index, it looks as though that ‘different data set’ is also incorporated in the main data set. Given that, I have raised the question, Should Google Have Smarter Robots? If all this is based on my lack of understanding, it would be useful to hear more from Google on this.

  2. Jay Harper Says:

    The ‘blog crawl’ is a crawl of feeds, whereas AdSense’s crawl is a crawl of actual pages. This means the AdSense crawl is essentially identical to the main googlebot crawl, so the two can be used interchangeably, whereas the feed crawl is fundamentally different since it’s a crawl of feeds, not pages.

    What would happen if you put a URL in your feed for a page that already existed in Google’s index, and what the feed said about the page differed from what the main googlebot crawl had found previously? Who’s to be believed? Now imagine you put out a completely specious feed with links to your competitor’s pages and hosted it on a third-party site like Feedburner…

    Feeds are not pages - Google’s main index is an index of pages, not feeds… The feeds have to be validated against the page itself, hence the delay getting into the main index.

  3. Michael Gray Says:

    Well, if people published full feeds like they should anyway, and pinged Google first, it would have the content to add to the search index and would also know who the originator was. Even if someone is scraping your feed, they could never get it scraped, republished, and into the ping/crawl before you did. That’s actually an even more ideal solution.

  4. Jay Harper Says:

    What if I put out a feed that said this page was about kiddie porn, gambling, or free viagra? It wouldn’t be hard to do… The point is feeds aren’t the authoritative sources - pages are.

    In time Google will figure out authority as it relates to feeds and things will be better, but that time hasn’t come yet.