One of the people I finally had the chance to meet at SMX was Vanessa Fox of Google. We’ve been twittering for eva (ok maybe 6 months but on the internet that is forever) and I think we’re really close to becoming BFF.
Since I am somewhat critical of Google on a regular basis I thought it would be nice to point to some good things. I’m actually a really big advocate of Webmaster Central and not just in public or because Vanessa will pick this up because she’s ego surfing. You can even ask some of my darker shades of gray friends who will back me up on that.
Vanessa’s got a post up today (Official Google Webmaster Central Blog: Duplicate content summit at SMX Advanced) recapping some of the stuff they discussed at SMX. For the record I’d really like the ability to specify an ignore parameter in the URL, authenticate ownership, and get a duplicate content report.
Now since I’ve totally got your attention I’ll throw in a shameless question about a debate I’m having with a nameless colleague. If I put this in my robots.txt file
User-agent: *
Disallow: */print/
would it block these files
http://www.example.com/foo/print/
http://www.example.com/bar/print/
http://www.example.com/foo/bar/print/
http://www.example.com/bar/foo/print/
and allow these
http://www.example.com/foo/
http://www.example.com/bar/
http://www.example.com/foo/bar/
http://www.example.com/bar/foo/
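For anyone who wants to reason about the rule before it goes live, here is a minimal Python sketch of wildcard matching as Google documents it, where `*` stands for any run of characters and a trailing `$` marks the end of the URL. The names `rule_to_regex` and `is_blocked` are made up for illustration, the sketch looks only at the URL path, and real crawlers may behave differently, so treat it as a thinking aid rather than a substitute for a proper robots.txt tester.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule):
    """Translate a Disallow pattern with '*' and optional trailing '$' into a regex.

    Hypothetical helper: assumes Google-style wildcard semantics.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        body += r"\Z"
    return re.compile(body)


def is_blocked(url, rule):
    """Return True if the URL's path matches the rule from its first character."""
    path = urlsplit(url).path or "/"
    return rule_to_regex(rule).match(path) is not None


RULE = "*/print/"

should_block = [
    "http://www.example.com/foo/print/",
    "http://www.example.com/bar/print/",
    "http://www.example.com/foo/bar/print/",
    "http://www.example.com/bar/foo/print/",
]
should_allow = [
    "http://www.example.com/foo/",
    "http://www.example.com/bar/",
    "http://www.example.com/foo/bar/",
    "http://www.example.com/bar/foo/",
]

for url in should_block + should_allow:
    print(url, "blocked" if is_blocked(url, RULE) else "allowed")
```

Under those assumed semantics the four /print/ URLs come out blocked and the other four come out allowed.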
{ 6 comments }
Use that handy-dandy Webmaster Central robots.txt testing tool to find out for sure before it goes live.
I’ve used that before to block out feeds from being indexed. Because there is nothing more irritating than searching for something, clicking the result, and finding it’s a feed and not the real page.
While we’re in the robots.txt-mood, how do you prevent your own robots.txt (or sitemap-file) from being indexed?
You might also want to refine that to:
Disallow: */print/$
… if you have pages like http://www.example.com/print/inkjet-cartridges/ that should be indexed.
However, a wildcard rule under
User-agent: *
will not be fully followed. Only Google + Yahoo (and Ask?) support wildcards in the Disallow line. All other engines stick to the “standard”. However, not all engines will actually crawl all pages anyway, so it might not make much of a difference.
http://www.webmasterworld.com/robots_txt/3295267.htm
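To make the two points in this comment concrete, here is a rough Python sketch comparing three readings of the rule: the wildcard form, the `$`-anchored form, and a strict literal-prefix reading. The function names `wildcard_blocks` and `prefix_blocks` are invented for this example, and the prefix-only behaviour is an assumption about how a crawler that ignores wildcards might treat the line, not a description of any particular engine.

```python
import re


def wildcard_blocks(path, rule):
    """Assumed Google-style matching: '*' is any run of characters,
    and a trailing '$' means the path must end there."""
    anchored = rule.endswith("$")
    core = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(piece) for piece in core.split("*"))
    if anchored:
        pattern += r"\Z"
    return re.match(pattern, path) is not None


def prefix_blocks(path, rule):
    """Assumed strict behaviour: the rule is a literal path prefix,
    so '*' is just an ordinary character."""
    return path.startswith(rule)


for path in ["/foo/print/", "/print/inkjet-cartridges/"]:
    print(
        path,
        "| */print/ :", wildcard_blocks(path, "*/print/"),
        "| */print/$ :", wildcard_blocks(path, "*/print/$"),
        "| literal prefix:", prefix_blocks(path, "*/print/"),
    )
```

In this sketch the unanchored rule also catches /print/inkjet-cartridges/, the `$` version lets it through, and the literal-prefix reading blocks neither path because no path starts with an asterisk.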
Excellent question, Michael. I had a similar problem with a client where I had to add nofollow to their print page button. I would rather have kept the markup clean by just adding a rule to robots.txt.
Come on Vanessa Fox! help us out by throwing up a comment here on Gray Wolf’s blog and curb his ego.
Cool. Never knew Google allowed pattern matching with the ‘$’ character.
Thanks for providing my ‘nugget of gold’ for the day!