Duplicate Text and SMX

Michael Gray


One of the people I finally had the chance to meet at SMX was Vanessa Fox of Google. We’ve been twittering for eva (ok maybe 6 months but on the internet that is forever) and I think we’re really close to becoming BFF.

Since I am somewhat critical of Google on a regular basis I thought it would be nice to point to some good things. I’m actually a really big advocate of Webmaster Central and not just in public or because Vanessa will pick this up because she’s ego surfing. You can even ask some of my darker shades of gray friends who will back me up on that.

Vanessa’s got a post up today (Official Google Webmaster Central Blog: Duplicate content summit at SMX Advanced) recapping some of the stuff they discussed at SMX. For the record I’d really like the ability to specify parameters to ignore in the URL, authenticate ownership, and get a duplicate content report.

Now since I’ve totally got your attention I’ll throw in a shameless question about a debate I’m having with a nameless colleague. If I put this in my robots.txt file:

User-agent: *
Disallow: */print/

would it block these files

http://www.example.com/foo/print/
http://www.example.com/bar/print/
http://www.example.com/foo/bar/print/
http://www.example.com/bar/foo/print/

and allow these

http://www.example.com/foo/
http://www.example.com/bar/
http://www.example.com/foo/bar/
http://www.example.com/bar/foo/
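
One way to reason about the question is to write the matching rule down. The sketch below is only an approximation of wildcard matching as Google describes it, not Googlebot's actual code: it assumes `*` stands for any run of characters (including none), a trailing `$` pins the end of the path, and a rule is matched from the start of the URL path. The names `rule_to_regex` and `is_blocked` are made up for illustration.

```python
import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Turn a wildcard Disallow value into a regex (assumed semantics, see above)."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces, then join them with '.*' where each '*' was.
    body = ".*".join(re.escape(piece) for piece in rule.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_blocked(url: str, rule: str) -> bool:
    """True if the URL's path matches the rule from the start of the path."""
    path = urlsplit(url).path or "/"
    return rule_to_regex(rule).match(path) is not None


RULE = "*/print/"

should_block = [
    "http://www.example.com/foo/print/",
    "http://www.example.com/bar/print/",
    "http://www.example.com/foo/bar/print/",
    "http://www.example.com/bar/foo/print/",
]
should_allow = [
    "http://www.example.com/foo/",
    "http://www.example.com/bar/",
    "http://www.example.com/foo/bar/",
    "http://www.example.com/bar/foo/",
]

for url in should_block + should_allow:
    print("blocked" if is_blocked(url, RULE) else "allowed", url)
```

Under those assumptions the first four URLs come out blocked and the last four allowed. That only describes crawlers that honor wildcards, so it is worth confirming against the search engine's own testing tool before relying on it.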


{ 6 comments }

JLH June 13, 2007 at 2:30 pm

Use that handy-dandy webmaster central robots.txt testing tool to find out for sure before it goes live.

I’ve used that before to block out feeds from being indexed. Because there is nothing more irritating than searching for something, clicking the result, and finding it’s a feed and not the real page.

John June 13, 2007 at 2:41 pm

While we’re in the robots.txt-mood, how do you prevent your own robots.txt (or sitemap-file) from being indexed? :-)

John June 13, 2007 at 2:48 pm

You might also want to refine that to:
Disallow: */print/$
… if you have pages like http://www.example.com/print/inkjet-cartridges/ that should be indexed.

However, a wildcard rule under
User-agent: *
will not be fully followed. Only Google and Yahoo (and Ask?) support wildcards in the Disallow line; all other engines stick to the “standard” and read the value as a literal path prefix. Then again, not all engines will actually crawl all pages anyway, so it might not make much of a difference.
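
To make the two points above concrete, here is a rough sketch that compares a wildcard-aware matcher with a plain prefix matcher. It assumes `*` means any run of characters, a trailing `$` anchors the end of the path, and that an engine without wildcard support treats the Disallow value as a literal prefix. Both function names are hypothetical, and no real crawler is claimed to work exactly this way.

```python
import re


def wildcard_match(rule: str, path: str) -> bool:
    """Assumed wildcard semantics: '*' is any run of characters, trailing '$' pins the end."""
    anchored = rule.endswith("$")
    core = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(piece) for piece in core.split("*"))
    if anchored:
        pattern += r"\Z"
    return re.match(pattern, path) is not None


def prefix_match(rule: str, path: str) -> bool:
    """Assumed non-wildcard behavior: the rule is a literal path prefix."""
    return path.startswith(rule)


paths = ["/foo/print/", "/print/inkjet-cartridges/"]

for rule in ("*/print/", "*/print/$"):
    for path in paths:
        print(
            f"{rule:<10} {path:<28}"
            f" wildcard={wildcard_match(rule, path)}"
            f" prefix={prefix_match(rule, path)}"
        )
```

In this sketch, `*/print/` also catches `/print/inkjet-cartridges/`, while `*/print/$` only catches paths that end in `/print/`. The prefix matcher catches none of them, because no path starts with a literal `*`.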

Glen June 13, 2007 at 3:01 pm

Ken Savage June 13, 2007 at 3:20 pm

Excellent question, Michael. I had a similar problem with a client where I had to add nofollow to their print page button. I wanted to keep the markup clean by just adding a rule to robots.txt instead.

Come on Vanessa Fox! Help us out by throwing up a comment here on Gray Wolf’s blog and curb his ego.

shor June 13, 2007 at 9:13 pm

Cool. Never knew Google allowed pattern matching with the ‘$’ character.
Thanks for providing my ‘nugget of gold’ for the day!
