Recently on this blog I went through the process of moving from a dynamic URL structure to a static-looking URL structure (i.e. example.com?p=100 to example.com/foo/). Along the way I learned a little bit more about how WordPress works and discovered a way to use this to create duplicate content on someone else’s WordPress blog.
Before we get into the technical nitty gritty (we’ll get there, I promise) let’s take a look at duplicate content. Duplicate content comes in two distinct flavors, internal and external. External duplicate content could be defined as having the same copy existing on more than one domain; that’s not what we’re talking about in this post. The second type is internal duplication: the same content on more than one page of your website (i.e. example.com/foo/ and example.com/bar/). According to the Google Webmaster Guidelines that’s not a good thing:
Don’t create multiple pages, subdomains, or domains with substantially duplicate content.
Now that we’ve got that covered, let’s get into the details of how to get it done. For an example we’re going to use Google’s very own spam assassin, Matt Cutts. Let’s take a look at this URL:
http://www.mattcutts.com/blog/my-name-is-inigo/
In it we see Matt dressed as Inigo Montoya, off to fight evil spammers. However, the exact same content can be found at these URLs:
http://www.mattcutts.com/blog/?p=75
http://www.mattcutts.com/blog/index.php?p=75
OK, let’s take one step back. WordPress has a feature that allows you to change the dynamic URLs you normally serve into more search-engine-friendly URLs. However, when you activate the feature there isn’t a way to turn off the dynamic URL structure. So what’s the deal, is this really bad? For the answer let’s look to Matt Cutts’ blog:
Canonicalization is the process of picking the best url when there are several choices, and it usually refers to home pages. For example, most people would consider these the same urls:
* www.example.com
* example.com/
* www.example.com/index.html
* example.com/home.asp
But technically all of these urls are different. A web server could return completely different content for all the urls above. When Google “canonicalizes” a url, we try to pick the url that seems like the best representative from that set.
Q: So how do I make sure that Google picks the url that I want?
A: One thing that helps is to pick the url that you want and use that url consistently across your entire site. For example, don’t make half of your links go to http://example.com/ and the other half go to http://www.example.com/ . Instead, pick the url you prefer and always use that format for your internal links.
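The advice to pick one URL and use it consistently can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the preferred host, the bare host, and the list of index file names are hypothetical choices made for the example, not anything Google or WordPress prescribes.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical choices for this sketch: which host is preferred and
# which file names count as "the home page".
PREFERRED_HOST = "www.example.com"
BARE_HOST = "example.com"
INDEX_FILES = {"index.html", "index.php", "home.asp"}


def canonicalize(url: str) -> str:
    """Rewrite a URL to the single preferred form used for internal links."""
    # Bare forms such as "example.com/" have no scheme, so add one for parsing.
    if "://" not in url:
        url = "http://" + url
    scheme, host, path, query, _fragment = urlsplit(url)
    host = host.lower()
    # Treat the bare domain and the www host as the same site.
    if host == BARE_HOST:
        host = PREFERRED_HOST
    # Drop a trailing index file so /index.html and / collapse together.
    head, _, tail = path.rpartition("/")
    if tail in INDEX_FILES:
        path = head + "/"
    if not path:
        path = "/"
    return urlunsplit((scheme, host, path, query, ""))
```

Run over the four home page variants from the quote, this returns the same string for each, which is the property you want your internal links to have.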
All right, so you overly nitpicky readers will say, hey GW, isn’t this a canonicalization problem, and not a duplicate content one? Yeah, I guess you could say so, but would you have read this far if the post was titled “How to Create a Canonicalization Problem on Someone Else’s WordPress Blog”? Now I’m pretty sure folks at Google are able to sort out that example.com?p=100 and example.com/?p=100&c=more are the same URL. They may even be able to sort out that example.com/?p=100 and example.com/index.php?p=100 are the same, although I wouldn’t put it into practice on a website I cared about. However, I think showing them http://www.mattcutts.com/blog/my-name-is-inigo/ and http://www.mattcutts.com/blog/index.php?p=75 can create problems.
So now that I’ve linked to Matt’s blog and caused the spiders to find duplicate content, will his site sink to the nether regions of the supplemental index, never to be seen again? I think there’s a little more to it than that. To see any real difference you’d have to link to more than one page in that manner, and to get the job done right you’d want to make sure every page existed under as many URLs as possible. It should take anyone with any sort of programming skills about 15 minutes, even if they typed with one hand tied behind their back, so it’s not a tool only the elite black hat programmers have access to.
Here’s another thing: some of you may have already inadvertently done this to yourselves. If you were serving dynamic URLs, switched to static URLs, and didn’t use any mod_rewrite rules, your content is probably sitting out there under two URLs now.
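If you want to check whether your own blog has ended up in that state, a rough audit might look like the Python sketch below. It is a minimal illustration, not a finished tool: the function names are made up for this example, the `responses` dictionary stands in for whatever you use to fetch pages, and comparing bodies byte for byte is a simplification, since real pages often differ in small ways.

```python
import hashlib
from collections import defaultdict


def wordpress_url_variants(blog_root: str, slug: str, post_id: int) -> list[str]:
    """List the URL forms discussed in this post for a single entry."""
    root = blog_root.rstrip("/")
    return [
        f"{root}/{slug}/",
        f"{root}/?p={post_id}",
        f"{root}/index.php?p={post_id}",
    ]


def find_duplicates(responses: dict[str, tuple[int, str]]) -> list[list[str]]:
    """Group URLs that answer 200 with identical bodies.

    `responses` maps URL -> (HTTP status, body). URLs that redirect (301)
    are skipped because they do not serve a second copy of the page.
    """
    groups = defaultdict(list)
    for url, (status, body) in responses.items():
        if status == 200:
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            groups[digest].append(url)
    # Only groups with more than one URL are duplicates.
    return [urls for urls in groups.values() if len(urls) > 1]
```

Any group that comes back with more than one URL is a post being served in more than one place.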
So is this really a problem, or have I just created this big bogey man? Well since Matt’s on vacation we can’t expect a clarification from him on the issue. However we’ll try dropping a link to the Google Sitemaps Blog and we’ll invoke the name of Adam Lasnik and see if he wanders by and can shed some light on the issue.
Related Information
- Google Webmaster Guidelines
- Matt Cutts: » SEO advice: url canonicalization
- mod_rewrite Cheat Sheet – Cheat Sheets – ILoveJackDaniels.com
- URL Rewriting | redirecting URLs with Apache’s mod_rewrite
- Duplicate Content Observation
{ 17 comments }
Doesn’t robots.txt solve this issue by doing:
User-agent: *
Disallow: *.php
User-agent: Googlebot
Disallow: *.php
?
I tried Google’s robots.txt validation tool in Sitemaps and it seems to be working well for this kind of case.
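To see which of the URLs from the post a rule like that would actually cover, here is a small Python sketch of wildcard matching. It assumes the semantics where `*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is a prefix match; not every crawler supports wildcards, and the helper `rule_matches` is a name invented for this example.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path plus query string.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end, and the pattern otherwise matches as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' stand for ".*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None
```

Under those assumptions, `*.php` covers `/blog/index.php?p=75` but not `/blog/?p=75`, so the rule on its own would leave one of the dynamic forms crawlable.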
301 redirects are your friends:
http://code.mincus.com/?p=3 -> http://code.mincus.com/3/adsense-notifier/
When I made the change there and on one other site, it took a little less than a month for the PR juice and search engine listings to switch over to only displaying the new structure. (see: http://code.mincus.com/10/url-schema-change-effect-on-unique-hits/ )
Without the redirect, I’d be afraid that the search engines wouldn’t know the two are connected and you’d lose some of the goodness you’ve built up from links.
I should probably put up a post with the rewrite map I used to do this for wordpress, let me know if you’d like to see it.
I’ve put some blocks in my robots.txt in the past to try and divert this – often (WP specific) you want to keep the bots out of the /page/…/ stuff that WP offers too.

In theory, htaccess entries could be written to put the bots back in the right place, but – geez, it would be easy to send it in circles
It would be nice if WP offered a switch to allow these ‘extras’ to be turned off. Do we have anyone in WP developer country with a strong understanding of SEO and what these issues imply?
>max
might work but you’d have to be absolutely positive you never wanted a .php file indexed. They also have a nasty habit of crawling those pages and just not listing them, who knows if they compare them for duplication or not
>mincus
does the code work every time you create a new post automatically? If so I’d love to see how you’re doing it
>lea
yea it’s really easy to set things off kilter
http://code.mincus.com/29/wordpress-permalink-redirects/ is my quick writeup on how I do it. Although, I just recently ran across this plug-in http://fucoder.com/code/permalink-redirect/ – which seems like a lot easier to get going – although you miss out on all the fun of messing with mod_rewrite :-p
this plugin does proper 301 redirects to the permalink:
http://fucoder.com/code/permalink-redirect/
it works for url’s missing the trailing slash, like:
/my-name-is-inigo
and also for dynamic url’s like these:
/?p=75
/index.php?p=75
using the plugin, they’ll all be 301′ed to
/my-name-is-inigo/
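The behavior described above can be sketched in Python as a small lookup. This is not the plugin’s code, just an illustration of the same idea under stated assumptions: `PERMALINKS` is a hypothetical table that a real blog would fill from its database, and `redirect_target` is a name made up for this example.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Hypothetical lookup table of post ID -> permalink path.
PERMALINKS = {75: "/my-name-is-inigo/"}


def redirect_target(request_uri: str) -> Optional[str]:
    """Return the permalink to 301 to, or None if no redirect is needed."""
    parts = urlsplit(request_uri)
    query = parse_qs(parts.query)
    # Dynamic forms: /?p=75 and /index.php?p=75
    if parts.path in ("/", "/index.php") and "p" in query:
        try:
            post_id = int(query["p"][0])
        except ValueError:
            return None
        return PERMALINKS.get(post_id)
    # Missing trailing slash: /my-name-is-inigo
    with_slash = parts.path + "/"
    if with_slash in PERMALINKS.values():
        return with_slash
    # Already the permalink, or an unknown path: leave it alone.
    return None
```

Returning `None` for the permalink itself is what keeps the redirect from looping back on its own target.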
I had the same problems when I switched my URLs over. Nice plugins.
I hadn’t worried about it much, as it is my personal blog, and I am not too concerned with how she performs in the engines. However, I have never seen any issues with any type of dupe content penalties running it without any of the rewrites.
Well, is it duplicated content from ANOTHER site or from the same site? I could have a billion pages showing the same content on the same domain and still get all billion indexed. I guess it’s more a matter of domain name than the fact that you can have the same page on your website.
Strange then – if the duplicate content issue is of concern, then why have the older blog platforms not destroyed many a blog?
My opinion is that duplicating your pages ONCE in your own domain won’t hurt you with the algos.
What Google means by dup content is 300 or so pages exactly the same except one or two words are changed.
I think the whole concern is not a concern. My Wordpress blogs are doing just fine?
BLOGGERS: This isn’t something I’d spend much time worrying about. If anything, simply endeavor to keep your own internal linking as consistent as possible (as Matt and others have wisely noted in the past). For instance, if you include permalinks to your entries in two different places (e.g., linked from the title and then in the entry footer), make sure they’re linking to exactly the same URL.
SPAMMERS: ’tisn’t worth your time
* * *
I think the issues discussed in this entry are interesting; however, here are some things to note:
- The “duplicate pages” are within the same domain.
- There’s no absolute penalty on pages perceived as duplicates. We do our best to display one copy of each relevant page to each search query. Understandably, even if content “x” is repeated twice on one domain and three times on another domain, it’s still (ideally) going to be shown just once to the user. That’s not a penalty, that’s optimal selection. There’s a difference.
- We do our best to determine intent. As you can imagine, we’re quite aware of tools like Movable Type, WordPress, etc. There’s a difference between a user intentionally creating duplicate content (to “fool” search engines), malicious pranksters referring to multiple versions of the same content on someone else’s page, and inadvertent duplicate content created via templates included with or defaults of a tool.
- Our duplicate content filters don’t work in a “binary” fashion. There are (and you can appreciate this) shades of gray (near-duplicate content, for instance).
- Lastly, there are many instances in which near-duplicate content is created by default (e.g., with various forum software’s “minimal” or “mobile” versions). This isn’t likely to help a site’s indexing or ranking, but neither is it likely to result in negative consequences.
Hope this helps clarify the issues.
And hey, what’s with the wp-smiley class and/or img attribute creating a newline after each smiley?
>- The “duplicate pages” are within the same domain.
>keep your own internal linking as consistent as possible
Definitely great logic to make internal linking decisions by.
>- There’s no absolute penalty on pages perceived as duplicates.
Wouldn’t two duplicate indexed pages within the same site dilute the “trickle down” link popularity between two pages rather than one? Wouldn’t there be MORE benefit to a site to just have ONE page indexed by virtue of garnering ALL the potential link popularity in the site? Couldn’t this diluted link popularity be PERCEIVED as a penalty?
>neither is it likely to result in negative consequences.
If I link to two pages with the same content from my homepage I would have to say that it WILL have negative consequences. I will waste the value of that internal link juice for a page that will never have value. I would be much better off just having one link to a page with the unique content, and prohibiting search engines from ever seeing the second link if it was absolutely necessary for my users.
If I’m passing off even 5% of my site’s internal link juice to mobile pages, printable pages or the like, I think that definitely IS a negative. That 5% might be what it takes to get me on the next page. This is a game of inches.
I agree that there is most likely no “penalty” for internal site duplicates, but I don’t think you can honestly say there wouldn’t be any negative consequences versus creating NO duplicate content within the site. Then again, I’m from the outside looking in, and could perhaps be missing something.
I’m not saying it ever happens…but serving ZERO duplicate content to search engines should be the IDEAL of most sites.
>-malicious pranksters referring to multiple versions of the same content on someone else’s page
>keep your own internal linking as consistent as possible
Hopefully those pranksters don’t take it a step further by referencing different URLs containing the same content from within (UGC) and from outside someone else’s site.
I’m not sure, but it seems that the permalinks in WordPress can be turned on and off.
Christiae, like greg i’m having some problems with the plugin though…not sure why
I’ve run into the duplicate content issue in several areas, particularly while submitting articles to hundreds of article “farms”. Initially I simply modified a word here, a phrase there. Still, it seemed that my content was too similar. Even though my articles were picked up by many blogs & info sites, they weren’t sufficiently different from each other to avoid the issue. I began using Website Content Wizard (I know some will disagree with its benefits), but found it works well for me. Yeah, it took quite a few “training” sessions to get it to work the way I wanted. Still it seems better than many other applications I have seen. JMO
Well, I just did it with a simple robots.txt file.
The way I did it is
User-agent: *
Disallow: *.php