Michael Gray

How To Figure Out What Parts of Your Website Aren’t Being Crawled

Posted on September 10th, 2008
by Michael Gray in SEO



When Google took away the supplemental index last year, they killed one of the key diagnostic tools in the SEO’s toolbox: the ability to identify which parts of a site were unimportant (and being infrequently crawled) in a search engine’s eyes. However, with the use of some structured text and clever searches, I’m going to give you back that valuable information.

To tell what parts of a site Google (or any search engine, for that matter) thinks are important, you need a way of date-tagging the last time a search engine visited/indexed each page. It’s not important to know that a page was crawled at 16:42:03 on September 8, 2008; it’s important to know that it’s been over 30/60/90 days since a search engine visited that page. What you need to do is put a month and year time stamp somewhere on every page. I recommend the footer, since proximity isn’t a factor. The stamp has to update on its own each month (generate it in your template rather than typing it by hand), so that whatever stamp a search engine has indexed tells you the month it last fetched the page. I’m also going to recommend keeping punctuation and special characters out of the time stamp; in my experience Google gets a bit unpredictable when you introduce those elements. I’d suggest you aim for something simple like “Sep 2008”. Next you need to add something that will only appear on your site. The most unique thing will be the site’s proper name (again, omit any punctuation or special characters). So you’ll end up with something like “Joes Widget World Sep 2008”.
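Here is a rough sketch of how a footer template could build that stamp. It is only an illustration: the function name `crawl_stamp` and the `SITE_NAME` constant are made up for this example, and your CMS will have its own way of dropping text into the footer.

```python
from datetime import date

# Hypothetical site name for illustration; swap in your own.
SITE_NAME = "Joe's Widget World"

# Fixed English abbreviations, so the stamp does not depend on the server locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def crawl_stamp(site_name: str, today: date) -> str:
    """Build a punctuation-free stamp such as 'Joes Widget World Sep 2008'."""
    # Keep only letters, digits and whitespace, dropping punctuation
    # and special characters as recommended above.
    cleaned = "".join(ch for ch in site_name if ch.isalnum() or ch.isspace())
    # Collapse any runs of whitespace left behind.
    cleaned = " ".join(cleaned.split())
    return f"{cleaned} {MONTHS[today.month - 1]} {today.year}"


# In a real footer you would pass date.today() so the stamp rolls over monthly.
print(crawl_stamp(SITE_NAME, date(2008, 9, 10)))
```

Because the stamp is computed on every page load, a page that is re-crawled in October will show the October stamp, and only pages nobody has fetched since September keep matching the September phrase.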

Once you’ve got that in place, the next thing to do is wait … at least two full months before you’ll get any good data … yes, really. Let’s assume you put this change into place today; then on December 1st you’d go to Google (or any other search engine) and type in the following query: ["Joes Widget World Sep 2008"] (minus the brackets but with the quotes). The search engine of your choice will then spit out a list of pages with an exact match for the phrase Joes Widget World Sep 2008, in other words a list of pages that haven’t been crawled since September of 2008, over 60 days ago … hopefully you just had a lightbulb moment …
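If you keep this up for several months, working out which stamp phrases are old enough to be worth searching gets tedious. A small sketch of that date arithmetic follows; `stale_queries` and its parameters are names invented for this example, and it only builds the query strings for you to paste into a search engine by hand.

```python
from datetime import date, timedelta

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def stale_queries(site_name, first_stamp, today, min_age_days=60):
    """Return quoted queries for every stamp month that is old enough.

    A page still showing a given month's stamp was last fetched no later
    than the final day of that month, so that day is used to measure age.
    """
    queries = []
    year, month = first_stamp.year, first_stamp.month
    while True:
        # Work out the first day of the following month, then step back a day.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        month_end = date(next_year, next_month, 1) - timedelta(days=1)
        if (today - month_end).days < min_age_days:
            break  # this month and all later ones are too recent
        queries.append(f'"{site_name} {MONTHS[month - 1]} {year}"')
        year, month = next_year, next_month
    return queries


# Stamps went live in September 2008; it is now December 1st.
for query in stale_queries("Joes Widget World", date(2008, 9, 1), date(2008, 12, 1)):
    print(query)
```

On December 1st only the September phrase qualifies, since a page stamped October could have been crawled as recently as October 31st, which is about a month ago.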

One of the problems with this method is that if you have pages that aren’t being crawled now, it may be a while before they are crawled with the new date stamp phrase. Unfortunately there isn’t a 100% foolproof method for telling the search engines to deep crawl and re-index your whole site. The best recommendation I have is to create a complete sitemap and resubmit it. It’s not foolproof or 100% effective, but in many cases it will help.
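If you don’t already have a tool that produces a sitemap, a bare-bones one in the sitemaps.org XML format is easy to generate. This is a minimal sketch, assuming you can get a full list of your URLs from somewhere (your CMS, a database, a crawl); the `build_sitemap` name and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as & in the URL text.
        ET.SubElement(url, "loc").text = loc
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Placeholder URLs; in practice, list every page you want crawled.
pages = [
    "http://www.example.com/",
    "http://www.example.com/widgets",
]
print(build_sitemap(pages))
```

Save the output as sitemap.xml at the root of your site and resubmit it through the search engine’s webmaster tools.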

Once you have identified what pages aren’t being crawled, what do you do with that information? The first thing I’d look at is whether the page is valuable or important. Sometimes site owners or publishers create pages that were important at the time but are now useless. For those pages I’d merge or delete them, making sure to 301 them properly. If a page has value but isn’t being crawled, the next thing I’d do is look to update the copy and freshen it up a bit. Once that’s done I’d put a link to the page (at the same URL) on a what’s new/updated/revised/changed page (that’s hopefully linked to from your homepage). If you’re using WordPress, something like the recently updated code would automate the process. Moving the page closer to the homepage should get it indexed again. The next thing you should do is look for ways to increase the internal linking to that page. Those steps should help keep the page in the index. I’d suggest keeping a log of your actions so you can see what’s going on over time. If you find that the same pages keep reappearing for these old date stamp searches, it’s probably an indication that there is something wrong with your architecture or internal linking.

It would be really cool if Google Alerts supported monthly frequency, so you could put the phrase in for every month of the year, automate the process, and work smart not hard, but that’s currently not an option.


21 Responses to “How To Figure Out What Parts of Your Website Aren’t Being Crawled”

  1. Al Says:

    Michael,

    While not perfect, would you be satisfied knowing that spider x downloaded page y on date z?

    I realise that downloading a page doesn’t guarantee that it’ll be placed into the index, however if that information was enough - you could use web server logs to work out what has/hasn’t been downloaded recently. It’d be a lot faster and would avoid the long delay waiting for the search engines to kick into gear.

    Thoughts?

    Al.

  2. Michael Gray Says:

    @Al: Agreed, but in many cases getting raw log files is a PIA. The amount of red tape you have to go through to get log files once, never mind on a regular basis, from any IT department is ridiculous. It’s the exact same reason people use JS tracking bugs instead of raw log files for analysis: you remove the IT department from the equation, and get quicker access to data which is only slightly less accurate.

  3. Patrick Altoft Says:

    We have implemented a php tracking system that logs all spider visits and lets you see crawl data for any page on demand.

  4. Michael Gray Says:

    @Patrick Altoft: obviously that’s a much better solution; this is more of a quick down and dirty implementation that takes 5 minutes to do.

  5. Nat Arem Says:

    The php solution is easy too. User agent, date and URL -> insert into a database. Include the hitlogging script on every page. I think I implemented it in <5 minutes on my blog.

    I also put in variables such as referrer — actually, referrer was my primary curiosity for implementing it.

  6. Howling for Supplemental Results « SEO Chatter: What’s the buzz, man? Says:

    [...] clipped from http://www.wolf-howl.com [...]

  7. Chat Man Says:

    Mike,

    Great, great recommendation! I really like that you (once again) spurred conversation that allowed others to contribute to a problem that needs some relatively solid solutions. I’m not a website programmer, per se, but I understand what Search Engines want to see (hence, my role as ‘consultant’), and I feel your technique gives me the coke-bottle-glasses needed to check out those sometimes-latent-sometimes-not pages!

    Digging through raw logs, as you said, can make the bum quite uncomfortable; a lack of extended PHP knowledge keeps me from building the dbase. I really, *really* love 5 minute solutions (hello!? template update: <3 minutes!), so you get mad points from me for this one! (and some link juice, too…)

    Thanks a bunch!
    SEO Chatter

  8. yvonh Says:

    There is an excellent tool called Crawltrack, written in PHP/MySQL. But having a page crawled does not guarantee it has been indexed. Isn’t the -allinurl operator useful to filter the supplemental index?

  9. Jaan Kanellis Says:

    What makes you think that the list you get from Google in December will be anywhere near accurate? I am sure you don’t, Michael, but I just thought I would say it.

  10. Debunking Michael Gray’s ineffective Google inclusion tracking method | Best SEO Blog Says:

    [...] Gray thinks he has figured out which parts of your Web site are not being crawled. He says put a date stamp on your pages. Combined with something unique to your site (he suggests the site name), you’ll create a [...]

  11. SEO Team Reading List 9.11.08 » (EMP) E-Marketing Performance Says:

    [...] How To Figure Out What Parts of Your Website Aren’t Being Crawled Like this post? Subscribe to the RSS feed and get lots more! Leave a comment or trackback from your own site. Posted in SEO, Team Reading [...]

  12. Is Google Crawling & Indexing All of My Pages? | Kooshy - Sneaky Search Marketing Says:

    [...] Gray has composed a post that helps SEOs find out which pages of their site haven’t been crawled, which becomes increasingly more important due to Google’s removal of the supplemental index. [...]

  13. Weekly Search Buzz Roundup - 09/12/08: Google Satellite, Google News Archive & Yahoo September Traffic Increases | Kooshy - Sneaky Search Marketing Says:

    [...] If you want to see which pages are in Google’s “supplemental results”, follow the instructions provided by Michael Gray. [...]

  14. SEO News & Interesting Links Says:

    [...] Gray on how to figure out what parts of your site are not being crawled regularly. Check out the comments on that post for more tips. If you use Wordpress, you might find this crawl [...]

  15. » Identifying Crawl Rates As Part Of SEO Search Engine Optimization Journal - SEO and Search Engine Marketing Blog Says:

    [...] Gray Wolf’s SEO Blog has a post that discusses the issue in some depth. The most difficult part of the concept is getting reliable crawl rate data. It’s not impossible and Gray Wolf’s post details one method. [...]

  16. SEO News & Interesting Links | ThePagerank.com Says:

    [...] Gray on how to figure out what parts…. Check out the comments on that post [...]

  17. Sweta Says:

    Do crawlers go through PDF files available for download?

  18. URL Canonicalization: The Missing Manual | ThePagerank.com Says:

    [...] Michael Gray at Wolf-Howl” outlines a method to easily check for this data. In summary, you add a date and unique field to each page, wait a couple of months, then search on this term. [...]

  19. URL Canonicalization: The Missing Manual | SEO Tips Mashup Says:

    [...] Michael Gray at Wolf-Howl” outlines a method to easily check for this data. In summary, you add a date and unique field to each page, wait a couple of months, then search on this term. [...]

  20. A Playoff Worthy Lineup Of Links - This Month In SEO - 9/08 | TheVanBlog | Van SEO Design Says:

    [...] How To Figure Out What Parts of Your Website Aren’t Being Crawled [...]

  21. How Important is Branding to Search Engine Marketing?- AFX Blast Says:

    [...] Michael Gray at Wolf-Howl” outlines a method to easily check for this data. In summary, you add a date and unique field to each page, wait a couple of months, then search on this term. [...]
