When Google took away the supplemental index last year, they killed one of the key diagnostic tools in the SEO’s toolbox: the ability to identify which parts of a site were unimportant (and being infrequently crawled) in a search engine’s eyes. However, with the use of some structured text and clever searches, I’m going to give you back that valuable information.
To tell what parts of a site Google (or any search engine, for that matter) thinks are important, what you need is a way of date tagging the last time a search engine visited/indexed the page. It’s not important to know that it was crawled at 16:42:03 on September 8, 2008; it’s important to know that it’s been over 30/60/90 days since a search engine visited that page. What you need to do is put a month and year time stamp somewhere on every page. The stamp has to be generated when the page is served, so the copy sitting in the index shows the month it was last fetched. I recommend the footer, since proximity isn’t a factor. I’m also going to recommend keeping punctuation and special characters out of the time stamp; in my experience Google gets a bit unpredictable when you introduce those elements. I’d suggest you aim for something simple like “Sep 2008”. Next you need to add something that will only appear on your site. The most unique thing will be the site’s proper name (again, omit any punctuation or special characters). So you’ll end up with something like “Joes Widget World Sep 2008”.
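Here is a minimal sketch of how that footer stamp could be generated, in Python. The function name `crawl_stamp` and the hard-coded month list are my own assumptions for illustration; the only parts taken from the post are the "site name plus Mon YYYY" format and the rule about stripping punctuation.

```python
import re
from datetime import date

# Spelled out by hand so the output does not depend on the server's locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def crawl_stamp(site_name: str, today: date) -> str:
    """Return a footer stamp such as 'Joes Widget World Sep 2008'."""
    # Drop punctuation and special characters, keeping letters, digits and spaces.
    cleaned = re.sub(r"[^A-Za-z0-9 ]", "", site_name)
    # Collapse any runs of spaces left behind by the removal.
    cleaned = " ".join(cleaned.split())
    return f"{cleaned} {MONTHS[today.month - 1]} {today.year}"

print(crawl_stamp("Joe's Widget World", date(2008, 9, 8)))
```

In a real template you would pass `date.today()` so the stamp rolls over on its own each month.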
Once you’ve got that in place the next thing to do is wait … at least two full months before you’ll get any good data … yes, really. Let’s assume you put this change into place today; then on December 1st you’d go to Google (or any other search engine) and type in the following query: ["Joes Widget World Sep 2008"] (minus the brackets but with the quotes). The search engine of your choice will then spit out a list of pages with an exact match of the phrase Joes Widget World Sep 2008, in other words a list of pages that haven’t been crawled since September of 2008, over 60 days ago … hopefully you just had a lightbulb moment …
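If you want the list of searches to run in any given month, the arithmetic can be sketched like this. The function `stale_queries` and its parameters are hypothetical names of mine; the only assumption beyond the post is that a stamp three calendar months back (Sep when it is Dec) is what corresponds to "over 60 days".

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def stale_queries(site_name: str, first_stamp: date, today: date,
                  min_age_months: int = 3) -> list[str]:
    """Quoted queries for every stamp from the first one up to the newest stale one."""
    # Count months from year 0 so year boundaries need no special handling.
    start = first_stamp.year * 12 + (first_stamp.month - 1)
    end = today.year * 12 + (today.month - 1) - min_age_months
    queries = []
    for index in range(start, end + 1):
        year, month = divmod(index, 12)
        queries.append(f'"{site_name} {MONTHS[month]} {year}"')
    return queries

# Stamp introduced in September 2008, checked on December 1st.
for query in stale_queries("Joes Widget World", date(2008, 9, 8), date(2008, 12, 1)):
    print(query)
```

Each line it prints is pasted into the search box as is, quotes included.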
One of the problems with this method is that if you have pages that aren’t being crawled now, it may be a while before they are crawled with the new date stamp phrase. Unfortunately there isn’t a 100% foolproof method for telling the search engines to deep crawl and re-index your whole site. The best recommendation I have is to create a complete sitemap and resubmit it. It’s not foolproof or 100% effective, but in many cases it will help.
Once you have identified which pages aren’t being crawled, what do you do with that information? The first thing I’d look at is whether the page is valuable or important. Sometimes site owners or publishers create pages that were important at the time but are now useless. For those pages I’d merge or delete them, making sure to 301 them properly. If a page has value but isn’t being crawled, the next thing I’d do is look to update the copy and freshen it up a bit. Once that’s done I’d put a link to the page (at the same URL) on a what’s new/updated/revised/changed page (that’s hopefully linked to from your homepage). If you’re using WordPress, something like the recently updated code would automate the process. Moving the page closer to the homepage should get it indexed again. The next thing you should do is look for ways to increase the internal linking to that page. Those steps should help keep the page in the index. I’d suggest keeping a log of your actions so you can see what’s going on over time. If you find that the same pages keep re-appearing for these old datestamp searches, it’s probably an indication that there is something wrong with your architecture or internal linking.
It would be really cool if Google Alerts supported monthly frequency, so you could put the phrase in for every month of the year, automate the process, and work smart not hard, but that’s currently not an option.
{ 9 comments }
Michael,
While not perfect, would you be satisfied knowing that spider x downloaded page y on date z?
I realise that downloading a page doesn’t guarantee that it’ll be placed into the index, however if that information was enough – you could use web server logs to work out what has/hasn’t been downloaded recently. It’d be a lot faster and would avoid the long delay waiting for the search engines to kick into gear.
Thoughts?
Al.
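For anyone who does have the raw logs, the idea in the comment above can be sketched in a few lines of Python. This is only an illustration: it assumes the common "combined" access log format, matches spiders by a plain substring of the user agent (which can be spoofed), and the names `last_crawled` and `not_crawled_since` plus the sample log lines are made up.

```python
import re
from datetime import datetime, timezone

# Pulls the timestamp, request path and user agent out of a combined-format line.
LOG_LINE = re.compile(
    r'\[(?P<when>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def last_crawled(lines, bot_token="Googlebot"):
    """Map each path to the most recent time the given spider requested it."""
    latest = {}
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or bot_token not in match.group("agent"):
            continue
        when = datetime.strptime(match.group("when"), "%d/%b/%Y:%H:%M:%S %z")
        path = match.group("path")
        if path not in latest or when > latest[path]:
            latest[path] = when
    return latest

def not_crawled_since(latest, known_paths, cutoff):
    """Paths the spider has not requested on or after the cutoff, or ever."""
    return [p for p in known_paths if p not in latest or latest[p] < cutoff]

sample = [
    '66.249.66.1 - - [08/Sep/2008:16:42:03 +0000] "GET /widgets/blue.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [20/Nov/2008:09:10:11 +0000] "GET /index.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.5 - - [25/Nov/2008:12:00:00 +0000] "GET /widgets/blue.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows; U) Firefox/3.0"',
]
latest = last_crawled(sample)
cutoff = datetime(2008, 11, 1, tzinfo=timezone.utc)
print(not_crawled_since(latest, ["/index.html", "/widgets/blue.html", "/about.html"], cutoff))
```

As the comment itself notes, this shows what was downloaded, not what made it into the index.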
@Al: Agreed, but in many cases getting raw log files is a PITA. The amount of red tape you have to go through to get log files once, never mind on a regular basis, from any IT department is ridiculous. It’s the exact same reason people use JS tracking bugs instead of raw log files for analysis: you remove the IT department from the equation and get quicker access to data which is only slightly less accurate.
We have implemented a php tracking system that logs all spider visits and lets you see crawl data for any page on demand.
@Patrick Altoft: obviously that’s a much better solution; this is more of a quick and dirty implementation that takes 5 minutes to do.
The php solution is easy too. User agent, date and URL -> insert into a database. Include the hitlogging script on every page. I think I implemented it in <5 minutes on my blog.
I also put in variables such as referrer — actually, referrer was my primary curiosity for implementing it.
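The commenters' scripts are PHP and are not shown here, so what follows is only a rough Python and SQLite sketch of the same idea (user agent, date, URL and referrer into a database). The table name, the function names and the list of spider tokens are all assumptions of mine.

```python
import sqlite3
from datetime import datetime, timezone

# Assumed user-agent substrings; extend for the spiders you care about.
BOT_TOKENS = ("googlebot", "bingbot", "slurp")

def open_log(path=":memory:"):
    """Open (or create) the hit log database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spider_hits "
        "(url TEXT, user_agent TEXT, referrer TEXT, seen_at TEXT)"
    )
    return conn

def log_hit(conn, url, user_agent, referrer="", now=None):
    """Store the request if the user agent looks like a spider; return True if stored."""
    agent = user_agent.lower()
    if not any(token in agent for token in BOT_TOKENS):
        return False
    seen_at = (now or datetime.now(timezone.utc)).isoformat()
    with conn:  # commits on success
        conn.execute(
            "INSERT INTO spider_hits VALUES (?, ?, ?, ?)",
            (url, user_agent, referrer, seen_at),
        )
    return True

def last_visit(conn, url):
    """Most recent logged spider visit for a URL as an ISO string, or None."""
    # ISO timestamps in one time zone (UTC here) sort correctly as text.
    row = conn.execute(
        "SELECT MAX(seen_at) FROM spider_hits WHERE url = ?", (url,)
    ).fetchone()
    return row[0]

conn = open_log()
log_hit(conn, "/widgets/blue.html", "Mozilla/5.0 (compatible; Googlebot/2.1)",
        now=datetime(2008, 9, 8, 16, 42, 3, tzinfo=timezone.utc))
print(last_visit(conn, "/widgets/blue.html"))
```

Calling `log_hit` from whatever code serves each page is the equivalent of including the hitlogging script on every page.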
Mike,
Great, great recommendation! I really like that you (once again) spurred conversation that allowed others to contribute to a problem that needs some relatively solid solutions. I’m not a website programmer, per se, but I understand what Search Engines want to see (hence, my role as ‘consultant’), and I feel your technique gives me the coke-bottle-glasses needed to check out those sometimes-latent-sometimes-not pages!
Digging through raw logs, as you said, can make the bum quite uncomfortable; a lack of extended PHP knowledge keeps me from building the dbase. I really, *really* love 5 minute solutions (hello!? template update: <3 minutes!), so you get mad points from me for this one! (and some link juice, too…)
Thanks a bunch!
SEO Chatter
There is an excellent tool called Crawltrack, written in php/mysql. But having a page crawled does not guarantee it has been indexed. Isn’t the -allinurl operator useful to filter the supplemental index?
What makes you think that the list you get from Google in December will be anywhere near accurate? I am sure you don’t, Michael, but I just thought I would say it.
Do crawlers go through PDF files available for download?