404 Errors — Finding Errors for Missing Pages
Tagged web stats packages such as Google Analytics, WebTrends, and Coremetrics are superior to the server logfile stats packages that preceded them in almost every way: richer data, easier maintenance, interoperability with other software packages, etc. Sure, when the tag approach first appeared on the scene, people were hesitant about installation issues: “You mean I have to modify every single page on my website!?!” Heck, I remember a meeting years ago when I explained to the Coremetrics founders how a Perl hacker could quickly write a script to add their JavaScript tags to all the files in a directory.
Installation issues largely turned out to be a red herring. People quickly came to understand that if it was hard to tag every page, it pointed to larger problems with the website’s information architecture. For modern web development with content management systems, templates, include files, wrappers, MVC architectures, or what have you, modifying every page on a website is almost too easy.
Still, there is one area where logfile stats packages such as Analog, AWStats, and Webalizer surpass their tagged counterparts: reporting server status codes. Was that page redirect an SEO-friendly 301 (moved permanently) or a nasty 302 (moved temporarily)? While answering this question is simple using a logfile stats package, it requires a bit of coding gymnastics for tagged packages. Since the server answers the request for the originating page with a redirect instead of delivering the page, there is nowhere for the JavaScript tracking tag to be placed. If you are worried about search engine optimization (SEO), do not let out-of-sight become out-of-mind. Check your page redirects every month or so and after every major site redesign.
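For readers without a logfile stats package at hand, here is a minimal sketch of the same check done directly against an access log. It assumes the server writes Apache-style Common or Combined Log Format; the function name `findRedirects` and the sample lines are hypothetical and only there for illustration.

```javascript
// Matches the request and status fields of a Common/Combined Log Format line:
//   host ident user [time] "METHOD path protocol" status bytes ...
const LOG_LINE = /^\S+ \S+ \S+ \[[^\]]+\] "(\S+) (\S+)[^"]*" (\d{3}) /;

// Returns one entry per redirected request, labelled permanent or temporary.
function findRedirects(lines) {
  const redirects = [];
  for (const line of lines) {
    const match = LOG_LINE.exec(line);
    if (!match) continue; // skip lines that are not in the assumed format
    const status = Number(match[3]);
    if (status === 301 || status === 302) {
      redirects.push({
        path: match[2],
        status,
        kind: status === 301 ? 'permanent' : 'temporary',
      });
    }
  }
  return redirects;
}

// Hypothetical sample lines, not taken from a real log.
const sampleLog = [
  '192.0.2.1 - - [22/Sep/2008:16:05:00 -0700] "GET /old-page.html HTTP/1.1" 301 234',
  '192.0.2.2 - - [22/Sep/2008:16:06:00 -0700] "GET /promo HTTP/1.1" 302 210',
  '192.0.2.3 - - [22/Sep/2008:16:07:00 -0700] "GET /index.html HTTP/1.1" 200 5120',
];
console.log(findRedirects(sampleLog));
```

Any path that shows up with `kind: 'temporary'` is a candidate for switching to a 301.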
The out-of-sight, out-of-mind approach made easier by tagged web stats packages has led to an even more pervasive problem: the resurgence of the dastardly 404. Since I have already dated myself in this post, I can point out that back in the olden days the first thing I would do when I fired up my trusty Analog was review the 404 Missing Pages Report. If I found a broken internal link, I would have a conniption fit and immediately run a broken link checker tool. I would investigate every external broken link, and try to get the offending website to fix the problem. If I failed to get an external website to change, I would put up a dummy page with a meta-refresh to the proper page (dang, those are like 302 redirects!).
Now back in the day, we were mainly concerned with stomping out 404 errors for usability reasons. Common wisdom among web designers/developers was that most people were new to the Internet, so if they came upon a missing page they would become flustered, curse you to hell, abandon your site, and never come back. In order to avoid this tragic chain of events, web folk learned to stomp 404 errors and create an attractive friendly 404 page for those we could not find ahead of time.
Current visitor reactions to missing pages might not be as severe as they once were, but missing pages still create a poor user experience. Human visitors, though, are not the only ones who might follow a broken link. Missing pages can also negatively impact SEO efforts. Some optimizers claim that a website with too many (whatever that means) missing pages will incur a site wide penalty. While I do not concur with this view, external links leading to missing pages can reduce PageRank. Not only does a missing page generally not have any PageRank, but it loses the opportunity to funnel PageRank to extant pages.
Based on my recent experience with SEO engagements, tagged web stats packages have led to an explosion of missing pages. Generally, when I ask a website’s “tech person” to send me last month’s Missing Pages Report, I am met with a blank stare. After I explain the concept, several days later I get emailed a file so large that it has to be zipped to get by the email filters. Amazing.
So, let’s fix the problem. The first step to stomping out 404 errors is to find out which pages are missing. You can run a link checker program, but that will only find internal errors. You can use Google’s Webmaster Tools, but they don’t tell you which missing pages are the most frequently attempted or the referring source. You can go old school and fire up a logfile stats program, which works well and also allows you to check your redirects. The problem with the logfile approach, though, is that if you use it for just one purpose it is hard to incorporate into your normal routine.
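To make the old-school option concrete, here is a rough sketch of what a Missing Pages Report boils down to: count the 404 responses in the access log by requested path and referrer. It assumes Apache-style Combined Log Format; `missingPagesReport` and the sample lines are hypothetical names and data, not part of any stats package.

```javascript
// Combined Log Format adds "referrer" and "user-agent" after the byte count:
//   host ident user [time] "METHOD path protocol" status bytes "referrer" "agent"
const COMBINED = /^\S+ \S+ \S+ \[[^\]]+\] "\S+ (\S+)[^"]*" (\d{3}) \S+ "([^"]*)"/;

// Counts 404 responses per (path, referrer) pair, most frequent first.
function missingPagesReport(lines) {
  const counts = new Map();
  for (const line of lines) {
    const match = COMBINED.exec(line);
    if (!match || match[2] !== '404') continue; // keep only missing pages
    const key = JSON.stringify([match[1], match[3]]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([key, hits]) => {
      const [path, referrer] = JSON.parse(key);
      return { path, referrer, hits };
    })
    .sort((a, b) => b.hits - a.hits);
}

// Hypothetical sample lines, not taken from a real log.
const accessLog = [
  '192.0.2.1 - - [22/Sep/2008:16:05:00 -0700] "GET /missing.html HTTP/1.1" 404 512 "http://example.com/links.html" "Mozilla/5.0"',
  '192.0.2.2 - - [22/Sep/2008:16:06:00 -0700] "GET /missing.html HTTP/1.1" 404 512 "http://example.com/links.html" "Mozilla/5.0"',
  '192.0.2.3 - - [22/Sep/2008:16:07:00 -0700] "GET /gone.html HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
  '192.0.2.4 - - [22/Sep/2008:16:08:00 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
];
console.log(missingPagesReport(accessLog));
```

A referrer of `-` means the browser sent none, which usually points to a bookmark or a typed address.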
My preferred approach is to raise the visibility of missing pages by incorporating them into your tagged stats package. Most tagged stats packages allow you to explicitly generate a pageview with whatever address you want. Using this feature with a little JavaScript lets you attach the missing page and the referrer to that address. Google Analytics has a good help page explaining how to configure the tracking tag for your custom 404 page.
pageTracker._trackPageview("/404.html?page=" + document.location.pathname + document.location.search + "&from=" + document.referrer);
Once you’ve modified the tracking code on your custom 404 page, you can search for 404 in the Top Content report.
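One wrinkle with a tracking call of this kind: if the missing URL or the referrer carries its own query string, the `&` and `=` characters blend into the virtual address and the two values become hard to separate later. The sketch below is my own variation, not the snippet from Google's help page. It percent-encodes both values and shows how to decode them again; `build404Path` and `parse404Path` are hypothetical helper names.

```javascript
// Builds the virtual pageview address for the custom 404 page.
// Both values are percent-encoded so their own "?", "&" and "=" characters
// cannot be confused with the page= and from= parameters.
function build404Path(pathname, search, referrer) {
  return '/404.html?page=' + encodeURIComponent(pathname + search) +
    '&from=' + encodeURIComponent(referrer);
}

// Recovers the missing page and the referrer from a virtual address
// produced by build404Path, e.g. when reading an exported report.
function parse404Path(virtualPath) {
  const query = virtualPath.slice(virtualPath.indexOf('?') + 1);
  const params = new URLSearchParams(query);
  return { page: params.get('page'), from: params.get('from') };
}

// Hypothetical values standing in for document.location and document.referrer.
const virtualPath = build404Path('/old', '?id=7&x=1', 'http://example.com/a?b=c');
console.log(virtualPath);
console.log(parse404Path(virtualPath));
```

On the 404 page itself you would pass `document.location.pathname`, `document.location.search`, and `document.referrer` to `build404Path` and hand the result to `pageTracker._trackPageview`. The trade-off is that the addresses in the Top Content report are less readable until decoded.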

Configuring your tagged stats package to record missing pages and their referrers works pretty well. Once you have the relative frequency of each missing page, you can set priorities for fixing broken links on your own site, contacting other sites, creating 301 redirects to the closest matching page, etc. It could work even better, though. Google Analytics and the other tagged stats packages should really create an explicit missing pages report that can be added to dashboards. The more people know about missing pages, the more likely they will be to stomp them out and improve the Internets for all of us.
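Until such a report exists, the priority list can be assembled by hand from an export of the Top Content report. The following is a sketch under several assumptions: the rows have been exported into objects with `path` and `pageviews` fields (hypothetical field names), the addresses follow the unencoded `/404.html?page=...&from=...` pattern, and `prioritize` and `splitVirtualPath` are illustrative helpers rather than anything Google Analytics provides.

```javascript
// Splits a virtual pageview address of the form
//   /404.html?page=<missing page>&from=<referrer>
// Ambiguous if the missing page itself contains "&from="; fine for a sketch.
function splitVirtualPath(virtualPath) {
  const pageStart = virtualPath.indexOf('?page=');
  const fromStart = virtualPath.indexOf('&from=');
  if (pageStart === -1 || fromStart === -1) return null;
  return {
    page: virtualPath.slice(pageStart + '?page='.length, fromStart),
    from: virtualPath.slice(fromStart + '&from='.length),
  };
}

// Totals hits per missing page and splits them by where the visitor came from:
// internal (fix your own link), external (contact the site or add a 301),
// or direct (no referrer, so a 301 is the only fix available).
function prioritize(rows, siteHost) {
  const byPage = new Map();
  for (const row of rows) {
    const parts = splitVirtualPath(row.path);
    if (!parts) continue; // not a 404 pageview
    const entry = byPage.get(parts.page) ||
      { page: parts.page, hits: 0, internal: 0, external: 0, direct: 0 };
    let host = '';
    try {
      host = new URL(parts.from).hostname;
    } catch (e) {
      // empty or malformed referrer: treat as direct traffic
    }
    const bucket = host === '' ? 'direct' : host === siteHost ? 'internal' : 'external';
    entry[bucket] += row.pageviews;
    entry.hits += row.pageviews;
    byPage.set(parts.page, entry);
  }
  return [...byPage.values()].sort((a, b) => b.hits - a.hits);
}

// Hypothetical export rows for illustration.
const exportedRows = [
  { path: '/404.html?page=/old.html&from=http://www.example.com/index.html', pageviews: 5 },
  { path: '/404.html?page=/old.html&from=http://other.example.org/links', pageviews: 3 },
  { path: '/404.html?page=/gone.html&from=', pageviews: 2 },
  { path: '/index.html', pageviews: 100 },
];
console.log(prioritize(exportedRows, 'www.example.com'));
```

Pages with a high `internal` count are the quickest wins, since the broken link is on a site you control.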
September 22nd, 2008 at 4:05 pm
Until now, I have used Xenu LinkSleuth to crawl a site and to then find all broken internal links, and the Crawl Errors reporting within Google WMT to find incoming broken links.
Your method adds yet one more tool to the armoury to be used in chasing out technical problems within a website.