If you want a tool to crawl through your site looking for 404 or 500 errors, there are online tools (e.g. the W3C’s online link checker), browser plugins for Firefox and Chrome, and Windows programs like Xenu’s Link Sleuth.
A unix link checker
Today I found linkchecker - available as a unix command-line program (although it also has a GUI and a web interface).
Install the command-line tool
On Ubuntu, you can install the command-line tool with a single command:
sudo apt-get install linkchecker
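You can check that the install worked by asking it for its version (the exact output will depend on the release you get):
linkchecker --version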
Using linkchecker
Like any good command-line program, it has a manual page, but it can be a bit daunting to read, so I give some shortcuts below.
By default, linkchecker will give you a lot of warnings. It’ll warn you about any links that result in 301s, report all 404s, timeouts and so on, and give you status updates every second or so.
Robots.txt
linkchecker will not crawl a website that is disallowed by a robots.txt file, and there’s no way to override that. The solution is to change the robots.txt file to allow linkchecker through:
User-Agent: *
Disallow: /
User-Agent: LinkChecker
Allow: /
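If you want to confirm what the site is actually serving before you run a crawl, you can fetch the file directly (assuming you have curl available):
$ curl http://example.com/robots.txt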
Redirecting output
linkchecker seems to expect you to redirect its output to a file. If you do so, it will only put the actual warnings and errors in the file, and report status to the command line:
$ linkchecker http://example.com > siteerrors.log
35 URLs active, 0 URLs queued, 13873 URLs checked, runtime 1 hour, 51 minutes
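If you want to keep an eye on the errors as they come in, you can follow the log file from another terminal while the check runs:
$ tail -f siteerrors.log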
Timeout
If you’re testing a development site, it’s quite likely to be fairly slow to respond, and linkchecker may experience many timeouts, so you probably want to increase the timeout:
$ linkchecker --timeout=300 http://example.com > siteerrors.log
Ignore warnings
I don’t know about you, but the sites I work on have loads of errors. I want to find 404s and 50*s before I worry about redirect warnings.
$ linkchecker --timeout=300 --no-warnings http://example.com > siteerrors.log
Output type
The default text output is fairly verbose. For easy readability, you probably want the logging to be in CSV format:
$ linkchecker --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv
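A handy side effect of CSV output is that you can pull quick summaries out of it with ordinary command-line tools. As a rough sketch (the exact columns vary between linkchecker versions, so don’t treat this as precise), this counts the lines that mention a 404:
$ grep -c '404' siteerrors.csv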
Other options
If you find and fix all your basic 404 and 50* errors, you might then want to turn warnings back on (remove --no-warnings) and start using --check-html and --check-css.
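For example, a fuller check at that point might look something like this (depending on your linkchecker version, --check-html and --check-css may need extra dependencies installed):
$ linkchecker --timeout=300 --check-html --check-css -ocsv http://example.com > siteerrors.csv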
Checking websites with OpenID (2014-04-17 update)
Today I had to use linkchecker to check a site which required authentication with Canonical’s OpenID system. A StackOverflow answer helped me immensely here.
I first accessed the site as normal with Chromium, opened the console window and dumped all the cookies that were set on that site:
> document.cookie
"__utmc="111111111"; pysid=1e53e0a04bf8e953c9156ea841e41157;"
I then saved these cookies in cookies.txt in a format that linkchecker will understand:
Host:example.com
Set-cookie: __utmc="111111111"
Set-cookie: pysid="1e53e0a04bf8e953c9156ea841e41157"
And included it in my linkchecker command with --cookiefile:
linkchecker --cookiefile=cookies.txt --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv
Use it!
If you work on a website of any significant size, there are almost certainly dozens of broken links and other errors, and a link checker will crawl through the site and find them for you. Link checking your website may seem obvious, but in my experience hardly any dev teams do it regularly.
You might well want to use linkchecker to do automated link checking! I haven’t implemented this yet, but I’ll try to let you know when I do.
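For what it’s worth, here’s a rough sketch of the kind of script you could run from cron or a CI job. It assumes linkchecker exits with a non-zero status when it finds broken links (worth checking against your version), and the URL and filenames are just placeholders:
#!/bin/sh
# check-links.sh - minimal sketch of an automated link check
# assumes linkchecker exits non-zero when it finds broken links
if ! linkchecker --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv; then
    # siteerrors.csv now holds the problems; mail it, post it to chat, etc.
    echo "Broken links found - see siteerrors.csv"
    exit 1
fi
Run nightly, something like that gives you a regular report without anyone having to remember to kick it off.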