How to use the Unix linkchecker tool to thoroughly check any site

If you want a tool to crawl through your site looking for 404 or 500 errors, there are online tools (e.g. the W3C’s online link checker), browser plugins for Firefox and Chrome, or Windows programs like Xenu’s Link Sleuth.

Today I found linkchecker, which is available as a Unix command-line program (although it also has a GUI and a web interface).

Install the command-line tool

On Ubuntu, installing the command-line tool is simple:

sudo apt-get install linkchecker
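
If your distribution doesn’t package it, it should also be installable from PyPI with pip (an assumption on my part; check the project’s documentation for your platform):

pip install linkchecker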

Using linkchecker

Like any good command-line program, it has a manual page, but the manual can be a bit daunting to read, so here are some shortcuts.

By default, linkchecker gives you a lot of warnings. It warns you about any links that result in 301s, as well as reporting all 404s, timeouts and other errors, and it prints a status update every second or so.

Robots.txt

linkchecker will not crawl a website that is disallowed by a robots.txt file, and there’s no way to override that. The solution is to change the robots.txt file to allow linkchecker through:

User-Agent: *
Disallow: /
User-Agent: LinkChecker
Allow: /
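
To double-check what crawlers will actually see, you can fetch the file directly (plain curl, nothing linkchecker-specific):

$ curl http://example.com/robots.txt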

Redirecting output

linkchecker seems to expect you to redirect its output to a file. If you do, it will write only the actual warnings and errors to the file, and report its status to the terminal:

$ linkchecker http://example.com > siteerrors.log
35 URLs active,     0 URLs queued, 13873 URLs checked, runtime 1 hour, 51 minutes
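
Since the errors go to the file and the status stays on the terminal, you can watch the errors arrive from a second terminal while the crawl runs (this is just standard tail, nothing linkchecker-specific):

$ tail -f siteerrors.log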

Timeout

If you’re testing a development site, it’s quite likely to be fairly slow to respond, so linkchecker may hit a lot of timeouts. You probably want to increase the timeout (the value is in seconds):

$ linkchecker --timeout=300 http://example.com > siteerrors.log

Ignore warnings

I don’t know about you, but the sites I work on have loads of errors. I want to find 404s and 50*s before I worry about redirect warnings.

$ linkchecker --timeout=300 --no-warnings http://example.com > siteerrors.log

Output type

The default text output is fairly verbose. For easier reading and filtering, you probably want the output in CSV format:

$ linkchecker --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv
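
Once you have CSV output, standard tools work well for filtering it. For example, a rough way to pull out the 404s (this just matches the string anywhere in the line, so it’s an approximation rather than a proper CSV parse):

$ grep '404' siteerrors.csv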

Other options

Once you’ve found and fixed all your basic 404 and 50* errors, you might want to turn warnings back on (remove --no-warnings) and start using --check-html and --check-css.
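
For example, something like this (the flags are the ones named above; exact behaviour will depend on your linkchecker version):

$ linkchecker --timeout=300 --check-html --check-css http://example.com > siteerrors.log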

Checking websites with OpenID (2014-04-17 update)

Today I had to use linkchecker to check a site which required authentication with Canonical’s OpenID system. A Stack Overflow answer helped me immensely here.

I first accessed the site as normal in Chromium, opened the developer console and dumped all the cookies that were set for that site:

> document.cookie
"__utmc="111111111"; pysid=1e53e0a04bf8e953c9156ea841e41157;"

I then saved these cookies in cookies.txt in a format that linkchecker will understand:

Host:example.com
Set-cookie: __utmc="111111111"
Set-cookie: pysid="1e53e0a04bf8e953c9156ea841e41157"

And included it in my linkchecker command with --cookiefile:

linkchecker --cookiefile=cookies.txt --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv

Use it!

If you work on a website of any significant size, it almost certainly contains dozens of broken links and other errors, and a link checker will crawl the whole site to find them for you.

Link checking your website may seem obvious, but in my experience hardly any dev teams do it regularly.

You might well want to use linkchecker to do automated link checking! I haven’t implemented this yet, but I’ll try to let you know when I do.
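
In case it’s useful, here’s a minimal sketch of what a nightly cron job might run. It assumes linkchecker exits with a non-zero status when it finds errors, and that a working mail command is available on the box; check both before relying on it:

#!/bin/sh
# check-links.sh - untested sketch of an automated link check
# Assumes linkchecker exits non-zero when it finds broken links
linkchecker --timeout=300 --no-warnings -ocsv http://example.com > siteerrors.csv \
  || mail -s "Broken links on example.com" webmaster@example.com < siteerrors.csv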

By @nottrobin