Friday, August 6, 2010

Detect broken links on a Web site using wget

The other day, I started thinking about writing a simple Web site validator that detects broken links, similar to the W3C Link Checker. After looking at various code samples in Java, Ruby, etc., I found that GNU wget 1.12 on my Linux machine could do the job just fine, with no programming required. It even detected broken resource links in CSS, not just broken <a> links.

Here is how to write a simple script to check the site. First, pretend to be a Mozilla-based browser and spider the site to a depth of one level, writing wget's log to wget_errors.txt:

wget --spider -r -l 1 --header='User-Agent: Mozilla/5.0' \
-o wget_errors.txt http://the_site_i_want_to_validate


Then check wget's exit status. Any nonzero value means something went wrong, which in this setup usually indicates a broken link.

EXIT_CODE=$?   # must be read immediately after the wget command
if [ "$EXIT_CODE" -ne 0 ]; then
    echo "ERROR: Found broken link(s)"
    exit 1
fi
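
A nonzero status does not always mean a broken link. The GNU wget manual lists distinct exit statuses, where 8 means the server issued an error response (such as a 404) and 4 means a network failure. Here is a minimal sketch of a more specific check, assuming those documented codes apply to your wget version; the function name describe_wget_status is hypothetical.

```shell
# Hypothetical helper: map a wget exit status to a short description.
# The codes are taken from the "Exit Status" section of the GNU wget manual.
describe_wget_status() {
    case "$1" in
        0) echo "OK: no errors" ;;
        4) echo "ERROR: network failure" ;;
        8) echo "ERROR: server issued an error response (possible broken link)" ;;
        *) echo "ERROR: wget failed with status $1" ;;
    esac
}
```

Calling describe_wget_status "$EXIT_CODE" right after the wget command would then tell a dead link apart from, say, a DNS problem.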


To find the offending links, grep for 404 in the wget log.

BROKEN_LINKS=$(grep -B 2 '404' wget_errors.txt)


The -B 2 option also prints the two lines above each matching line, which in this case include the URL of the broken link.
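
If you only want the URLs rather than the surrounding log lines, awk can remember the most recent request line and print its URL when a 404 follows. This is a sketch under an assumption about the log layout: each request starts with a line of the form "--<timestamp>--  <URL>", and the status line contains " 404 Not Found". The function name list_broken_urls is hypothetical.

```shell
# Hypothetical helper: print the unique URLs that received a 404.
# Assumes each request in the log starts with a line like
#   --<timestamp>--  <URL>
# and is later followed by a status line containing " 404 Not Found".
list_broken_urls() {
    awk '/^--/ { url = $NF }
         / 404 Not Found/ { print url }' "$1" | sort -u
}
```

For example, list_broken_urls wget_errors.txt would print one broken URL per line. If your wget writes its log in a different layout, the two patterns need adjusting.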

9 comments:

  1. I've noticed that very often when I start to think of a hack I should code, after googling around for some time I realize my Linux box already has the tools necessary for the job installed :)

  2. If you are stuck on a Windows box and have the option to install PowerShell you can use Select-String to get similar results as the 'grep -B' command above.

    Use a Windows version of wget with the same switches as above.

    From the PowerShell command line (at the log file path):
    >Get-Content .\wget_errors.txt | Select-String -context 2,0 -pattern " 404 Not Found" -allmatches

    In this example, the -context switch returns the 2 lines above the matching line (and zero lines below it).

    If you just want the list of URLs, you can do additional filtering using regular expressions instead of a simple text pattern. If you redirected the output from the previous command to a file called 404.txt, you could run:

    >Get-Content .\404.txt | Select-String -pattern "http://\S+" -allmatches | select matches

  3. Replies
    1. I believe so unless you specify the --no-parent option.

    2. I think he means the page that links to the broken link.

    3. If the link is broken, then there is no link between the two pages. There is no way to figure out what the intended link URL should've been.

    4. ...but it would be nice to know which page contains the broken link so it can be fixed.

    5. If you own the site, check the server logs. Each request will have a referrer, which is the page that linked to the broken link. If it's an external site, it becomes harder.

  4. If you own the site, you can simply run the above script over all the pages to find the pages with broken links.

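
The per-page idea raised in the comments can be sketched roughly as follows: spider each page one level deep on its own, so any failure points back to that page. This assumes you already have a list of page URLs. The function name pages_with_errors is hypothetical, and because a nonzero wget status can also come from network problems, the output is best treated as a list of candidates to inspect.

```shell
# Hypothetical helper: spider each page given as an argument one level deep
# and print every page for which wget returned a nonzero exit status.
pages_with_errors() {
    for page in "$@"; do
        if ! wget -q --spider -r -l 1 --header='User-Agent: Mozilla/5.0' \
                "$page"; then
            echo "$page"
        fi
    done
}
```

For example, pages_with_errors http://the_site_i_want_to_validate/a.html http://the_site_i_want_to_validate/b.html would print only the pages that contain at least one failing link, or that fail to load themselves.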