Tech Talk: Detect broken links on a Web site using wget

Friday, August 6, 2010

Detect broken links on a Web site using wget

The other day, I started thinking about writing a simple Web site validator that detects broken links on a Web site, similar to the W3C Link Checker. After looking at various code samples in Java, Ruby, etc., I realized that GNU wget 1.12 on my Linux machine could do the job just fine, with no programming required. It even detected broken resource links in CSS, not just broken <a> links.

Here is how to write a simple script to check the site. First, pretend to be a Mozilla-based browser and spider the site to the depth of one level:


wget --spider -r -l 1 --header='User-Agent: Mozilla/5.0' \
-o wget_errors.txt http://the_site_i_want_to_validate

Then check wget's exit status to determine whether anything went wrong. A nonzero status means wget hit an error; in wget 1.12 and later, status 8 specifically means the server returned an error response, such as a 404.


# $? must be read immediately after the wget command, before anything else runs
EXIT_CODE=$?
if [ "$EXIT_CODE" -ne 0 ]; then
    echo "ERROR: Found broken link(s)"
    exit 1
fi
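
If you want the script to say why wget failed rather than only that it failed, the exit status can be translated into a message. The following is a minimal sketch, not part of the original script: the function name describe_wget_exit is made up for illustration, and the meanings are the ones listed in the "Exit Status" section of the wget manual for version 1.12 and later, so older versions may behave differently.

```shell
# Map a wget exit status to a short description.
# Assumption: wget 1.12 or later, which documents these status values.
describe_wget_exit() {
    case "$1" in
        0) echo "no problems occurred" ;;
        1) echo "generic error" ;;
        2) echo "parse error (command line or .wgetrc)" ;;
        3) echo "file I/O error" ;;
        4) echo "network failure" ;;
        5) echo "SSL verification failure" ;;
        6) echo "authentication failure" ;;
        7) echo "protocol error" ;;
        8) echo "server issued an error response" ;;
        *) echo "unknown exit status" ;;
    esac
}
```

It could be called as describe_wget_exit "$EXIT_CODE" inside the if block. A broken link normally shows up as status 8, while a status such as 4 points to a network problem rather than a bad link.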

To find out the actual links in question, just grep for 404 in the wget error log.


BROKEN_LINKS=$(grep -B 2 '404' wget_errors.txt)

The -B 2 option prints the two lines before each matching line, which in this case include the broken URL.
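
If only the URLs are needed, rather than three lines of log per hit, the log can be reduced further with awk. This is a sketch under assumptions: list_broken_urls is a hypothetical helper name, and it relies on each request in the log starting with a line of the form "--<timestamp>--  <URL>", followed later by the "404 Not Found" status line. That layout is assumed here and is worth checking against your own wget_errors.txt.

```shell
# Print the URL of every request that received a 404 response, one per line.
# Assumption: each request in the wget log begins with a line that starts
# with "--" and ends with the requested URL.
list_broken_urls() {
    awk '
        /^--/ { url = $NF }                          # remember the most recent request URL
        / 404 Not Found/ && url != "" { print url }  # report it when a 404 status follows
    ' "$1" | sort -u
}
```

Calling list_broken_urls wget_errors.txt prints each broken URL once, even if several pages link to it.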

9 comments:

TeemuT, August 7, 2011 at 4:45 AM
I've noticed that very often when I start to think of a hack I should code, after googling around for some time I realize my Linux box already has the tools necessary for the job installed :)

Anonymous, September 14, 2012 at 6:31 AM
If you are stuck on a Windows box and have the option to install PowerShell, you can use Select-String to get results similar to the 'grep -B' command above.

Use a Windows version of wget with the same switches as above.

From the PowerShell command line (at the log file path):
>Get-Content .\wget_errors.txt | Select-String -Context 2,0 -Pattern " 404 Not Found" -AllMatches

In this example, the -Context switch returns the 2 lines above the matching line (and zero lines below it).

If you just want the list of URLs, you can do additional filtering using regular expressions instead of a simple text pattern. If you redirected the output from the previous command to a file called 404.txt, you could run:

>Get-Content .\404.txt | Select-String -pattern "http://\S+" -allmatches | select matches

Unknown, January 5, 2013 at 6:42 AM
Can you get the parent page?

Alexander Yap, January 11, 2013 at 8:55 AM
If you own the site, you can simply run the above script over all the pages to find the pages with broken links.