<a>
links. Here is how to write a simple script to check the site. First, pretend to be a Mozilla-based browser and spider the site to the depth of one level:
wget --spider -r -l 1 --header='User-Agent: Mozilla/5.0' \
-o wget_errors.txt http://the_site_i_want_to_validate
Then check wget's exit status to determine whether anything went wrong. A status greater than zero indicates an error; wget 1.12 and later returns 8 when the server issued an error response such as a 404.
EXIT_CODE=$?
if [ $EXIT_CODE -gt 0 ]; then
echo "ERROR: Found broken link(s)"
exit 1
fi
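As a rough sketch, the check above can be made a little more specific by mapping the exit status to a label. The function name classify_wget_exit is made up for illustration; the status values 0, 4 and 8 are the ones listed in the "Exit Status" section of the wget manual for version 1.12 and later, and older versions may behave differently.

```shell
# Map a wget exit status to a short label (hypothetical helper).
# Assumes wget 1.12 or later, where exit statuses are documented.
classify_wget_exit() {
    case "$1" in
        0) echo "ok" ;;
        4) echo "network-failure" ;;
        8) echo "server-error-response" ;;   # e.g. at least one 404
        *) echo "other-error" ;;
    esac
}

# Usage, commented out so the sketch runs without network access:
# wget --spider -r -l 1 -o wget_errors.txt http://the_site_i_want_to_validate
# classify_wget_exit $?
```

Only status 8 points at broken links; a status of 4 would more likely mean the site could not be reached at all.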
To find out the actual links in question, just grep for 404 in the wget error log.
BROKEN_LINKS=$(grep -B 2 '404' wget_errors.txt)
The -B 2 option outputs the 2 lines above any matching line, which in this case contain the broken link in question.
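If only the URLs are wanted, the grep output can be filtered a second time. The following is a minimal sketch, not a definitive implementation: the function name extract_broken_urls and the sample log are invented for illustration, and it assumes the URL appears within the two lines before the 404 response, as described above. The exact log layout can differ between wget versions.

```shell
# Print each distinct URL that appears just before a 404 response
# in a wget log. $1 is the path to the log written by wget -o.
extract_broken_urls() {
    grep -B 2 ' 404 ' "$1" \
        | grep -oE 'https?://[^[:space:]]+' \
        | sort -u
}

# Hypothetical sample log, for illustration only.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
--2013-01-01 10:00:00--  http://example.com/ok.html
Reusing existing connection to example.com:80.
HTTP request sent, awaiting response... 200 OK
--2013-01-01 10:00:01--  http://example.com/missing.html
Reusing existing connection to example.com:80.
HTTP request sent, awaiting response... 404 Not Found
EOF

extract_broken_urls "$sample_log"   # prints http://example.com/missing.html
rm -f "$sample_log"
```

Matching ' 404 ' with surrounding spaces is a small precaution against hits on timestamps or byte counts that happen to contain 404.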
I've noticed that very often when I start to think of a hack I should code, after googling around for some time I realize my Linux box already has the tools necessary for the job installed :)
If you are stuck on a Windows box and have the option to install PowerShell, you can use Select-String to get results similar to the 'grep -B' command above.
Use a Windows version of wget with the same switches as above.
From the PowerShell command line (at the log file path):
>Get-Content .\wget_errors.txt | Select-String -Context 2,0 -Pattern " 404 Not Found" -AllMatches
In this example the -Context switch returns the 2 lines above each matching line (and zero lines below it).
If you just want the list of URLs, you can filter further using a regular expression instead of a simple text pattern. If you redirected the output of the previous command to a file called 404.txt, you could run:
>Get-Content .\404.txt | Select-String -pattern "http://\S+" -allmatches | select matches
Can you get the parent page?
I believe so unless you specify the --no-parent option.
I think he means the page that links to the broken link.
If the link is broken, then there is no link between the two pages. There is no way to figure out what the intended link URL should've been.
...but it would be nice to know which page contains the broken link so it can be fixed.
If you own the site, check the server logs: the entry for each 404 should include a referrer, which is the page that linked to the broken URL. If the linking page is on an external site, it becomes harder.
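The referrer idea can be sketched roughly as follows. This assumes the web server writes the common "combined" access log format, where the request, referrer and user agent are each enclosed in double quotes; the function name list_404_referrers and the log path passed to it are hypothetical, and other log formats would need different field handling.

```shell
# List "referring page -> missing path" pairs for 404 responses.
# Assumes the "combined" access log format; $1 is the log file path.
list_404_referrers() {
    awk -F'"' '{
        split($2, req, " ")   # req[2] is the requested path
        split($3, st, " ")    # st[1] is the HTTP status code
        if (st[1] == "404" && $4 != "" && $4 != "-")
            print $4 " -> " req[2]
    }' "$1" | sort -u
}

# Usage (the path is an assumption; adjust for your server):
# list_404_referrers /var/log/apache2/access.log
```

Entries whose referrer is "-" are skipped, since they carry no information about the linking page.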
If you own the site, you can simply run the above script over all the pages to find the pages with broken links.