LinkChecker is a free, GPL-licensed URL validator.
To check a URL like http://www.myhomepage.org/, it is enough to run linkchecker http://www.myhomepage.org/. This checks the complete domain of www.myhomepage.org recursively. All links pointing outside of the domain are also checked for validity.
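For example, a plain check and one with a limited recursion depth could look like this (www.myhomepage.org is a placeholder; the --recursion-level option, assuming your LinkChecker version provides it, limits how many levels deep links are followed):
$ linkchecker http://www.myhomepage.org/
$ linkchecker --recursion-level=2 http://www.myhomepage.org/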
All URLs have to pass a preliminary syntax test. Minor quoting mistakes (for example, an unencoded space) issue a warning; all other invalid syntax is an error. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.
HTTP links (http:, https:)
After connecting to the given HTTP server, the given path or query is requested. All redirections are followed, and if a user name and password are given, they are used for authorization when necessary. Permanently moved pages issue a warning. All final HTTP status codes other than 2xx are errors.
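A password-protected area could be checked like this (a sketch only: the --user and --password options are assumed to be available in your LinkChecker version, see linkchecker --help for the exact names, and www.example.com is a placeholder; --password typically prompts for the password on the console):
$ linkchecker --user=myname --password http://www.example.com/private/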
Local files (file:)
A regular, readable file that can be opened is valid. A readable directory is also valid. All other files, for example device files, unreadable or non-existing files, are errors.
File contents are parsed and checked for links to recurse into.
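For example, a local HTML tree can be checked with file: URLs (the paths below are placeholders):
$ linkchecker file:///var/www/html/index.html
$ linkchecker file:///var/www/html/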
Mail links (mailto:)
A mailto: link eventually resolves to a list of email addresses. If one address fails, the whole list will fail. For each mail address we check the following things:
FTP links (ftp:)
For FTP links we do:
Telnet links (telnet:)
We try to connect and, if a user name and password are given, log in to the given telnet server.
NNTP links (news:, snews:, nntp:)
We try to connect to the given NNTP server. If a newsgroup or article is specified, we try to request it from the server.
Ignored links (javascript:, etc.)
An ignored link will only print a warning. No further checking is done.
Here is a complete list of recognized but ignored link types. The most prominent of them are JavaScript links.
Before descending recursively into a URL, it has to fulfill several conditions. They are checked in this order:
Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.
Q: LinkChecker produced an error, but my web page is ok with Mozilla/IE/Opera/... Is this a bug in LinkChecker?
A: Please check your web pages first. Are they really ok? Use the --check-html option, or check whether you are using a proxy that produces the error.
Q: I still get an error, but the page is definitely ok.
A: Some servers deny access to automated tools (also called robots) such as LinkChecker. This is not a bug in LinkChecker but rather a policy of the webmaster running the website you are checking. Look at the site's /robots.txt file, which follows the robots.txt exclusion standard.
Q: How can I tell LinkChecker which proxy to use?
A: LinkChecker works transparently with proxies. In a Unix or Windows environment, set the http_proxy, https_proxy or ftp_proxy environment variable to a URL that identifies the proxy server before starting LinkChecker. For example:
$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy
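On Windows, the equivalent is to set the variable in the command shell before running LinkChecker:
C:\> set http_proxy=http://www.someproxy.com:3128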
Q: The link “mailto:john@company.com?subject=Hello John” is reported as an error.
A: You have to quote special characters (e.g. spaces) in the subject field. The correct link should be “mailto:...?subject=Hello%20John”. Unfortunately, browsers like IE and Netscape do not enforce this.
Q: Does LinkChecker have JavaScript support?
A: No, and it never will. If your page does not work without JavaScript, it is better checked with a browser testing tool like Selenium.
Q: Is LinkChecker's cookie feature insecure?
A: If a cookie file is specified, the information in it will be sent to the specified hosts. The following restrictions apply to LinkChecker cookies:
Q: I see LinkChecker gets a /robots.txt file for every site it checks. What is that about?
A: LinkChecker follows the robots.txt exclusion standard. To avoid misuse of LinkChecker, you cannot turn this feature off. See the Web Robot pages and the Spidering report for more info.
Q: How do I print unreachable/dead documents of my website with LinkChecker?
A: This is not possible. It would require file system access to your web repository and access to your web server configuration.
Q: How do I check HTML/XML/CSS syntax with LinkChecker?
A: Use the --check-html and --check-css options.
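For example (www.example.com is a placeholder; the options are the ones named above):
$ linkchecker --check-html --check-css http://www.example.com/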