Delving

Delving is an optional part of the automatic cache refresh feature. Most Web pages have links to other pages with related information, and users often follow the path linking from one page to another and from one site to another. Delving is a way to cache these logical information paths. In delving, the cache agent follows the hypertext (HTML) links on the pages it is loading, down to a specified number of levels, and also caches all of those linked pages. The linked pages can reside on the same host as the source page or on other hosts. An illustration is shown in Figure 1.

Figure 1. Delving

To control the delving process, the administrator specifies to the cache agent a maximum number of URLs that it can load (the default setting is 2000), a maximum length of time it can run (the default setting is two hours), and a maximum number of threads it can use (the default setting is four). The administrator can also configure additional controls. By default, delving is enabled for two levels of hierarchy and is not allowed across hosts. Additionally, a delay is inserted between requests. To change these settings, see Related proxy configuration file directives.
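
The limits described above can be pictured with a minimal Python sketch of a breadth-first link walk. This is an illustration only, not the cache agent's actual implementation: the function name delve, its parameter names, and the in-memory links map (which stands in for fetching and parsing real pages) are all hypothetical, and the sketch assumes that "two levels" means links are followed up to two hops from a starting page. The request delay, run-time limit, and thread limit are left out for brevity.

```python
from collections import deque
from urllib.parse import urlparse


def delve(start_urls, links, max_depth=2, across_hosts=False, max_urls=2000):
    """Return URLs in the order a breadth-first delve would load them.

    links is a hypothetical map of page URL -> list of linked URLs.
    The default values mirror the defaults described in the text.
    """
    seen = set()
    queue = deque()
    for url in start_urls:
        if url not in seen:          # ignore duplicate starting pages
            seen.add(url)
            queue.append((url, 0))

    order = []
    while queue and len(order) < max_urls:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:       # do not follow links past the depth limit
            continue
        for target in links.get(url, []):
            if target in seen:
                continue
            same_host = urlparse(target).netloc == urlparse(url).netloc
            if not across_hosts and not same_host:
                continue             # skip links that lead to other hosts
            seen.add(target)
            queue.append((target, depth + 1))
    return order


# Hypothetical link graph used only for illustration.
links = {
    "http://a.example/": ["http://a.example/1", "http://b.example/x"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
}
print(delve(["http://a.example/"], links))
```

With the default arguments, the walk stays on a.example and stops two levels below the starting page, so http://a.example/3 and http://b.example/x are never loaded in this sketch.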

The cache agent loads and then refreshes the cache in this order:

  1. It loads specific pages that the administrator has specified.
  2. It loads popular (frequently accessed) pages from the cache access log.
  3. If the maximum number of pages is not reached at this point, additional pages are loaded by delving.

Note that the cache agent does not check whether the maximum number of pages has been reached until it starts delving across links. If the value for the maximum number of pages (called MaxURLs in the proxy configuration file) is lower than the number of pages retrieved in steps 1 and 2, no linked pages are retrieved.
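
The arithmetic behind this rule can be sketched in a few lines of Python. The helper name delve_budget and its parameters are hypothetical, and the sketch assumes that the cache access log holds at least as many unique URLs as LoadTopCached requests and that the explicitly listed pages do not overlap with the popular pages.

```python
def delve_budget(num_load_urls, load_top_cached, max_urls):
    """Return (pages loaded in steps 1 and 2, linked pages still allowed).

    Steps 1 and 2 always run in full; MaxURLs is only consulted
    when delving starts, so the remaining budget can be zero.
    """
    loaded = num_load_urls + load_top_cached
    remaining = max(0, max_urls - loaded)
    return loaded, remaining


# Two LoadURL entries, LoadTopCached 30, MaxURLs 50
print(delve_budget(2, 30, 50))   # (32, 18)
# Two LoadURL entries, LoadTopCached 30, MaxURLs 25
print(delve_budget(2, 30, 25))   # (32, 0)
```

In the second call, 32 pages are still loaded even though MaxURLs is 25, because the limit only constrains how many linked pages delving may add.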

The following examples show how the cache agent handles cache refresh priorities and delving, relative to the specified maximum number of URLs (assume that delving is configured in all of these examples).

Configuration file setting:
  LoadURL http://www.getthis.com/main.html
  LoadURL http://www.getmetoo.com/welcome.htm
  LoadTopCached 30
  MaxURLs 50
Result:
  If the cache access log has more than 30 unique URLs, the cache agent retrieves main.html, welcome.htm, and the top 30 requested URLs based on the cache access log. Because it has not reached the MaxURLs value, it retrieves and loads up to 18 linked URLs from pages already cached.

Configuration file setting:
  LoadURL http://ww.joesmith.edu/favorites.html
  LoadURL http://www.janesmith.edu/dislikes.html
  LoadTopCached 30
  MaxURLs 25
Result:
  If the cache access log has more than 30 unique URLs, the cache agent retrieves favorites.html, dislikes.html, and the top 30 requested URLs from the cache access log. No other files are retrieved because the value in MaxURLs has been exceeded.

Configuration file setting:
  LoadURL http://www.hello.com/hi.htm
  LoadURL http://www.ballyhoo.com/index.html
  LoadTopCached 20
  MaxURLs 25
Result:
  If the cache access log has more than 20 unique URLs, the cache agent retrieves hi.htm, index.html, the top 20 requested URLs from the cache access log, and up to 3 linked URLs from the earlier pages. No other files are retrieved because the value in MaxURLs has been reached.