Configuring the cache agent for automatic refreshing and preloading

Most caching proxy servers cache a file only after a user requests it. Caching Proxy has a cache agent that provides automatic cache preloading. You can specify that the cache agent automatically retrieves specified URLs, the most popular URLs, or both, and places them in the cache before they are requested.

In some cases, you need to set the host name of the proxy server and identify the cache access log before the cache is preloaded. To configure the cache agent, in the Configuration and Administration forms, select Cache Configuration and use the Cache Preload and Cache Refresh forms. Files that represent query results (that is, files whose URLs include the question mark character, ?) are cached only if query caching is enabled.

Automatic cache refreshing and preloading provides the following advantages:
  • Caching is applied to specified URLs before a user requests the pages.
  • The cache is populated before the server becomes busy with user activity.
  • Current files are supplied to users more quickly from the cache than if they were fetched on the first request.
Disadvantages include the following:
  • The proxy server is busy caching pages even during hours of low user activity.
  • You must exercise some control over what is automatically loaded. Loading linked files from high-level pages, such as web indexes and search sites, can generate requests for many pages.

For optimal efficiency, set the cache agent to run when server activity is low and before the server becomes busy with client requests. Then, the files are ready in the cache to provide fast service the first time that a user requests them. By default, the cache agent is started every night at 3 a.m. local time.

Special considerations for reverse proxy configurations:

For security reasons, when you use a reverse proxy configuration, the Proxy http:* rule should be disabled, which is the default. (That is, this rule is commented out in the ibmproxy.conf file.) However, if the rule is disabled, the cache agent cannot successfully send requests or refresh the cache content of Caching Proxy. A 403 Forbidden By Rule error is written to the error log, and the cache refresh does not complete.

To avoid this problem, use cacheAgentService, which is an internal service that is provided by the Caching Proxy. To enable the service, put the following Service directive before any other mapping rules in the ibmproxy.conf file:
Service   /any-valid-string*  INTERNAL:cacheAgentService

The variable any-valid-string is any string that is valid and that does not conflict with other mapping rules in the ibmproxy.conf file.

Both Caching Proxy and cache agent parse the URI based on this service directive. Instead of sending the URI directly to Caching Proxy, the cache agent utility adds a prefix to the URI with the /any-valid-string pattern in the service directive.

For example, the cache agent transforms the following URI:
http://www.ibm.com/
to
/any-valid-string/http://www.ibm.com/

The cache agent sends the URI with the prefix to Caching Proxy. When Caching Proxy receives the request, it removes the prefix /any-valid-string/. If the remaining URI is a fully qualified URI, Caching Proxy directly serves the request without mapping the URI against other rules.

Additionally, the cache agent can send a relative URI to Caching Proxy. For example, if you add LoadURL /abc/ by using the previously referenced service directive in the ibmproxy.conf file, the cache agent transforms it into /any-valid-string/abc/ and sends it to Caching Proxy. Caching Proxy receives the URL, removes the prefix, maps /abc/ against other mapping rules, and handles the request if there is a match.
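The prefix handling described above can be illustrated with a minimal Python sketch. It is not Caching Proxy code: the function names and the SERVICE_PREFIX constant are hypothetical, and the check for a fully qualified URI (a scheme plus a host) is an assumption about what the proxy tests for.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Hypothetical value; stands in for /any-valid-string in the Service directive.
SERVICE_PREFIX = "/any-valid-string"


def add_service_prefix(uri: str, prefix: str = SERVICE_PREFIX) -> str:
    """Mimic the cache agent: prepend the service prefix to a URI."""
    # Relative URIs already start with "/", so avoid a doubled slash.
    if uri.startswith("/"):
        return prefix + uri
    return prefix + "/" + uri


def strip_service_prefix(request_uri: str, prefix: str = SERVICE_PREFIX) -> Tuple[str, bool]:
    """Mimic the proxy: remove the prefix and report whether the rest is fully qualified."""
    if not request_uri.startswith(prefix + "/"):
        raise ValueError("request does not match the service directive pattern")
    rest = request_uri[len(prefix) + 1:]
    parts = urlsplit(rest)
    if parts.scheme and parts.netloc:
        # Fully qualified: served directly, with no further rule mapping.
        return rest, True
    # Relative: restore the leading slash so the path can be mapped against other rules.
    return "/" + rest, False


print(add_service_prefix("http://www.ibm.com/"))       # /any-valid-string/http://www.ibm.com/
print(strip_service_prefix("/any-valid-string/abc/"))  # ('/abc/', False)
```

The boolean in the returned pair marks the two paths described above: a fully qualified URI is served directly, and a relative URI goes on to the remaining mapping rules.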

Setting the server host name

On Linux and UNIX operating systems, specify the host name of the proxy server whose cache is being preloaded or refreshed. On Windows operating systems, specify the host name only if the proxy server being refreshed is not on the local machine. (Refreshing a remote server's cache based on its most frequently accessed files is not possible, because the local cache agent does not have access to the remote server's cache access log.)

To set the host name of the proxy server, in the Configuration and Administration forms, select Cache Configuration –> Cache Refresh: Identify cache destination server.

Preloading the cache with specific files

To preload the cache with the content stored at specific URLs, in the Configuration and Administration forms, use Cache Configuration –> Cache Preload. In this form, you can specify URLs for the cache agent to load. The proxy retrieves those pages when the cache agent starts, regardless of whether they were in the cache previously. (These URLs are specified in the proxy configuration file by the LoadURL directive.) You can also use this form to define URLs whose content is never cached. Access to a cache access log is not required for this type of cache preloading.

Use the Cache Preload form to configure the following options:
  • Refresh the cache daily—Check this box if you want the cache agent to refresh the cache every night. If you do not want to start the cache agent, make sure that this box is not checked.
  • Cache refresh time—If you want the cache agent to run at a time other than 3:00 a.m. local time, specify when you want it to start.
  • Cache Contents—In the URL or IP Address field, specify the URLs to load. To exclude URLs from being preloaded, specify the URLs and click Ignore in the Cache status box.

Preloading the cache with frequently cached files

To preload the most frequently accessed pages automatically, use the Cache Configuration –> Cache Refresh form. This function requires a Cache Access Log for the proxy server. The most popular URLs are determined automatically from the Cache Access Log. The administrator can also specify the number of frequently accessed pages to preload in the cache. (This number is specified in the proxy configuration file by the LoadTopCached directive.)

Use the Cache Refresh form to configure the following options:
  • Refresh the cache daily—Check this box if you want the cache agent to refresh the cache every night. If you do not want to start the cache agent, make sure that this box is cleared.
  • Cache refresh time—If you want the cache agent to run at a time other than 3:00 a.m., specify the hour and minute when you want it to start.
  • Identify cache destination server—Use this option if you want to refresh a server other than the local machine. (You cannot refresh a remote server's cache based on the frequency of access to specific files.)
  • Cache the most popular URLs—Specify the number of URLs to cache from the previous night's cache access log.
  • Load linked pages—Use this setting to configure delving (see the following section for details on delving). Set the number of levels to delve, and whether to delve for all pages (always), no pages (never), administrator-specified pages only (admin), or popular pages only (topn). Also, specify whether to delve across hosts, whether to delay between requests, and whether to cache inline images.
  • Number of threads—Set the maximum number of threads to use for cache refreshing.
  • Maximum work queue depth—Set the maximum queue for URLs to request.
  • Maximum URLs to request—Set the maximum number of pages to load. This number is checked before delving page retrieval begins.
  • Maximum time—Set the maximum time to run the cache agent. If this time is set to 0 hours 0 minutes, the cache agent runs to completion.

Delving

Delving is an optional part of the automatic cache refresh feature. Most web pages have links to other pages with related information, and users often follow the path linking from one page to another and from one site to another. Delving is a way to cache these logical information paths. In delving, the cache agent follows a specified level of hypertext (HTML) links on the pages it is loading, and also caches all of those linked pages. The linked pages can exist on the same host as the source page or on other hosts. An illustration is shown in Figure 1.
Figure 1. Delving

To control the delving process, the administrator specifies to the cache agent a maximum number of URLs that it can load (the default setting is 2000), a maximum length of time it can run (the default setting is 2 hours), and a maximum number of threads it can use (the default setting is four). The administrator can also configure more controls. By default, delving is enabled for two levels of hierarchy and is not allowed across hosts. Additionally, a delay is inserted between requests.

The cache agent loads and then refreshes the cache in this order:
  1. It loads specific pages that the administrator specifies.
  2. It loads popular (frequently accessed) pages from the cache access log.
  3. If the maximum number of pages is not reached, more pages are loaded by delving.

The cache agent does not check whether the maximum number of pages has been reached until it starts delving across links. If the value for the maximum number of pages (called MaxURLs in the proxy configuration file) is lower than the number of pages that are retrieved in steps 1 and 2, no linked pages are retrieved.
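As an illustration only, the following Python sketch models delving as a breadth-first walk over an in-memory link map. It is not the cache agent's implementation: the delve function, its parameter names, and the links dictionary are hypothetical, and the sketch ignores threads, the per-request delay, and the time limit. It does follow the rule above that the starting pages are always loaded and that the maximum is checked only once linked pages are being followed.

```python
from collections import deque
from urllib.parse import urlsplit


def delve(start_urls, links, max_depth=2, across_hosts=False, max_urls=2000):
    """Return the URLs loaded, in order.

    `links` maps a URL to the list of URLs that its page links to.
    """
    # Steps 1 and 2: the starting pages are always loaded (duplicates dropped).
    loaded = list(dict.fromkeys(start_urls))
    seen = set(loaded)
    queue = deque((url, 0) for url in loaded)

    # Step 3: follow links level by level until a limit is reached.
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links below the configured depth
        for target in links.get(url, []):
            if len(loaded) >= max_urls:
                return loaded  # page limit reached: stop delving
            if target in seen:
                continue
            if not across_hosts and urlsplit(target).netloc != urlsplit(url).netloc:
                continue  # link points to another host
            seen.add(target)
            loaded.append(target)
            queue.append((target, depth + 1))
    return loaded


# Hypothetical link map used only for this example.
links = {
    "http://a.example/": ["http://a.example/1", "http://b.example/x"],
    "http://a.example/1": ["http://a.example/2"],
    "http://a.example/2": ["http://a.example/3"],
}
print(delve(["http://a.example/"], links))
# ['http://a.example/', 'http://a.example/1', 'http://a.example/2']
```

With the defaults used here (two levels, same host only), the page on b.example and the third-level page are skipped.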

The following examples show how the cache agent handles cache refresh priorities and delving, relative to the maximum number of URLs that are specified (assume that delving is configured for all of these examples).
Example 1
Configuration file settings:
    LoadURL http://www.getthis.com/main.html
    LoadURL http://www.getmetoo.com/welcome.htm
    LoadTopCached 30
    MaxURLs 50
Result: If the Cache Access Log has more than 30 unique URLs, the cache agent retrieves main.html, welcome.htm, and the top 30 requested URLs based on the cache access log. Because it has not reached the MaxURLs value, it retrieves and loads up to 18 linked URLs from pages already cached.

Example 2
Configuration file settings:
    LoadURL http://ww.joesmith.edu/favorites.html
    LoadURL http://www.janesmith.edu/dislikes.html
    LoadTopCached 30
    MaxURLs 25
Result: If the cache access log has more than 30 unique URLs, the cache agent retrieves favorites.html, dislikes.html, and the top 30 requested URLs from the cache access log. No other files are retrieved because the value in MaxURLs has been exceeded.

Example 3
Configuration file settings:
    LoadURL http://www.hello.com/hi.htm
    LoadURL http://www.ballyhoo.com/index.html
    LoadTopCached 20
    MaxURLs 25
Result: If the cache access log has more than 20 unique URLs, the cache agent retrieves hi.htm, index.html, the top 20 requested URLs from the cache access log, and up to 3 linked URLs from the earlier pages. No other files are retrieved because the value in MaxURLs has been reached.
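The arithmetic behind these three results can be checked with a short Python sketch. The function name linked_url_budget is hypothetical. The sketch assumes that the cache access log holds at least LoadTopCached unique URLs and that none of them duplicates a LoadURL entry.

```python
def linked_url_budget(load_urls: int, load_top_cached: int, max_urls: int) -> int:
    """Return how many linked URLs can still be loaded by delving.

    load_urls       -- number of LoadURL entries
    load_top_cached -- value of LoadTopCached
    max_urls        -- value of MaxURLs
    """
    # Pages from LoadURL and LoadTopCached are loaded before MaxURLs is checked.
    already_loaded = load_urls + load_top_cached
    # Delving gets only what is left; never a negative number.
    return max(0, max_urls - already_loaded)


print(linked_url_budget(2, 30, 50))  # 18
print(linked_url_budget(2, 30, 25))  # 0
print(linked_url_budget(2, 20, 25))  # 3
```

In the second case, 32 pages are loaded even though MaxURLs is 25, because the limit is applied only when delving begins.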


Starting the cache agent manually

If automatic cache refreshing is enabled, the cache agent automatically runs a refresh operation at the specified time. However, you also can run the cache agent at any time from a command line.

The executable file is as follows:
  • On Linux and UNIX operating systems: /usr/sbin/cacheagt
  • On Windows operating systems: server_root\bin\cacheagt.exe

    Where server_root is the drive and directory where you installed Caching Proxy (for example, C:\Program Files\IBM\edge\cachingproxy\cp).

On Linux and UNIX operating systems, you can automatically run the cache agent at various times by using the cron daemon. Jobs that are controlled by cron are specified by adding a line to the system crontab file. An example crontab entry on Linux and UNIX operating systems is:
45 16 * * * /usr/sbin/cacheagt
This command example starts the cache agent every day at 4:45 p.m. local time. You can use multiple entries to run the cache agent more than once, if needed. For more information, see your operating system's documentation about the cron daemon.
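If the cache agent needs to run several times a day, the entries can be generated rather than typed by hand. The following Python sketch is one way to do that. The crontab_entry helper is hypothetical, and it produces lines in the same five-field format as the example above.

```python
def crontab_entry(hour: int, minute: int, command: str = "/usr/sbin/cacheagt") -> str:
    """Build a crontab line that runs `command` every day at hour:minute (24-hour clock)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Field order: minute, hour, day of month, month, day of week, command.
    return f"{minute} {hour} * * * {command}"


# Two runs per day: 3:00 a.m. and 4:45 p.m. local time.
for hour, minute in [(3, 0), (16, 45)]:
    print(crontab_entry(hour, minute))
# 0 3 * * * /usr/sbin/cacheagt
# 45 16 * * * /usr/sbin/cacheagt
```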

When using a cron daemon to run the cache agent, remember to turn off the automatic refresh option, either by using the Cache Configuration –> Cache Refresh configuration form or by editing the proxy configuration file. Otherwise, the cache agent runs more than once each day.





Last updated: March 23, 2018 0:18
File name: cacheagent.html