Cache Auto-config

Client browsers can have all options configured manually, or they can be configured to download an autoconfig file (every time the browser starts up), which provides all of the information about your cache setup.

Each URL referenced (be it the URL that you typed, or the URL for a graphic on the page yet to be retrieved) is checked against the list of rules. You should keep the list of rules as short as possible, otherwise you could end up slowing down page loads - not at the cache level, but at the browser.

Web server config changes for autoconfig files

The original Netscape documentation for the proxy autoconfig file suggested the filename proxy.pac for Proxy AutoConfig files. Since it's possible to have a file ending in .pac that is not used for autoconfiguration, browsers require a server returning an autoconfig file to indicate so in the MIME type. Most web servers do not automatically recognize the .pac extension as a proxy-autoconfig file, and have to be reconfigured to return the correct MIME type (application/x-ns-proxy-autoconfig). With Apache, for example, this is a single line in the server configuration: AddType application/x-ns-proxy-autoconfig .pac

Autoconfig Script Coding

The autoconfig file is actually a JavaScript function, put in a file and served by your standard web server program. Don't panic if you don't know JavaScript, since this section acts as a cookbook. Besides, the basic structure of the JavaScript language is quite easy to get the hang of, especially if you have previous programming experience, whether it be in C, Pascal or Perl.

The Hello World! of auto-configuration scripts

If you have learned a programming language, you probably remember one of the most basic programs simply printing the phrase Hello World!. We don't want to print anything when someone tries to go to a page, but the following example is similar to the original Hello World program in that it's the shortest piece of code that does something useful.

The following simply connects directly to the origin server for every URL, just as the browser would if you had no proxy-cache configured at all.
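A minimal sketch of such a script, reconstructed from the description above rather than taken from an original listing. The browser calls the function FindProxyForURL once for each URL it is about to fetch and acts on the string that is returned.

```javascript
// Minimal proxy autoconfig script: every request goes straight to the
// origin server, as if no proxy-cache were configured.
// url  - the full URL being requested
// host - the hostname extracted from that URL
function FindProxyForURL(url, host) {
    return "DIRECT";
}
```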

The next example gets the browser to connect to the cache server named cache.domain.example on port 3128. If the machine is down for some reason, an error message will be returned to the user.
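A sketch of what that script might look like, using the server name and port mentioned above. It is a reconstruction for illustration, not an original listing.

```javascript
// Send every request to a single cache server.
// No fallback is listed, so if the cache is down the user sees an error.
function FindProxyForURL(url, host) {
    return "PROXY cache.domain.example:3128";
}
```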

The returned string can list several alternatives, separated by semicolons (;). The browser tries them in order: if the first cache server is unavailable, the second will be tried. This provides you with a failover mechanism: you can attempt a local proxy server first and, if it is down, try another proxy. If all the proxies are down and DIRECT is listed last, a direct attempt will be made. After a short period of time, the proxies will be retried.
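A failover list of this kind could be sketched as follows. The second server name, cache2.domain.example, is an assumption made purely for illustration; substitute your own backup cache.

```javascript
// Try the local cache first, then a backup cache (hypothetical name),
// and finally fall back to a direct connection.
function FindProxyForURL(url, host) {
    return "PROXY cache.domain.example:3128; " +
           "PROXY cache2.domain.example:3128; " +
           "DIRECT";
}
```

Listing DIRECT last is a design choice: it keeps users browsing when every cache is down, at the cost of bypassing the cache during the outage.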

A third return type is included, for SOCKS proxies, and is in the same format as the HTTP type:

return "SOCKS socks.domain.example:3128";

If you have no intranet, and require no exclusions, you should use the above autoconfig file. Configuring machines with the above autoconfig file allows you to add any exclusions you need in future very easily.

Auto-config functions

Web browsers include various built-in functions to make your autoconfig coding as simple as possible. You don't have to write the code that does a string match of the hostname, since you can use a standard function call to do a match. Not all functions are covered here, since some of them are very rarely used. You can find a complete list of autoconfig functions (with examples) at the proxy autoconfig file homepage.
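As an illustration, the sketch below uses three of the standard built-in functions (isPlainHostName, dnsDomainIs and shExpMatch) to exclude some requests from the cache. The local domain .domain.example and the decision to fetch FTP URLs directly are assumptions chosen for the example, not recommendations.

```javascript
function FindProxyForURL(url, host) {
    // Hostnames without any dots (for example "intranet") are local machines.
    if (isPlainHostName(host)) {
        return "DIRECT";
    }
    // Hosts inside our own domain (assumed name) are fetched without the cache.
    if (dnsDomainIs(host, ".domain.example")) {
        return "DIRECT";
    }
    // Shell-style wildcard match against the whole URL: skip the cache for FTP.
    if (shExpMatch(url, "ftp:*")) {
        return "DIRECT";
    }
    // Everything else goes through the cache, with a direct fallback.
    return "PROXY cache.domain.example:3128; DIRECT";
}
```

The rules are checked for every URL the browser fetches, so the cheapest tests come first and the list is kept short.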

Example autoconfig files

The main reason that autoconfig files were invented was the sheer number of possible cache setups. It's difficult (or even impossible) to represent all of the possible combinations that an autoconfig file can provide you with.

There is no config file that will work for everyone, so a couple of config files are included here, one of which should suit your setup.

Super Proxy Script

Many large ISPs will have more than one cache server. To avoid duplicating objects, these cache servers have to communicate with one another. Consider the following:

cache1 gets a request for an object. It caches the page, and stores it on disk. An hour or so later, cache2 gets a request for the same page. To find a local copy of the object, cache2 has to query the other caches. Add more and more caches, and your number of queries goes up.

If an incoming request for a specific URL only ever went to one cache, your caches would not need to communicate with one another. A client requesting the page http://www.qualica.com/ would always connect to cache1.

Let's assume that you have 5 caches. Splitting the Internet into five pieces would split the load across the caches almost evenly. How do you split, though? By destination IP address? No, since IPs like 19?.*.*.* are much more common than IPs like 5.*.*.*. By domain? No again, since a single very popular domain like microsoft.com would send a disproportionate share of the requests to one cache.

Some of you will know what a hash function is. If not, don't panic: you can still use the Super Proxy script without knowing the theoretical basis of the algorithms involved.

The Super Proxy Script allows you to split up the Internet by URL (the combination of hostname, path and filename). If you have 5 cache servers, you split up the domain of possible answers into 5 parts. (A hash function returns a number, and "domain" here means the set of numbers it can return, not an Internet domain.) With a good hashing function, the numbers returned are going to be spread across the 5 parts evenly, which spreads your load almost evenly too.
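The idea can be sketched in a few lines. This is not the Super Proxy Script itself and does not use its hash: the helper hashString is a deliberately simple stand-in, and the five server names are hypothetical.

```javascript
// Hypothetical list of five cache servers.
var caches = [
    "cache1.domain.example:3128",
    "cache2.domain.example:3128",
    "cache3.domain.example:3128",
    "cache4.domain.example:3128",
    "cache5.domain.example:3128"
];

// Simple illustrative string hash (an assumption, not the real algorithm).
// Returns a non-negative 32-bit integer.
function hashString(s) {
    var h = 0;
    for (var i = 0; i < s.length; i++) {
        h = (h * 31 + s.charCodeAt(i)) >>> 0;
    }
    return h;
}

// The same URL always hashes to the same number, so it always maps to
// the same cache and the caches never need to query one another.
function FindProxyForURL(url, host) {
    var index = hashString(url) % caches.length;
    return "PROXY " + caches[index] + "; DIRECT";
}
```

A production script would want a better-mixing hash than this one, since the quality of the hash determines how evenly the load is spread.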

If you have a cache which is twice as powerful as your others, you can allocate it a larger share of the domain, which puts more load on it.
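One way to give a stronger cache a larger share is to assign each cache a weight and divide the hash values in proportion. The sketch below is an assumption-laden illustration: the server names, the weights and the simple hashString helper are all hypothetical, and real CARP implementations weight their caches differently.

```javascript
// Hypothetical caches; cache1 is assumed to be twice as powerful.
var weightedCaches = [
    { server: "cache1.domain.example:3128", weight: 2 },
    { server: "cache2.domain.example:3128", weight: 1 },
    { server: "cache3.domain.example:3128", weight: 1 }
];

// Simple illustrative string hash (not the real algorithm).
function hashString(s) {
    var h = 0;
    for (var i = 0; i < s.length; i++) {
        h = (h * 31 + s.charCodeAt(i)) >>> 0;
    }
    return h;
}

// Map the hash onto one slot per unit of weight, then walk the list
// to find the cache that owns that slot.
function pickCache(url) {
    var total = 0;
    for (var i = 0; i < weightedCaches.length; i++) {
        total += weightedCaches[i].weight;
    }
    var slot = hashString(url) % total;
    for (var j = 0; j < weightedCaches.length; j++) {
        if (slot < weightedCaches[j].weight) {
            return weightedCaches[j].server;
        }
        slot -= weightedCaches[j].weight;
    }
    return weightedCaches[0].server; // not reached; kept as a safe default
}

function FindProxyForURL(url, host) {
    return "PROXY " + pickCache(url) + "; DIRECT";
}
```

With these weights there are four slots, two of which belong to cache1, so it receives roughly half of the URLs while the other two caches receive about a quarter each.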

CARP (the Cache Array Routing Protocol) is used by some cache servers (most notably Microsoft Proxy and Squid) to decide which parent cache to send a request to. Browsers can also use CARP to decide which cache to talk to, using a JavaScript auto-config script. For more information (and an example script), you should look at the Super Proxy Script web page.