Client browsers can have all options configured manually, or they can be configured to download an autoconfig file (each time the browser starts up), which provides all of the information about your cache setup.
Each URL referenced (be it the URL that you typed, or the URL for a graphic on the page yet to be retrieved) is checked against the list of rules. You should keep the list of rules as short as possible, otherwise you could end up slowing down page loads - not at the cache level, but at the browser.
The original Netscape documentation for the proxy autoconfig file suggested the filename proxy.pac for Proxy AutoConfig files. Since it's possible to have a file ending in .pac that is not used for autoconfiguration, browsers require a server returning an autoconfig file to indicate so in the mime type. Most web servers do not automatically recognize the .pac extension as a proxy-autoconfig file, and have to be reconfigured to return the correct mime type (application/x-ns-proxy-autoconfig).
On some systems Apache already defines the autoconfig mime type.
The Apache config file mime.types is used to associate filename extensions with mime types. This file is normally stored in the Apache conf directory. This directory also contains the access.conf and httpd.conf files, which you may be more familiar with editing. The mime.types file consists of two fields: a mime type on the left, the associated filename extension on the right. Since this file is only read at startup or reconfigure, you will need to send a HUP signal to the parent Apache process for your changes to take effect. The following line should be added to the file, assuming that it is not already included:
application/x-ns-proxy-autoconfig pac
The autoconfig file is actually a JavaScript function, put in a file and served by your standard web server program. Don't panic if you don't know JavaScript, since this section acts as a cookbook. Besides, the basic structure of the JavaScript language is quite easy to get the hang of, especially if you have previous programming experience, whether it be in C, Pascal or Perl.
If you have learned a programming language, you probably remember one of the most basic programs simply printing the phrase Hello World!. We don't want to print anything when someone tries to go to a page, but the following example is similar to the original Hello World program in that it's the shortest piece of code that does something useful.
The following simply connects direct to the origin server for every URL, just as it would if you had no proxy-cache configured at all.
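A minimal sketch of such a file is shown here. It relies on the standard autoconfig entry point: a function named FindProxyForURL, which the browser calls with the url and host of every request, and whose return value tells the browser how to connect.

```javascript
// Minimal autoconfig file: never use a proxy.
function FindProxyForURL(url, host) {
    // "DIRECT" tells the browser to contact the origin server itself.
    return "DIRECT";
}
```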
The next example gets the browser to connect to the cache server named cache.domain.example on port 3128. If the machine is down for some reason, an error message will be returned to the user.
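A sketch of what such a file could look like, using the cache name and port mentioned above (substitute your own):

```javascript
// Send every request to a single cache server.
function FindProxyForURL(url, host) {
    // No fallback is listed, so the user sees an error if the cache is down.
    return "PROXY cache.domain.example:3128";
}
```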
As you may be able to guess from the above, a semicolon (;) in the returned text splits the answer into sub-strings, which the browser tries in order. If the first cache server is unavailable, the second will be tried. This provides you with a failover mechanism: you can attempt a local proxy server first and, if it is down, try another proxy. If all are down and the list ends with DIRECT, a direct attempt will be made. After a short period of time, the proxy will be retried.
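A failover list of this kind might be written as follows. The names cache1.domain.example and cache2.domain.example are hypothetical, chosen only for illustration.

```javascript
// Try two caches in order, then fall back to a direct connection.
function FindProxyForURL(url, host) {
    return "PROXY cache1.domain.example:3128; " +
           "PROXY cache2.domain.example:3128; " +
           "DIRECT";
}
```

Leaving DIRECT off the end of the list means that users get an error, rather than a direct connection, when both caches are down.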
A third return type is included, for SOCKS proxies, and is in the
same format as the HTTP type:
return "SOCKS socks.domain.example:3128";
If you have no intranet, and require no exclusions, you should use the above autoconfig file. Configuring machines with the above autoconfig file allows you to add future required exclusions very easily.
Web browsers include various built-in functions to make your autoconfig coding as simple as possible. You don't have to write the code that does a string match of the hostname, since you can use a standard function call to do a match. Not all functions are covered here, since some of them are very rarely used. You can find a complete list of autoconfig functions (with examples) at the proxy autoconfig file homepage.
The dnsDomainIs function checks if a host is in a domain. It returns true if the first argument (normally specified as the variable host, which is defined in the autoconfig function by default) is in the domain specified in the second argument.
You can check more than one domain by using the JavaScript || (or) operator. Since this is a standard JavaScript operator, you can combine any of the function calls described here in the same way.
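As a sketch, two domain checks can be chained like this. The domain names are hypothetical; dnsDomainIs itself is supplied by the browser and does not need to be defined in your file.

```javascript
function FindProxyForURL(url, host) {
    // Go direct for hosts in either of our own (hypothetical) domains.
    if (dnsDomainIs(host, ".domain.example") ||
        dnsDomainIs(host, ".another.example")) {
        return "DIRECT";
    }
    // Everything else goes through the cache.
    return "PROXY cache.domain.example:3128; DIRECT";
}
```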
Sometimes you will wish to check if a host is in your local IP address range. To do this, the browser resolves the name to find the IP address. Do not use more than one isInNet call if you can help it: each call causes the browser to resolve the hostname all over again, which takes time. A string of these calls can reduce browser performance noticeably.
The isInNet function takes three arguments: the hostname, and a subnet/netmask pair.
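A short sketch of a single isInNet call follows. The 10.0.0.0 network with a 255.255.255.0 netmask is an assumed local range; replace it with your own.

```javascript
function FindProxyForURL(url, host) {
    // One isInNet call only: each call makes the browser resolve host.
    if (isInNet(host, "10.0.0.0", "255.255.255.0")) {
        return "DIRECT";
    }
    return "PROXY cache.domain.example:3128";
}
```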
The isPlainHostName function simply checks that there is no full stop in the hostname (the only argument for this call). Many people refer to local machines simply by hostname, since the resolver library will automatically attempt to look up host.domain.example if you simply attempt to connect to host. For example: typing www in your browser should bring up your web site.
Many people connect to internal web servers (such as one sitting on their co-worker's desk) by typing in the hostname of the machine. These connections should not pass through the cache server, so many people use a function like the following:
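Such a function might look like the sketch below. isPlainHostName is provided by the browser; the cache name is the one used elsewhere in this section.

```javascript
function FindProxyForURL(url, host) {
    // Names without a full stop (such as "www") are local machines.
    if (isPlainHostName(host)) {
        return "DIRECT";
    }
    return "PROXY cache.domain.example:3128; DIRECT";
}
```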
The myIpAddress function returns the IP address of the machine that the browser is running on, and requires no arguments.
On a network with more than one cache, your script can use this information to decide which cache to communicate with. In the next subsection we look at different ways of communicating with a local proxy (with minimal manual user intervention), so the example here is comparatively basic. The example below assumes that you have at least two networks: one with a private address range (10.0.0.*), the others with real IP addresses.
If the client machine is in the private address range, it cannot connect directly to the destination server, so if the cache is down for some reason it cannot access the Internet. A machine with a real IP address, on the other hand, should attempt to connect directly to the origin server if the cache is down.
Since myIpAddress requires no arguments, we can simply place it in where we would have put host in the isInNet function call.
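The idea can be sketched as follows, assuming the private range is 10.0.0.0 with a 255.255.255.0 netmask and a single cache named cache.domain.example. Treat it as an illustration rather than a tested production file.

```javascript
function FindProxyForURL(url, host) {
    // myIpAddress() takes the place of host in the isInNet call.
    if (isInNet(myIpAddress(), "10.0.0.0", "255.255.255.0")) {
        // Private addresses cannot reach the Internet directly,
        // so there is no point in listing DIRECT as a fallback.
        return "PROXY cache.domain.example:3128";
    }
    // Machines with real IP addresses go direct if the cache is down.
    return "PROXY cache.domain.example:3128; DIRECT";
}
```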
The shExpMatch function accepts two arguments: a string and a shell expression. Shell expressions are similar to regular expressions, though they are more limited. This function is often used to check if the url or host variables have a specific word in them.
If you are configuring an ISP-wide script, this function can be quite useful. Since you do not know if a customer will call their machine "intranet" or "intra" or "admin", you can chain many shExpMatch checks together. Note that the example below uses a single "intra*" shell expression to match both "intranet" and "intra.mydomain.example".
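A sketch of such a chain of checks is shown here. The patterns are only examples of the generic names mentioned above; shExpMatch is supplied by the browser.

```javascript
function FindProxyForURL(url, host) {
    // "intra*" matches "intra", "intranet" and "intra.mydomain.example".
    if (shExpMatch(host, "intra*") ||
        shExpMatch(host, "admin*")) {
        return "DIRECT";
    }
    return "PROXY cache.domain.example:3128; DIRECT";
}
```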
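The url.substring function doesn't take the same form as those described above: it is called on the url string itself. Since Squid does not support all possible protocols, you need a way of comparing the first few characters of the destination URL with the list of possible protocols. The function has two arguments. The first is the starting position, the second the position at which to stop (the character at that position is not included). When the starting position is 0, the second argument is therefore simply the number of characters to retrieve. Note that (as in C) strings start at position 0, rather than at 1.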
All of this is best demonstrated with an example. The following attempts to connect to the cache for the most common URL types (http, ftp and gopher), but attempts to go directly for protocols that Squid doesn't recognize.
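A sketch of this protocol check, using the cache name from earlier in this section:

```javascript
function FindProxyForURL(url, host) {
    // Compare the start of the URL with each protocol the cache handles.
    if (url.substring(0, 5) == "http:" ||
        url.substring(0, 4) == "ftp:" ||
        url.substring(0, 7) == "gopher:") {
        return "PROXY cache.domain.example:3128; DIRECT";
    }
    // Anything else bypasses the cache.
    return "DIRECT";
}
```

Note that the second argument grows with the length of the protocol name, since the trailing colon is part of the comparison.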
The main reason that autoconfig files were invented was the sheer number of possible cache setups. It's difficult (or even impossible) to represent all of the possible combinations that an autoconfig file can provide you with.
There is no config file that will work for everyone, so a couple of config files are included here, one of which should suit your setup.
A small organization is the easiest to create an autoconfig file for. Since you will have a moderately small number of IP addresses, you can use the isInNet function to discover if the destination host is local or not (a large organization, such as an ISP, would need a very long autoconfig file simply because it has many IP address ranges).
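A file for a small organization might be sketched as follows. The local domain (.domain.example) and the address range (192.0.2.0 with a 255.255.255.0 netmask) are assumptions made for the example; use your own values.

```javascript
function FindProxyForURL(url, host) {
    // Cheap string checks come first; isInNet is last because it
    // forces the browser to resolve the hostname.
    if (isPlainHostName(host) ||
        dnsDomainIs(host, ".domain.example") ||
        isInNet(host, "192.0.2.0", "255.255.255.0")) {
        return "DIRECT";
    }
    return "PROXY cache.domain.example:3128; DIRECT";
}
```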
Since dialup customers don't have intranet systems, a dialup ISP would have a very straightforward config file. If you wish your customers to connect directly to your web server (why waste the disk space of a cache when you have the origin server rack-mounted above it), you should use the dnsDomainIs function:
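A sketch of such a file, assuming your own web servers live under the hypothetical domain .domain.example:

```javascript
function FindProxyForURL(url, host) {
    // The ISP's own servers are next to the cache: skip the cache.
    if (dnsDomainIs(host, ".domain.example")) {
        return "DIRECT";
    }
    return "PROXY cache.domain.example:3128; DIRECT";
}
```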
When you are providing a public service, you have no control over what your customers call their machines. You have to handle the generic names (like intranet) and hope that people name their machines according to the de-facto standards.
Many large ISPs will have more than one cache server. To avoid duplicating objects, these cache servers have to communicate with one another. Consider the following:
cache1 gets a request for an object. It caches the page, and stores it on disk. An hour or so later, cache2 gets a request for the same page. To find a local copy of the object, cache2 has to query the other caches. Add more and more caches, and your number of queries goes up.
If an incoming request for a specific URL only ever went to one cache, your caches would not need to communicate with one another. A client requesting the page http://www.qualica.com/ would always connect to cache1.
Let's assume that you have 5 caches. Splitting the Internet into five pieces would split the load across the caches almost evenly. How do you split though? By destination IP address? No, since IP addresses like 19?.*.*.* are much more common than 5.*.*.*. By domain? No again, since a single busy domain like microsoft.com would push a disproportionate share of the load onto one cache.
Some of you will know what a hash function is. If not, don't panic: you can still use the Super Proxy script without knowing the theoretical basis of the algorithms involved.
The Super Proxy Script allows you to split up the Internet by URL (the combination of hostname, path and filename). If you have 5 cache servers, you split up the domain of possible answers into 5 parts. (A hash function returns a number, so we are using the appropriate terms: a domain is not an Internet domain in this context.) With a good hashing function, the numbers returned are going to be spread across the 5 parts evenly, which spreads your load evenly too.
If you have a cache which is twice as powerful as your others, you can allocate it more of the domain, and put more load on it.
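The general idea can be sketched in a few lines. This is a simplified illustration only: it is not the Super Proxy Script, and the hash below is a made-up string hash rather than the one CARP uses. The cache names are hypothetical.

```javascript
// Hypothetical cache list. Listing a cache twice would give it
// roughly a double share of the URLs.
var caches = [
    "cache1.domain.example:3128",
    "cache2.domain.example:3128",
    "cache3.domain.example:3128",
    "cache4.domain.example:3128",
    "cache5.domain.example:3128"
];

// Illustrative string hash (an assumption, not the CARP hash):
// folds every character of the URL into a number below 65536.
function hashUrl(url) {
    var h = 0;
    for (var i = 0; i < url.length; i++) {
        h = (h * 31 + url.charCodeAt(i)) % 65536;
    }
    return h;
}

function FindProxyForURL(url, host) {
    // The same URL always hashes to the same cache.
    var slot = hashUrl(url) % caches.length;
    return "PROXY " + caches[slot] + "; DIRECT";
}
```

Because the choice depends only on the URL, every browser using this file sends a given URL to the same cache, so the caches never need to ask each other for it.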
CARP is used by some cache servers (most notably Microsoft Proxy and Squid) to decide which parent cache to send a request to. Browsers can also use CARP to decide which cache to talk to, using a JavaScript autoconfig script. For more information (and an example script), you should look at the Super Proxy Script web page.