Acl lines

The examples so far have given you an idea of an acl line's layout, which can be symbolized as follows:

acl name type (string|"filename") [string2] [string3] ["filename2"]

The acl tag consists of a minimum of three fields: a unique name, an acl type, and a decision string. An acl line can have more than one decision string, hence the [string2] and [string3] in the line above.

A unique name

This is supposed to be descriptive. Use a name such as customers or mynet. You have seen this lots of times before: the name myNet in earlier examples is one such case.

There must only be one acl with a given name; if you find that you have two or more acls with similar names, you can append a number to each name: customer1, customer2 etc. I generally avoid this, instead putting all of the similar data into a file and including the whole file as one acl. Check the Decision String section for some more info on this.

Type

So far we have discussed only acls that check the source IP address of the connection. This isn't sufficient for many people: it may be useful for you to allow connections at only certain times, or to only specific domains, or by only some users (using usernames and passwords). If you really want to, you can even combine all of the above: only allow connections from users that have the right password, have the right destination and are going to the right domain. There are quite a few different acl types: the next section of this chapter discusses all of the different types in detail. In the meantime, let's finish the description of the structure of the acl line.

Decision String

The acl code uses this string to check if the acl matches a given connection; Squid consults the type field of the acl line to decide how to interpret the decision string. The decision string could be an IP address range, a regular expression, a list of domains, or one of several other things. In the next section (where we discuss the types of acls available) we discuss the different forms the Decision String can take.

If you have another look at the formal definition of the acl line above, you will note that you can have more than one decision string per acl line. Strings in this format are ORed together; if you were to specify two IP address ranges on the same line, the acl would return true if either of the ranges matches, as in the sketch below. (If decision strings were ANDed together, an incoming request would have to come from two IP address ranges at the same time. This is not impossible, but would almost certainly be pointless.)
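
Here is a minimal sketch; the acl name and address ranges are illustrative:

# matches a connection from EITHER of these two ranges
acl myNets src 10.0.0.0/255.0.0.0 172.16.0.0/255.240.0.0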

Large decision lists can be stored in files, so that your squid.conf doesn't get cluttered. Some of the caches I have worked on have had in the region of 2000 lines of acl rules, which could lead to a very cluttered squid.conf file. You can include a file into the decision section of an acl list by placing the filename (with path) in double-quotes. The file simply contains the data set; one datum per line. In the next example the file /usr/local/squid/conf/data/myNets can contain any number of IP ranges, one range per line.
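
Assuming the file exists and is readable by Squid, the acl line would look like this:

# /usr/local/squid/conf/data/myNets holds one IP range per line
acl myNets src "/usr/local/squid/conf/data/myNets"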

While on the topic of long lists of acls: it's important to note that you can end up slowing your cache response with very long lists of acls. Checking acls requires CPU time, and long lists can decrease cache performance, since instead of moving data to clients Squid is busy checking access lists. What constitutes a long list? Don't worry about lists with a few hundred entries unless you have a really slow or busy CPU. Lists thousands of lines long can, however, cause problems.

Types of acl

So far we have only spoken about acls that filter by source IP address. There are numerous other acl types:

Source/Destination Domain

Squid can also limit requests by their source domain. Though it doesn't always happen in the real world, network administrators can add reverse DNS entries for each of the hosts on their network. (These records are normally referred to as PTR records.) Squid can make decisions about the validity of incoming requests by checking their reverse DNS entries. In the example below, the acl is true if the request comes from a host with a reverse entry that is in either the qualica.com or squid-cache.org domains.

acl myDomain srcdomain .qualica.com .squid-cache.org
http_access allow myDomain

Reverse DNS matches should not be used where security is important. A determined attacker (who controlled the reverse DNS entries for the attacking host) would be able to manipulate these entries so that the request appears to come from your domain. Squid doesn't attempt to check that reverse and forward DNS entries match, so this option is not recommended where security matters.

Squid can also be configured to deny requests to specific domains. Many people implement these filter lists for pornographic sites. The legal implications of this filtering are not covered here: there are many, and the relevant law is in a constant state of flux, so advice here would likely be obsolete in a very short period of time. I suggest that you consult a good lawyer if you want to do something like this.

The dstdomain acl type allows one to match accesses by destination domain. This could be used to match URLs for popular adult sites, and refuse access (perhaps during specific times).

If you want to deny access to a set of sites, you will need to find out these sites' IP addresses, and deny access to those IP addresses too. If you deny access only by name, someone determined to reach a specific site could find out the IP address associated with that hostname and get to it by entering the IP address in their browser.

The above is best described with an example. Here, I assume that you want to restrict access to the site www.adomain.example. If you use either the host or nslookup commands, you will find that this server has the IP address 10.255.1.2. It's easiest to have two acls: one for IPs and one for domains. If the lists get too large, you can simply place them in files.
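
A minimal sketch follows; the acl names are illustrative:

acl badDomains dstdomain www.adomain.example
acl badIPs dst 10.255.1.2
http_access deny badDomains
http_access deny badIPs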

Words in the requested URL

Most caches can filter out URLs that contain a set of banned words. Regular expressions allow you to simply check if a word is in a given URL, but they also allow for more powerful searches of the URL. With a simple word check you would find it nearly impossible to create a rule that allows access to sites with the word sex in the URL, but at the same time denies access to all avi files on that site. With regular expressions this sort of checking becomes easy, once you understand the regex syntax.

Using regular expressions to match words in the requested URL

Using regular expressions allows you to create more flexible access lists. So far you have only been able to filter sites by destination domain, where you have to match the entire domain to deny access to the site. Since regular expressions are used to match text strings, you can use them to match words, partial words or patterns in URLs or domains.

The most common use of regex filters in acl lists is the creation of far-reaching site filters: if the URL or domain contains one of a set of banned words, access to the site is denied. If you wish to deny access to sites that contain the word sex in the URL, you need add only one acl rule, rather than trying to find every site that has adult material on it.

The big problem with regex filters is that not all sites that contain the word sex in the URL are pornographic. By denying these sites you are likely to be infringing people's rights, and you should refer to a lawyer for advice on the legality of this.

Creating a list of sites that you don't want accessed can be tedious. There are companies that sell adult/unwanted material lists which plug into Squid, but these can be expensive. If you cannot justify the cost, you can build and maintain your own list, though keeping it up to date takes time.

The url_regex acl type is used to match any word in the URL. Here is an example:
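
# a minimal sketch: deny any URL containing the word "sex"
# (the acl name is illustrative; -i makes the match case-insensitive)
acl adultWords url_regex -i sex
http_access deny adultWords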

In places where bandwidth is very expensive, system administrators may have no problem with people visiting pornographic sites. They may, however, want to stop people downloading huge avi files from these sites. The following example would deny downloads of avi files from sites that contain the word sex in the URL. The regular expression below matches any URL that contains the word sex AND ends with .avi.
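
# a sketch; the acl name is illustrative
acl sexAvi url_regex -i sex.*\.avi$
http_access deny sexAvi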

The urlpath_regex acl type strips off the URL scheme (such as http://) and the hostname, checking only the path and filename.
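
For example, the following sketch (the acl name is illustrative) matches any request whose path ends in .avi, no matter which site it is destined for:

acl aviFiles urlpath_regex -i \.avi$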

Current day/time

Squid allows one to restrict access by day and time. Often businesses wish to filter out irrelevant sites during work hours. The Squid time acl type allows you to filter by the current day and time. By combining the dstdomain and time acls you can allow access to specific sites (such as the sites of suppliers or other associates) during work hours, and to other sites only after work hours.

The layout is quite compact:

acl name time [day-list] [start_hour:minute-end_hour:minute]

Day-list is a list of single characters indicating the days that the acl applies to. Using the first letter of each day would be ambiguous (since, for example, both Tuesday and Thursday start with the same letter). When the first letter is ambiguous, the second letter is used: T stands for Tuesday, H for Thursday. Here is a list of the days with their single-letter abbreviations:

S - Sunday
M - Monday
T - Tuesday
W - Wednesday
H - Thursday
F - Friday
A - Saturday

Start_hour and end_hour are values in military time (17:00 instead of 5:00pm). End_hour must always be larger than start_hour; this means (unfortunately) that you cannot do the following:

# since start_hour must be smaller than end_hour, this won't work:
acl darkness time 17:00-6:00

The only alternative to the darkness example above is something like this:

acl night time 17:00-24:00
acl early_morning time 00:00-6:00

As you can see from the original definition of the time acl, you can specify the day of the week (with no time), the time (with no day), or both the time and day. You can, for example, create a rule that specifies weekends without saying that the day starts at midnight and ends at the following midnight. The following acl will match on either Saturday or Sunday.

acl weekends time SA

The above example is too basic for real-world use. Unfortunately, creating a good example requires some of the more advanced features of the http_access line; these are covered in the next section of this chapter, and examples are included there.

Destination Port

Web servers almost always listen for incoming requests on port 80. Some servers (notably site-specific search engines and unofficial sites) listen on other ports, such as 8080. Other services (such as IRC) also use high-numbered ports. Because of the way HTTP is designed, people can connect to things like IRC servers through your cache server (even though the IRC protocol is very different to the HTTP protocol). The same loophole can be used to tunnel telnet connections through your cache server. The major part of the HTTP specification that allows for this is the CONNECT method, which is used by clients to connect to web servers using SSL.

Since you generally don't want to proxy anything other than the standard supported protocols, you can restrict the ports that your cache is willing to connect to. The default Squid config file limits standard HTTP requests to the port ranges defined in the Safe_ports squid.conf acl. SSL CONNECT requests are even more limited, allowing connections to only ports 443 and 563.

Port ranges are limited with the port acl type. If you look in the default squid.conf, you will see lines like the following:

acl SSL_ports port 443 563
acl Safe_ports port 80 21 443 563 70 210 1025-65535

The format is pretty straightforward: destination ports 443 or 563 are matched by the first acl; ports 80, 21, 443, 563 and so forth by the second line. The most complicated part of the example above is the end of the second line: the text that reads "1025-65535".

The "-" character is used in squid to specify a range. The example thus matches any port from 1025 all the way up to 65535. These ranges are inclusive, so the second line matches ports 1025 and 65535 too.

The only low-numbered ports which Squid should need to connect to are 80 (the HTTP port), 21 (the FTP port), 70 (the Gopher port), 210 (wais) and the appropriate SSL ports. All other low-numbered ports (where common services like telnet run) do not fall into the 1025-65535 range, and are thus denied.

The following http_access line denies access to URLs that are not in the correct port ranges. You have not seen the ! operator before: it inverts the acl's result. The line below would read "deny access if the request does not fall in the range specified by the Safe_ports acl" if it were written in English. If the port matches one of those specified in the Safe_ports acl line, the next http_access line is checked. More information on the format of http_access lines is given in the next section, Acl-operator lines.

http_access deny !Safe_ports

Method (HTTP GET, POST or CONNECT)

HTTP can be used for downloads (GETting data) or uploads (POSTing data to a site). The CONNECT method is used for SSL data transfers. When a connection is made to the proxy, the client specifies what kind of request (called a method) it is sending. A GET request looks like this:

GET http://www.qualica.com/ HTTP/1.1
blank-line

If you were connecting using SSL, the GET word would be replaced with the word CONNECT.

You can control which methods are allowed through the cache using the method acl type. The most common use is to stop CONNECT-type requests to non-SSL ports. The CONNECT method allows data transfer in any direction at any time: if you telnet to a badly configured proxy and enter something like the following, you could end up connected to the remote machine as if you had telnetted there from the cache server itself. This could get around packet-filters, firewall access lists and passwords, which is generally considered a bad thing!

CONNECT www.domain.example:23 HTTP/1.1
blank-line

Since CONNECT requests can be quite easily exploited, the default squid.conf denies SSL requests to non-standard ports, as we discussed in the previous section (on the port acl type).
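
The default rules amount to something like the following sketch, which reuses the SSL_ports acl shown earlier:

acl CONNECT method CONNECT
http_access deny CONNECT !SSL_ports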

Let's assume that you want to stop your clients from POSTing to any sites. (Note that doing this is not a good idea, since people using some search engines, for example, would run into problems; at this stage this is just an example.)
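
A sketch of such a rule might look like this; the acl name is illustrative:

acl postRequests method POST
http_access deny postRequests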

User name

Logs generally show the source IP address of a connection. When this address is on a multiuser machine (let's use a Unix machine at a university as an example) you cannot pin down a request as being from a specific user. There could be hundreds of people logged into the Unix machine, and they could all be using the cache server. Trying to track down a misbehaver is very difficult in this case, since you can never be sure which user is actually doing what. To solve this problem, the ident protocol was created. When the cache server accepts a connection, it can connect back to the machine the connection came from (on a low-numbered port, so the reply cannot easily be faked) and find out who just connected. This doesn't make any sense on Windows systems: people can just load their own ident servers (and become Daffy Duck for a day). If you run multi-user systems then you may want only certain people on those machines to be able to use the cache. In this case you can use the ident username to allow or deny access, as in the sketch below.
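
The usernames here are illustrative:

acl trustedUsers ident john mary
http_access allow trustedUsers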

One of the best things about Unix is the flexibility you get. If you wanted (for example) only students in their second year and above to have access to the cache servers via your Unix machines, you could create a replacement ident server. This server could find out which user has connected to the cache, but instead of returning the username it could return a string like "third_year" or "postgrad". Rather than maintaining a list of which students are which on both the cache server and the central Unix system, you could use simple Squid rules, and the ident server could do all the work of checking which user is which.
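
With such an ident server in place, the Squid side could be as simple as the following sketch (the group names are taken from the example above):

acl seniorStudents ident third_year postgrad
http_access allow seniorStudents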

Username/Password pair

If you want to track Internet usage it's best to get users to log into the cache server when they want to use the net. You can then use a stats program to generate per-user reports, no matter which machine on your network a person is using. Universities and colleges often have labs with many machines, where it is difficult to tell which user is sitting in front of a machine at any specific time. By using names and passwords you will solve this problem.

Squid uses modules to do user authentication, rather than including the code directly. The default Squid source does, however, include two standard modules: the first authenticates users from a file, the other uses SMB (Windows NT) authentication. These modules are in the auth_modules directory in the source directory. They are not compiled when you compile Squid itself, so you will need to choose an authentication module and run make in the appropriate directory. If the compile goes well, a make install will place the program file in the /usr/local/squid/bin/ directory and any config files in the /usr/local/squid/etc/ directory.

NCSA authentication is the easiest to use, since it's self contained. The SMB authentication program requires that SAMBA be installed, since it effectively talks to the NT server through SAMBA.

The squid.conf file uses the authenticate_program tag to decide which external program to use to authenticate users. If Squid were to start only one authentication program, a slow username/password lookup could slow the whole cache down (while all other connections waited to be authenticated). Squid thus opens more than one authentication program at a time, sending pending requests to the second when the first is busy, to the third when the second is busy, and so forth. The actual number started is specified by the authenticate_children squid.conf value. The default number started is five, but if you have a heavily loaded cache then you will need to increase this value.
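
Assuming you installed the NCSA module as described above, the relevant squid.conf lines might look like the following sketch. The password-file path is illustrative, and the proxy_auth acl type is what ties authentication into access control:

# run the NCSA authenticator against a password file
authenticate_program /usr/local/squid/bin/ncsa_auth /usr/local/squid/etc/passwd
authenticate_children 10
# allow any request that supplies a valid username/password
acl authUsers proxy_auth REQUIRED
http_access allow authUsers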