The examples so far have given you an idea of an acl line's layout, which can be symbolized as follows:
acl name type (string|"filename") [string2] [string3] ["filename2"]
The acl tag consists of a minimum of three fields: a unique name, an acl type, and a decision string. An acl line can have more than one decision string, hence the [string2] and [string3] in the line above.
The name is supposed to be descriptive: use a name such as customers or mynet. You have seen this lots of times before; the word myNet in the earlier examples is one such case.
There must only be one acl with a given name; if you find that you have two or more acls with similar names, you can append a number to the name: customer1, customer2 and so forth. I generally avoid this, instead putting all of the similar data into a file and including the whole file as one acl. Check the Decision String section for more on this.
So far we have discussed only acls that check the source IP address of the connection. This isn't sufficient for many people: it may be useful for you to allow connections only at certain times, only to specific domains, or only by some users (using usernames and passwords). If you really want to, you can even combine all of the above: only allow connections from users that have the right password, come from the right network, and are going to the right domain. There are quite a few different acl types: the next section of this chapter discusses them all in detail. In the meantime, let's finish the description of the structure of the acl line.
Squid uses the decision string to check whether the acl matches a given connection; the type field of the acl line determines how the string is interpreted. The decision string could be an IP address range, a regular expression, a list of domains or more. In the next section (where we discuss the types of acls available) we discuss the different forms of the Decision String.
If you have another look at the formal definition of the acl line above, you will note that you can have more than one decision string per acl line. Strings in this format are ORed together; if you were to specify two IP address ranges on the same line, the acl would return true if either of the ranges matched. (If source strings were ANDed together, an incoming request would have to come from two IP address ranges at the same time. This is not impossible, but would almost certainly be pointless.)
Large decision lists can be stored in files, so that your squid.conf doesn't get cluttered. Some of the caches I have worked on have had in the region of 2000 lines of acl rules. You can include a file into the decision section of an acl by placing the filename (with path) in double-quotes. The file simply contains the data set, one datum per line. In the next example the file /usr/local/squid/conf/data/myNets can contain any number of IP ranges, one range per line.
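A sketch of how this looks (the acl name and file path follow the example described above):
acl myNet src "/usr/local/squid/conf/data/myNets"
http_access allow myNet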
While on the topic of long lists of acls: it's important to note that you can end up slowing your cache response with very long lists of acls. Checking acls requires CPU time, and long lists can decrease cache performance, since instead of moving data to clients Squid is busy checking access lists. What constitutes a long list? Don't worry about lists with a few hundred entries unless you have a really slow or busy CPU. Lists thousands of lines long can, however, cause problems.
So far we have only spoken about acls that filter by source IP address. There are numerous other acl types:
Source/Destination IP address
Source/Destination Domain
Regular Expression match of requested domain
Words in the requested URL
Words in the source or destination domain
Current day/time
Destination port
Protocol (FTP, HTTP, SSL)
Method (HTTP GET or HTTP POST)
Browser type
Name (according to the Ident protocol)
Autonomous System (AS) number
Username/Password pair
SNMP Community
In the examples earlier in this chapter you saw lines in the following
format:
acl myNet src 10.0.0.0/255.255.0.0
The above acl will match any request coming from an IP address between 10.0.0.0 and 10.0.255.255.
In recent years more and more people are using Classless Inter-Domain Routing (CIDR) format netmasks, like 10.0.0.0/16. Squid handles both the traditional IP/Netmask notation and the more recent IP/Bits notation in the src acl type. IP ranges can also be specified in a third, Squid-specific format: addr1-addr2/netmask, which matches every address from addr1 through addr2 with the given netmask applied.
http_access allow myNet
Squid can also match connections by destination IP address. The layout is very similar: simply replace src with dst. Here is a hypothetical example (the acl name and address range are made up for illustration):
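acl remoteNet dst 10.11.12.0/255.255.255.0
http_access deny remoteNet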
Squid can also limit requests by their source domain. Though it
doesn't always happen in the real world, network administrators can add
reverse DNS entries for each of the hosts on their network. (These records
are normally referred to as PTR records.) Squid can make decisions
about the validity of incoming requests by checking their reverse DNS
entries. In the example below, the acl is true if the
request comes from a host with a reverse entry that is in either the
qualica.com or squid-cache.org domains.
acl myDomain srcdomain .qualica.com .squid-cache.org
http_access allow myDomain
Reverse DNS matches should not be used where security is important. A determined attacker (who controlled the reverse DNS entries for the attacking host) would be able to manipulate these entries so that the request appears to come from your domain. Squid doesn't attempt to check that reverse and forward DNS entries match, so this option is not recommended.
Squid can also be configured to deny requests to specific domains. Many people implement these filter lists for pornographic sites. The legal implications of this filtering are not covered here: there are many, and the relevant law is in a constant state of flux, so advice here would likely be obsolete in a very short period of time. I suggest that you consult a good lawyer if you want to do something like this.
The dstdomain acl type allows one to match accesses by destination domain. This could be used to match URLs for popular adult sites, and refuse access (perhaps during specific times).
If you want to deny access to a set of sites, you will need to find out these sites' IP addresses, and deny access to the IP addresses too. If you only put the domain names in, someone determined to access a specific site could find out the IP address associated with that hostname and access it by entering the IP address in their browser.
The above is best described with an example. Here, I assume that you want to restrict access to the site www.adomain.example. If you use either the host or nslookup commands, you would find that this server has the IP address 10.255.1.2. It's easiest to have two acls: one for IPs and one for domains. If the lists get too large, you can simply place them in files.
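A sketch of the two-acl approach (the acl names are made up; the domain and address are the ones used above):
acl adult_domains dstdomain www.adomain.example
acl adult_IPs dst 10.255.1.2/255.255.255.255
http_access deny adult_domains
http_access deny adult_IPs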
Most caches can filter out URLs that contain a set of banned words. Regular expressions allow you to simply check if a word is in a given URL, but they also allow for more powerful searches of the URL. With a simple word check you would find it nearly impossible to create a rule that allows access to sites with the word sex in the URL, but at the same time denies access to all avi files on that site. With regular expressions this sort of checking becomes easy, once you understand the regex syntax.
We haven't encountered regular expressions in this book yet. A regular expression (regex) is an incredibly useful way of matching strings. Because they are so powerful, they can get a little complicated. Regexes are often used in string-oriented languages like Perl, where they make processing of large text files (such as logs) very easy. Squid uses regular expressions for numerous things: refresh patterns and access control among them.
If you have not used regular expressions before, you might want to have a look at the O'Reilly book on regular expressions, or the appropriate section of the O'Reilly Perl book. Instead of going into detail here, I am just going to give some (hopefully) useful examples. If you have perl installed on your machine, you can have a look at the perlre manual page to get an idea as to how the various regex operators (such as .) function.
Regular expressions in Squid are case-sensitive by default. If you want to match both upper- and lower-case text, you can prefix the regular expression with -i. Have a look at the next example (the acl name is made up), where we use this to match sex, SEX (or even SeX).
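acl sex_words url_regex -i sex
http_access deny sex_words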
Using regular expressions allows you to create more flexible access lists. So far you have only been able to filter sites by destination domain, where you have to match the entire domain to deny access to the site. Since regular expressions are used to match text strings, you can use them to match words, partial words or patterns in URLs or domains.
The most common use of regex filters in ACL lists is for the creation of far-reaching site filters: if the URL or domain contains one of a set of banned words, access to the site is denied. If you wish to deny access to sites that contain the word sex in the URL, you would add one acl rule, rather than trying to find every site that has adult material on it.
The big problem with regex filters is that not all sites that contain the word sex in the URL are pornographic. By denying these sites you are likely to be infringing people's rights, and you should refer to a lawyer for advice on the legality of this.
Creating a list of sites that you don't want accessed can be tedious. There are companies that sell adult/unwanted material lists which plug into Squid, but these can be expensive. If you cannot justify the cost, you can always create and maintain such a list yourself.
The url_regex acl type is used to match any word in the URL. Here is an example (the acl name is made up for illustration):
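acl sex_in_url url_regex sex
http_access deny sex_in_url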
In places where bandwidth is very expensive, system administrators may have no problem with people visiting pornographic sites. They may, however, want to stop people downloading huge avi files from these sites. The following example would deny downloads of avi files from sites that contain the word sex in the URL: the regular expression matches any URL that contains the word sex AND ends with .avi (again, the acl name is made up):
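acl sex_avi url_regex -i sex.*\.avi$
http_access deny sex_avi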
The urlpath_regex acl type strips off the URL type and hostname, checking instead only the path and filename.
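For instance, the following sketch (the acl name is made up) matches any request whose path ends in .avi, no matter which site it is on:
acl avi_files urlpath_regex -i \.avi$
http_access deny avi_files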
Regular expressions can also be used for checking the source and destination domains of a request. The srcdom_regex type is used to check that a request comes from a specific domain, while the dstdom_regex type checks the domain part of the requested URL. (You could check the requested domain with a url_regex tag, but you could run into interesting problems with sites that refer to pages with URLs like http://www.company.example/www.anothersite.example.)
Here is an example acl set that uses a regular
expression (rather than using the srcdomain and
dstdomain tags). This example allows you to deny access to
.com or .net sites if the request is from the .za
domain. This could be useful if you are providing a "public peering"
infrastructure to other caches in your geographical region.
Note that this example is only a fragment of a complete acl set: you would
presumably want your customers to be able to access any site, and
there is no final deny acl.
acl bad_dst_TLD dstdom_regex \.com$ \.net$
acl good_src_TLD srcdom_regex \.za$
# allow requests FROM the za domain UNLESS they want to go to \.com or \.net
http_access deny bad_dst_TLD
http_access allow good_src_TLD
Squid allows one to allow access to specific sites by time. Often businesses wish to filter out irrelevant sites during work hours. The Squid time acl type allows you to filter by the current day and time. By combining the dstdomain and time acls you can allow access to specific sites (such as the sites of suppliers or other associates) during work hours, and allow access to other sites after work hours.
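The layout of the time acl can be sketched as follows (square brackets indicate optional fields; at least one of the two must be given):
acl name time [day-abbreviations] [start_hour:minute-end_hour:minute]
The day abbreviations are as follows: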
S - Sunday
M - Monday
T - Tuesday
W - Wednesday
H - Thursday
F - Friday
A - Saturday
Start_hour and end_hour are values in military time (17:00 instead of 5:00pm). End_hour must always be larger than start_hour; this means (unfortunately) that you cannot do the following:
# since start_time must be smaller than end_time, this won't work:
acl night time 17:00-24:00
The only alternative to the night example above is something like this:
acl darkness time 17:00-23:59
acl early_morning time 00:00-6:00
As you can see from the definition of the time acl above, you can specify the day of the week (with no time), the time (with no day), or both the time and day. You can, for example, create a rule that specifies weekends without specifying that the day starts at midnight and ends at the following midnight. The following acl will match on either Saturday or Sunday.
acl weekends time SA
The following example is too basic for real-world use. Unfortunately
creating a good example requires some of the more advanced features of the
http_access line; these are covered in the next section of this chapter,
and examples are included there.
Web servers almost always listen for incoming requests on port 80. Some servers (notably site-specific search engines and unofficial sites) listen on other ports, such as 8080. Other services (such as IRC) also use high-numbered ports. Because of the way HTTP is designed, people can connect to things like IRC servers through your cache server (even though the IRC protocol is very different to the HTTP protocol). The same loophole can be used to tunnel telnet connections through your cache server. The major part of the HTTP specification that allows for this is the CONNECT method, which is used by clients to connect to web servers using SSL.
Since you generally don't want to proxy anything other than the standard supported protocols, you can restrict the ports that your cache is willing to connect to. The default Squid config file limits standard HTTP requests to the port ranges defined in the Safe_ports squid.conf acl. SSL CONNECT requests are even more limited, allowing connections to only ports 443 and 563.
Port ranges are limited with the port acl type. If you look in the default squid.conf, you will see lines like the following:
acl SSL_ports port 443 563
acl Safe_ports port 80 21 443 563 70 210 1025-65535
The format is pretty straightforward: destination ports 443 OR 563 are matched by the first acl; 80, 21, 443, 563 and so forth by the second. The most complicated part of the example above is the end of the second line: the text that reads "1025-65535".
The "-" character is used in Squid to specify a range. The example thus matches any port from 1025 all the way up to 65535. These ranges are inclusive, so the line matches ports 1025 and 65535 too.
The only low-numbered ports which Squid should need to connect to are 80 (the HTTP port), 21 (the FTP port), 70 (the Gopher port), 210 (wais) and the appropriate SSL ports. All other low-numbered ports (where common services like telnet run) do not fall into the 1025-65535 range, and are thus denied.
The following http_access line denies access to URLs that are not in
the correct port ranges. You have not seen the ! http_access
operator before: it inverts the decision. The line below would
read "deny access if the request does not fall in the range
specified by acl Safe_ports" if it were written in English. If the
port matches one of those specified in the Safe_ports acl line, the
next http_access line is checked. More information on the format of
http_access lines is given in the next section Acl-operator lines.
http_access deny !Safe_ports
Some people may wish to restrict their users to specific protocols. The proto acl type allows you to restrict access by the URL prefix: the http:// or ftp:// bit at the front. The following example (the acl name is made up) will deny requests that use the FTP protocol.
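acl ftp proto FTP
http_access deny ftp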
The default squid.conf file denies access to a special type of URL: URLs which use the cache_object protocol. When Squid sees a request for one of these URLs it serves up information about itself: usage statistics, performance information and the like. The world at large has no need for this information, and it could be a security risk.
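The relevant fragment of the default squid.conf looks something like this (a sketch: requests for cache_object URLs are allowed only from the local machine and denied to everyone else):
acl manager proto cache_object
acl localhost src 127.0.0.1/255.255.255.255
http_access allow manager localhost
http_access deny manager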
HTTP can be used for downloads (GETting data) or uploads (POSTing data to a site).
The CONNECT mode is used for SSL data
transfers. When a connection is made to the proxy the client
specifies what kind of request (called a method)
it is sending.
A GET request looks like this:
GET http://www.qualica.com/ HTTP/1.1
If you were connecting using SSL, the GET word would be replaced
with the word CONNECT.
You can control which methods are allowed through the cache using the method acl type. The most common use is to stop CONNECT type requests to non-SSL ports. The CONNECT method allows data transfer in any direction at any time: if you were to telnet to a badly configured proxy and enter something like the following, you could end up connected to the remote machine as if you had telnetted there from the cache server itself. This could get around packet-filters, firewall access lists and passwords, which is generally considered a bad thing!
CONNECT www.domain.example:23 HTTP/1.1
Since CONNECT requests can be quite easily exploited, the default squid.conf denies access to SSL requests to non-standard ports, as we saw in the previous section (on the port acl type).
Let's assume that you want to stop your clients from POSTing to any sites. (Note that doing this is not a good idea, since people using some search engines, for example, would run into problems; at this stage this is just an example.) A sketch of such a rule (the acl name is made up):
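acl post_requests method POST
http_access deny post_requests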
Companies sometimes have policies as to what browsers people can use. The browser acl type allows you to specify a regular expression that is matched against the browser's user-agent string, which can be used to allow or deny access. A sketch (the acl name and browser string are made up):
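acl bad_browser browser BadBrowser
http_access deny bad_browser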
Logs generally show the source IP address of a connection. When this address is on a multiuser machine (let's use a Unix machine at a university as an example) you cannot pin down a request as being from a specific user. There could be hundreds of people logged into the Unix machine, and they could all be using the cache server. Trying to track down a misbehaver is very difficult in this case, since you can never be sure which user is actually doing what. To solve this problem, the ident protocol was created. When the cache server accepts a connection, it can connect back to the originating machine (on a low-numbered port, so the reply cannot be faked) and find out who just connected. This doesn't make any sense on Windows systems: people can just load their own ident servers (and become daffy duck for a day). If you run multi-user systems then you may want only certain people on those machines to be able to use the cache. In this case you can use the ident username to allow or deny access.
One of the best things about Unix is the flexibility you get. If you wanted (for example) only students in their second year and above to have access to the cache servers via your Unix machines, you could create a replacement ident server. This server could find out which user has connected to the cache, but instead of returning the username it could return a string like "third_year" or "postgrad". Rather than maintaining a list of which students are in which year on both the cache server and the central Unix system, you could use simple Squid rules, and let the ident server do all the work of checking which user is which.
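A sketch of such a rule, using the ident acl type and one of the strings from the example above (the acl name is made up):
acl postgrads ident postgrad
http_access allow postgrads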
Squid is often used by large ISPs. These ISPs want all of their customers to have access to their caches without incredibly long, manually maintained ACL lists (don't forget that such long lists of IPs generally increase the CPU usage of Squid too). Large ISPs all have AS (Autonomous System) numbers, which are used by Internet routers running the BGP (Border Gateway Protocol) routing protocol.
The whois server whois.ra.net keeps a (supposedly authoritative) list of all the IP ranges that are in each AS. Squid can query this server and get a list of all IP addresses that the ISP controls, reducing the number of rules required. The data returned is also stored in a radix tree, for more CPU-friendly retrieval.
Sometimes the whois server is updated only sporadically, which could lead to new networks being denied access incorrectly. If you are going to use this function, it's probably best to automate the process of adding your new IP ranges to the whois server.
If your region has some sort of local whois server that handles queries in the same way, you can use the as_whois_server Squid config file option to query a different server.
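A sketch using the src_as acl type (the acl name and AS number are made up):
acl myCustomers src_as 7777
http_access allow myCustomers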
If you want to track Internet usage it's best to get users to log into the cache server when they want to use the net. You can then use a stats program to generate per-user reports, no matter which machine on your network a person is using. Universities and colleges often have labs with many machines, where it is difficult to tell which user is sitting in front of a machine at any specific time. By using names and passwords you will solve this problem.
Squid uses modules to do user authentication, rather than including the code directly. The default Squid source does, however, include two standard modules: the first authenticates users from a file, the other uses SMB (Windows NT) authentication. These modules are in the auth_modules directory in the source tree. They are not compiled when you compile Squid itself: you will need to choose an authentication module and run make in the appropriate directory. If the compile goes well, a make install will place the program file in the /usr/local/squid/bin/ directory and any config files in the /usr/local/squid/etc/ directory.
NCSA authentication is the easiest to use, since it's self-contained. The SMB authentication program requires that SAMBA be installed, since it effectively talks to the NT server through SAMBA.
The squid.conf file uses the authenticate_program tag to decide which external program to use to authenticate users. If Squid were to start only one authentication program, a slow username/password lookup could slow the whole cache down (while all other connections waited to be authenticated). Squid thus opens more than one authentication program at a time, sending pending requests to the second when the first is busy, to the third when the second is, and so forth. The actual number started is specified by the authenticate_children squid.conf value. The default number started is five, but if you have a heavily loaded cache then you will need to increase this value.
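For example, to start ten authenticator processes instead of the default five, you would set:
authenticate_children 10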
To use the NCSA authentication module, you will need to add the following
line to your squid.conf:
authenticate_program /usr/local/squid/bin/ncsa_auth /usr/local/squid/etc/passwd
You will also need to create the appropriate password file
(/usr/local/squid/etc/passwd in the example above). This file
consists of a username and password pair, one per line, where the username
and password are separated by a colon (:), just as they are in your
/etc/passwd file (assuming you are running Unix). The password is
encrypted with the same function as the passwords in /etc/passwd (or
/etc/shadow on newer systems) are.
Here is an example password line:
oskar:lKdpxbNzhlo.w
Since the encrypted passwords are the same, you could simply copy the system password file periodically; the ncsa_auth module understands the /etc/passwd and /etc/shadow file formats. If your users do not already have passwords in Unix crypt format somewhere, you will have to use the htpasswd program to generate the appropriate user and password pairs. This program is included in the /usr/local/squid/bin/ directory.
If you have configured Squid to support SNMP, you can also create acls that filter by the requested SNMP community. By combining source address (with the src acl type) and community filters (using the snmp_community acl type) you can restrict sensitive SNMP queries to administrative machines while allowing safer queries from the public. SNMP setup is covered in more detail later in the chapter, where we discuss the snmp_access acl-operator.
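A sketch of such a combination (the acl names, community string and address are made up):
acl admin_host src 10.0.0.2/255.255.255.255
acl snmp_public snmp_community public
# anyone may query the "public" community; any community is allowed from the admin machine
snmp_access allow snmp_public
snmp_access allow admin_host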