Deciding which hierarchy structure to use is difficult. It's also often very difficult to change later, since you can have thousands of clients accessing you directly, and even more through your peers.
Here, I cover the most common cache-peer architectures. We start off with the simplest setup that counts as peering: two caches talking to one another as siblings. I'll try to cover all the issues with this simple setup, and then move on to larger cache meshes.
We have two caches: your cache and their cache. You have a fast, low-latency link between them, both caches have plenty of disk space, and neither is going to run into load problems anytime soon. Let's look at the peering options you have:
ICP peering. ICP is the obvious choice here; the low-latency link means that checking with the other cache adds no significant delay, and the server on the other side isn't going to become a limiting factor any time soon.
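A sibling relationship like this comes down to a single cache_peer line on each side. Here is a minimal sketch (the hostname is invented, and the ports assume the usual 3128 for HTTP and 3130 for ICP):

    # squid.conf on your cache: query their cache as an ICP sibling
    cache_peer cache.their-site.example sibling 3128 3130
    # their cache gets the mirror-image line pointing back at yours

The same line with "parent" instead of "sibling" would also let you fetch misses through them, which is not what you want between equals.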
Multicast ICP peering. Multicast is not really useful here: it only pays off when your cache is talking to lots of other caches. With just one cache on the other side, the effort of setting up multicast isn't worth it. If you got some other benefit from the multicast infrastructure (video conferencing, for example), it would become more worthwhile; as things stand, though, there is no significant bandwidth or load saving to be had from multicast.
Cache Digests. Cache digests are normally quite large. By the sounds of it, these caches are not very busy, and transferring digest summaries between them may use more bandwidth than you think: ICP queries are only sent when an object is requested, but cache digests are fetched automatically throughout the day and night. If a cache transfers less than the digest size in a ten-minute period, you should probably use ICP. If the line were high-latency, you should consider digests more carefully: high-latency lines are normally better at bulk data transfer than at sending lots of small packets. With a high-latency link, Squid can spend so much time waiting for returning ICP packets that browsing feels slow. With cache digests, the cache already knows whether the remote side has the object, at the cost of more bandwidth.
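If you try digests and find they cost more than they save for a particular peer, you can disable them per peer while leaving ICP in place. A sketch, assuming a Squid built with digest support (--enable-cache-digests) and an invented hostname:

    # don't fetch this sibling's digest; fall back to plain ICP queries
    cache_peer cache.their-site.example sibling 3128 3130 no-digest
    # rebuild our own digest less often, trading freshness for bandwidth and CPU
    digest_rebuild_period 1 hour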
The most common problem with this kind of setup is the "continuous object exchange". Let's say that your cache has an object. A user of their cache wants it, so their cache retrieves it from you. A few hours later you expire the object. The next day, one of your users requests it again. Your cache checks with the other cache and finds that it does, indeed, have the object (it doesn't realize that the copy originally came from you). It retrieves the object. Later on the whole process repeats itself, with their cache getting the object back from you. To stop this from happening, you may have to use the proxy-only option on the cache_peer line on both caches. This way, each cache simply fetches sibling hits over the fast link every time rather than storing them: once the cache that originally held the object expires it, the object cannot bounce back from the other side, since it was never saved there.
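The relevant line, mirrored on both sides, would look something like this (hostname invented):

    # fetch hits from the sibling over the fast link, but never store them locally
    cache_peer cache.their-site.example sibling 3128 3130 proxy-only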
With ICP, there is a chance that a reported hit is for a dynamically generated object (even if the URL path doesn't say so). Cache digests fix this problem, which may make their extra bandwidth usage worthwhile.
The traditional cache-hierarchy structure involves lots of small servers (each with its own disk space, holding the most commonly accessed objects) which query another set of large parent servers (there can even be just one large server). These large servers then query the outside world on the client caches' behalf, and keep a copy of each object so that other internal caches requesting the same page get it from them. Generally, the small servers have little disk space and are connected to the large servers by fairly slow lines.
This structure generally works well, as long as you can stop the top-level servers from becoming overloaded. If these machines have problems, all performance will suffer.
Client caches generally do not talk to one another at all. The parent cache server should have any object that a lower-down cache may have (since it fetched the object on that cache's behalf), and it's invariably faster to communicate with the head office (where the core servers are situated) than with another region (where a sibling cache would be).
In this case, the smaller servers may as well treat the core servers as default parents, even using the no-query option to reduce latency. If the head office is unreachable, things are quite likely to be unusable altogether anyway. If, on the other hand, your regional offices have their own Internet lines, you can configure the core cache as a normal parent: that way Squid will detect when the core servers are down and try to go direct. If each office has its own Internet link, though, there may be no reason to use a tree structure at all; you might want to look at the mesh section instead, which follows shortly.
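A hedged sketch of the regional-office configuration, with an invented hostname for the head-office cache; no-query suppresses ICP queries, and default makes it the parent of last resort:

    # regional cache: send everything to the head-office cache, no ICP
    cache_peer core.head-office.example parent 3128 0 no-query default
    # if the head office really is the only route out, never try to go direct
    # (assumes the usual "all" ACL is defined, as in the stock squid.conf)
    never_direct allow all

If the office has its own Internet line, you would drop the never_direct line so that Squid can go direct when the parent is unreachable.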
To avoid overloading one server, you can use the round-robin option on the cache_peer lines for each core server. This way, the load on each machine should be spread evenly.
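For two equally sized core servers, that might look like this (hostnames invented); with ICP disabled via no-query, requests simply alternate between the two parents:

    # alternate requests between the two core caches
    cache_peer core1.example.com parent 3128 0 no-query round-robin
    cache_peer core2.example.com parent 3128 0 no-query round-robin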
Large hierarchies generally use either a tree structure or a true mesh. A true mesh considers all machines equal: there is no set of large root machines, mainly because almost all of the machines are large anyway. Multicast ICP and cache digests allow large meshes to scale well, but some meshes have been around for a long time and still use only vanilla ICP.
Cache digests seem to be the best choice for large mesh setups these days. They involve bulk data transfer, but with per-request queries, machines have to become more and more powerful to handle the query load as the average mesh size increases. Instead of dealing with so many small packets, it is almost certainly better to do a larger transfer every ten minutes or so: that way, a machine only has to check its local RAM to see which peers have the object.
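In a digest-based mesh, each peer can be listed as a sibling with ICP queries turned off, so Squid relies on the peers' digests (held in local memory) to predict hits. A sketch with invented hostnames, again assuming digest support is compiled in:

    # digest-only peering: no per-request ICP traffic
    cache_peer mesh-a.example.org sibling 3128 3130 no-query
    cache_peer mesh-b.example.org sibling 3128 3130 no-query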
Pure multicast ICP meshes are another alternative. Unfortunately, only the queries are multicast; each peer still sends its reply individually, so there is plenty of traffic left, but multicast still effectively halves the number of packets flung around the network.
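A rough multicast-ICP sketch, using an invented multicast group address (239.128.1.1): queries go to the group, and each peer that answers must also be listed with the multicast-responder option so its replies are accepted:

    # send ICP queries to the multicast group; ttl limits how far they propagate
    cache_peer 239.128.1.1 multicast 3128 3130 ttl=16
    # peers listening on that group, so their ICP replies are trusted
    cache_peer mesh-a.example.org sibling 3128 3130 multicast-responder
    cache_peer mesh-b.example.org sibling 3128 3130 multicast-responder
    # each listening cache, in turn, needs:  mcast_groups 239.128.1.1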
Sometimes, a single server cannot handle the load required. DNS or CARP load balancing will allow you to split the load between two (or more) machines.
DNS load balancing is the simplest option: in your DNS zone file, you simply add two A records for the cache's hostname (you did use a hostname for the cache when you configured all those thousands of browsers like I told you, right?). The order in which the DNS server returns the addresses is rotated on each lookup, so a client doing the lookup connects to an effectively random server. The server machines can be set up to communicate with one another as peers; by using the proxy-only option, you reduce duplication of objects between the machines, saving disk space (and, hopefully, increasing your hit rate).
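The Squid side of that setup is just a pair of mirror-image sibling lines; the DNS side is the two A records for the shared hostname. A sketch with invented names:

    # on cache1.example.com: peer with the other load-balanced machine,
    # fetching its hits over the LAN but never storing a second copy
    cache_peer cache2.example.com sibling 3128 3130 proxy-only
    # cache2.example.com gets the same line pointing back at cache1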
There are other load-balancing options. If it is client caches (rather than client PCs) accessing the overloaded server, you can configure Squid on those machines with the round-robin option on their cache_peer lines. You could also use CARP (the Cache Array Routing Protocol) to split the load unevenly: if you have one very powerful machine and two less powerful machines, CARP lets you send the fast cache twice as much traffic as each of the others.
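In recent Squid releases, CARP is selected with the carp option on the parents' cache_peer lines, with weight= setting each parent's share of the URL hash space (the exact syntax has varied between versions, so check the cache_peer documentation for your release). A sketch with invented hostnames:

    # hash URLs across the parents, giving the big machine twice the share
    cache_peer big-cache.example.com   parent 3128 0 carp weight=2 no-query
    cache_peer small-cache.example.com parent 3128 0 carp weight=1 no-query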