39. LVS: L7 Switching

39.1. Introduction

Switching packets based on the content of the payload (L7 switching) is something of a holy grail in this area. However, inspecting packet contents is slow (==CPU intensive) compared to L4 switching and should not be regarded as a panacea. Commercial L7 switches are expensive. It's better, where possible, to handle switching at L4. This can often be done by rewriting your e-commerce application in an L4-friendly manner.

There are others who think that L7 switching should be handled by the app.

malcolm lists (at) netpbx (dot) org 02 Jun 2006

LVS doesn't do L7 'cause L7 should be done by your app (i.e. that's what L7 is for.)

Malcolm's statement makes perfect sense technically, but when suits are faced with the choice of finding someone to recode the application, which will take time and will have to be tested, or paying an awful lot of money to bring in an L7 switch (which comes with a slick sales team and guaranteed support), the coder loses out.

Netfilter has code for inspecting the contents of packets through the u32 option. Presumably this could be coupled with fwmark to set up an LVS.
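
A minimal sketch of the idea (the mark value, VIP and realserver addresses are made-up examples; the u32 expression is the standard "match GET at the start of the TCP payload" example). Note that LVS schedules a connection on its first packet (the SYN), which carries no payload, so this only illustrates the u32/fwmark plumbing rather than a working L7 scheduler:

# mark packets whose TCP payload starts with "GET " (0x47455420)
iptables -t mangle -A PREROUTING -d 192.168.1.110 -p tcp --dport 80 \
    -m u32 --u32 "0>>22&0x3C@ 12>>26&0x3C@ 0=0x47455420" \
    -j MARK --set-mark 1

# LVS virtual service keyed on fwmark 1, with two LVS-DR realservers
ipvsadm -A -f 1 -s rr
ipvsadm -a -f 1 -r 192.168.1.11 -g
ipvsadm -a -f 1 -r 192.168.1.12 -g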

Malcolm lists (at) loadbalancer (dot) org 23 Nov 2006

Willy Tarreau has written HAProxy, which I'm just testing now, and so far it looks very good. It supports cookie insertion, SNAT etc.

Ratz - it's rock solid

A writeup of the situations requiring L7 switching and the mechanisms used to implement them is in the documentation on the DRWS website. Although at one time these situations had to be handled at L7, L4 solutions have since been found for them.

  • session partitioned

    L7 switching will direct all packets in an SSL session to the same realserver. At L4 this can be done with persistence. For a session involving multiple ports (e.g. http, https), this can be handled by fwmark with persistence (see the sketch after this list).

    Motse gis89517 (at) cis (dot) nctu (dot) edu (dot) tw 16 Apr 2002 L4 persistence works well if the web servers have the same content, but not if two servers contain two (different) kinds of PHP code which use session management. Some products solve this problem by inserting a cookie with the IP of the back-end node; the web switch can then redirect the request based on cookies. These products still have the same problem as L4 persistence: all web servers need to have the same content. URL switching does not work in this situation. In my code, I rewrite the PHPSESSION cookie name, originally

    PHPSESSION=OOOUNNVJUFTYDKLNIUGTDRTYJUVL::POIOP
    to
    DRCCDDDDON=OOOUNNVJUFTYDKLNIUGTDRTYJUVL::POIOP
    where
            CC -> content support 0~255
            DDDD -> back-end server ID
    1. rewrite only the cookie name --> the rest of the HTTP header need not be modified
    2. rewrite the cookie name --> server A and server B can have different PHP code
    

    A request may have two or more special cookies. The rewrite is done at the realserver's packet filter, and the cookie is rewritten back at the packet filter too. In this way all of the web servers can have different content.

    If your LVS is a squid farm, then some (e.g. e-commerce) sites will lock you out if the separate requests for hits on one page come from multiple squids. Your squid farm should not be set up this way - see scheduling squids.

  • content partitioned

    Requests in a single HTTP keep-alive connection may need to access different servers. The L4 solution is to keep content identical on all servers. In the situation where content is dynamically generated, fwmark or shared directories can be used (i.e. the L4 friendly application method).

  • routing based on content, cookies, url, SSL ID

    You shouldn't use cookies (see Section 13.9.1). You can handle requests based on URL to a limited extent. As for requests based on SSL ID, no-one has an L4 solution. You may be able to find an application-level solution.
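
A minimal sketch of the fwmark-with-persistence approach mentioned in the "session partitioned" item above, assuming an example VIP of 192.168.1.110 and two LVS-DR realservers:

VIP=192.168.1.110
# give http and https to the VIP the same fwmark
iptables -t mangle -A PREROUTING -d $VIP -p tcp --dport 80  -j MARK --set-mark 2
iptables -t mangle -A PREROUTING -d $VIP -p tcp --dport 443 -j MARK --set-mark 2

# one persistent virtual service covering both marked ports
ipvsadm -A -f 2 -s rr -p 1800        # 30 minute persistence timeout
ipvsadm -a -f 2 -r 192.168.1.11 -g
ipvsadm -a -f 2 -r 192.168.1.12 -g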

39.2. KTCPVS

Preliminary L7 code, KTCPVS, has been written by Wensong. Some documentation is in the source code.

For a write-up on cookie injection see mod_backhand.

39.3. DRWS

Ping-Tzay Tsai gis89517 (at) cis (dot) nctu (dot) edu (dot) tw 12 Apr 2002

I have implemented a direct routed web switch, which is a layer 7 load balancer with three new mechanisms. The code is a patch to ipvs 0.9.8. You can get it from the DRWS project website (http://speed.cis.nctu.edu.tw/~motse/drws.htm, link dead Oct 2002)

Some benchmarks using webbench 4.1:

1  KTCPVS 0.0.5       1250 requests/sec    82 Mbits/sec
2  DRWS 0.0.0-alpha   2550 requests/sec   109 Mbits/sec

Three web servers at the back end (PIII 1GHz, 256MB RAM, eepro NIC); the load balancer is a PIII 866MHz with 128MB RAM and an eepro NIC.

39.4. Alexandre's (unnamed) L7 code

Alexandre Cassen Alexandre (dot) Cassen@free (dot) fr 11 Dec 2006

Just this quick email to announce the launch of www.linux-l7sw.org, a new opensource project for layer 7 switching. After discussions with Wensong, I decided to start a new project to try to create a 'new' switching/loadbalancing code. The low-level forwarding engine uses TCP splicing. The current code is a simple HTTP/1.0 proxy but will extend to richer featured stuff soon (if I find enough time). I had to spend a lot of time on kernel stuff and the global design. The current splicing code doesn't support TCP selective ACK or window scaling.

I have set up a mailing list at sf.net (http://lists.sourceforge.net/mailman/listinfo/linux-l7sw-devel). I am monitoring the LVS mailing list too, but it can be good to keep email on this topic in a dedicated mailing list.

39.5. From the mailing list about L7 switching

Michael Sparks Michael.Sparks (at) wwwcache (dot) ja (dot) net 12 Jul 1999

Some of the emerging redirection switches on the market support something known as Level 7 redirection, which essentially allows the redirector to look at the start of the TCP data stream by spoofing the initial connection, and make load balancing decisions based on what it sees there. (Apologies if I'm doing the equivalent of "teaching your grandma to suck eggs", but at least this way there's less misunderstanding of what I'm getting at/after.)

For example, if we have X proxy-cache servers, we could spoof an HTTP connection, grab the requested URL, and hash it to one of those X servers. If it was possible to look inside individual UDP packets as well, then we would be able to route ICP (inter-cache protocol) packets in the same way. The result is a cluster that looks like a single server to clients.
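
The hashing step itself is simple; here is a minimal illustrative shell sketch, with made-up cache names and a cksum-based hash standing in for whatever hash a real switch would use, of mapping a requested URL deterministically onto one of X caches:

#!/bin/sh
# pick one of N caches for a given URL (illustration only)
N=3                      # number of proxy-cache servers
URL="$1"
SUM=$(printf '%s' "$URL" | cksum | awk '{print $1}')
echo "send request for $URL to cache$(( SUM % N ))"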

Wensong

Do you mean that these X proxy-cache servers are not equivalent, and that they are statically partitioned to fetch different objects? For example, if proxy server 1 caches European URLs and proxy server 2 caches Asian URLs, then there is a need to parse the packets to grab the requested URL. Right?

If you want to do this, I think Apache's mod_rewrite and mod_proxy can be used to group these X proxy-cache servers into a single proxy server. Since the overhead of dispatching requests at the application level is high, its scalability may not be very good; the load balancer might become a bottleneck when there are 4 or more proxy servers.

The other way, if the request is in a single UDP packet, is to copy the data packet to userspace to grab the request; the userspace program selects a server based on the request and passes it back to the kernel.

For generic HTTP servers this could also allow the server to farm cgi-requests out to individual machines, and the normal requests to the rest (e.g. allowing you to buy a dual/quad processor to handle solely cgi-requests, but cheap/fast servers for the main donkey work).

Wensong

It is statically partitioned. Not very flexible or scalable.

We've spoken to a number of commercial vendors in the past who are developing these things but they've always failed to come up with the goods, mainly citing incompatibility between our needs and their designs :-/

Any ideas how complicated this would be to add to your system?

Wensong

If these proxy-cache servers are identical (they all handle every kind of URL request), I have a good solution using LVS to build a high-performance proxy server.

request
|-<--fetch objects directly
|                    |-----Squid 1---->reply users directly
|->LinuxDirector  ---|
  (LVS-Tun/VS-DR)    |-----Squid 2
                     |     ...
                     |-----Squid i

Since LVS-Tun/LVS-DR handles only the client-to-server half of the connection, the squid servers can fetch objects directly from the Internet and return objects directly to the users. The overhead of forwarding requests is low and scalability is very good.

ICP is used to query among these Squid servers. To avoid a multicast storm, we can add one more NIC in each squid server for ICP queries; we can call it the multicast channel.
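
A minimal ipvsadm sketch of the squid farm above (the VIP, realserver addresses and squid port are examples; the squids reply to clients directly via LVS-DR):

# virtual service for the proxy port, forwarded by direct routing
ipvsadm -A -t 192.168.1.110:3128 -s rr
ipvsadm -a -t 192.168.1.110:3128 -r 192.168.1.11 -g
ipvsadm -a -t 192.168.1.110:3128 -r 192.168.1.12 -g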

again Michael,

The reason I asked if the code could be modified to do this:

look at the start of the TCP data stream, by spoofing the initial
connection and making load balancing based on what it sees there.

Is to enable us to do this:

  1. Spoof the connection until we're able to grab the URL requested in the HTTP header.
  2. Based on that info make a decision as to which server should deal with the request.
  3. Fake the start of a TCP request to that server, making it look like it's come from the original client, based on the info we got from the original client in 1). When we reach the end of the data we received in 1), stop spoofing and hand over the connection.

    Boxes like the Arrowpoint CS100/CS800 do this sort of thing, but their configurable options for 2) aren't complex enough.

  4. Forward UDP packets after looking inside the payload, seeing if it's an ICP packet and if so forwarding based on the ICP payload.

The reason for 1, 2 and 3 is to have deterministic location of cached data, to eliminate redundancy in the cache system, and to reduce intra-cache cluster communication. The reason for 4 is that the clients of our caches are caches themselves - they're the UK National Academic root-level caches, servicing about 3/4 of a billion requests per month during peak periods.

Also, 2) can be used in future to implement a cache-digest server to serve a single cache digest for the entire cluster, eliminating the delays for clients caused by ICP. (During peak periods this delay is large.)

The boxes from Arrowpoint can do 1-3, but not 4 for example, and being proprietary hardware...

Essentially the ICP + cache digest thing for the cluster is the biggest nut - squid 2.X in a CARP mode can do something similar to 1, 2 and 3, at the expense of having to handle a large number of TCP streams, but it wouldn't provide a useful ICP service (it would always return ICP MISS), and can't provide the cache digest service (or would at least return empty digests).

I have a good solution to use LVS to build a high-performance proxy server.

Fine (to an extent) for the case where requests are coming from browsers, but bad where the clients are caches:

  • Client sends an ICP packet to the Linux Director, which forwards it to cache X, resulting in an ICP HIT reply; so the client sends the HTTP request to the Linux Director, which forwards the request to cache Y, resulting in a MISS. Y grabs the object from X, so no major loss.
  • However, the same scenario with X and Y flipped results in an ICP MISS where there would've been a cluster hit. Pretty naff really!

It's even worse with cache digests...

We've had to say this sort of thing to people like Alteon, etc too... (Lots more details in an overview at http://epsilon3.mcc.ac.uk/~zathras/WIP/Cache_Cooperation/).

later...

A slightly better discussion of how these techniques can be used is at http://www.ircache.net/Cache/Workshop99/Papers/johnson-0.ps.gz

Wensong

I have read the paper "Increasing the performance of transparent caching with content-aware cache bypass". If no inter-cache cooperation is involved, it can easily be done on Linux; you don't need to buy an expensive Arrowpoint. As for availability, Linux boxes are reliable now. :)

I can modify the transparent proxy on Linux to do such content-aware bypass and content-aware switching. The content-aware bypass will allow non-cacheable objects to be fetched directly, and the content-aware switching can keep the content of your cache cluster from overlapping.

39.6. What is TCPSP?

Aihua Liu liuah (at) langchaobj (dot) com (dot) cn May 26, 2003

What is TCPSP?

Horms

I believe there is a small amount of documentation on TCPSP in the tarball provided.

IPVS is an implementation of layer 4 loadbalancing for the Linux Kernel. That is, it allows you to create virtual services which load balance traffic to realservers.

TCPSP is an implementation of TCP socket splicing for the Linux kernel. This means that you can open a pair of sockets in user space and then join them together in the kernel. For example, a daemon might listen for connections from end-users and then make a corresponding connection to a realserver. After the connection is set up, the two sockets are spliced and all further traffic between the sockets is handled by the kernel. This avoids the kernel-to-userspace and userspace-to-kernel copies and context switches that are required to handle the sockets in user space.

In short, TCPSP may be useful for implementing layer 7 switching, where the beginning of a connection is handled in user space but the remainder of the connection is handled in the kernel.

From what I know, TCPSP works as follows:

  1. The client first builds up a connection with the director.
  2. After the director receives the client's request, it sets up a connection with the selected realserver and sends the request to it.
  3. The realserver sends replies to the director.
  4. The director receives the replies from the realserver and responds to the client through TCP splicing.

Is this correct?

Horms

That is how an application that uses TCPSP could work, and it is more or less how the demonstration programme works. But TCPSP is just a mechanism to allow you to splice connections; it is not a programme itself.

Alexandre Cassen Alexandre (dot) Cassen (at) wanadoo (dot) fr

In step 2, the director parses the client request at layer 7, then selects the remote realserver to connect to (running the loadbalancing algorithm). Once selected, the director creates a new socket to the selected realserver and forwards the incoming client request at layer 7. In the Linux kernel, zerocopy is used to speed up forwarding of the client request.

In step 4, to speed up the process, both sockets are spliced as soon as the realserver has been selected and the connection established. That way we optimize the time for upstream forwarding.

We also need to know how to unsplice the socket pair. Unsplicing can be done using an application-specific header to detect the end of the forwarded stream. For HTTP/SSL we can, for example, use the "Content-Length:" header value to detect the end of the stream and unsplice the socket pair. But since most webservers use HTTP/1.1 keepalive (persistent) connections, there is a big benefit in keeping the socket pair spliced as long as possible, so that the webserver on the realserver is still connected after the first GET is processed. Thus we optimize the forwarding, but it is more difficult to know when the socket pair must be unspliced.

Does the current implementation support unsplicing at all?

Alexandre

I spoke about this with Wensong last month, but we haven't done it yet.

I agree that it probably isn't needed very much, especially in the keepalive case.

Yes, but we need to handle failover (dead realservers), mainly the problem of a peer connection error being reflected to the socket descriptor.