19. LVS: Transparent proxy (TP or Horms' method)

Horms worked out that transparent proxy could be used in an LVS.

Transparent proxy is a piece of Linux kernel code which allows a packet destined for an IP _not_ on the host, to be accepted locally, as if the IP was on the host. Transparent proxy is the mechanism by which you can make a director (or realserver) work without having the VIP configured on it.

Transparent proxy allows the realserver to solve The Arp Problem. The director sends the packets to the MAC address of the realserver; transparent proxy tells the realserver to accept the packet with dst_addr=VIP (even though this IP is not on the realserver); since there is no VIP on the realserver, it does not reply to arp queries for the VIP.

Without the VIP on a machine, methods other than the normal IP routing are required to deliver packets with dst_addr=VIP (see routing to a director without a VIP).

A VIP-less way of setting up an LVS is firewall mark (fwmark). There an incoming packet is marked and the mark (rather than a VIP) is used to forward the packet. In the case of using a fwmark on a director, the packet still has to be accepted on the director. For this you need the VIP on the outside NIC or you need TP. It would be nice for an LVS if a packet with a fwmark that is in the ipvsadm table (i.e. this is a packet to be forwarded by ip_vs) could be accepted by the node without having to also put the VIP on the node (and without using TP), by a modification of the LVS code. In principle this is possible and Julian would write it, if he thought it was going to be used. At the moment I'm the only one asking for it.
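As a rough illustration of the fwmark approach just described, the sketch below prints (it does not run) the commands that would mark packets addressed to the VIP and define an LVS virtual service keyed on that mark. The addresses, the mark value 1 and the function name fwmark_cmds are example assumptions, and the iptables form applies to 2.4 kernels. The director still needs some way to accept the marked packets, as discussed above.

```shell
#!/bin/sh
# Dry-run sketch: print the commands for a fwmark-based LVS-DR service.
# VIP, RIPS, MARK and the function name are illustrative assumptions.
VIP="192.168.1.110"
RIPS="192.168.1.2 192.168.1.3"
MARK=1

fwmark_cmds() {
    # mark incoming http packets addressed to the VIP
    echo "iptables -t mangle -A PREROUTING -d $VIP -p tcp --dport 80 -j MARK --set-mark $MARK"
    # virtual service identified by the mark instead of VIP:port
    echo "ipvsadm -A -f $MARK -s rr"
    for rip in $RIPS
    do
        # -g selects direct routing (LVS-DR)
        echo "ipvsadm -a -f $MARK -r $rip -g"
    done
}

fwmark_cmds
```

Once the printed commands look right for your setup, they can be run by hand as root on the director.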

Note

Feb 2003: The TP implementation for stock 2.4.x kernels behaves differently from the one in 2.2. For 2.4 the packet is accepted locally with the primary IP of the NIC, rather than the VIP, as for 2.2 kernels. This makes 2.4 TP unusable for LVS directors, although it still works fine for web-caches (i.e. squids), its original purpose. I talked to Harald Welte at the 2001 Ottawa Linux Symposium: there had been much discussion on the netfilter mailing lists as to whether to preserve the original behaviour. Since no-one (that they knew about) needed the original behaviour, that functionality was dropped. It seems too late to restore the functionality to netfilter now.

Some of the functionality that LVS wants out of TP is available via firewall mark (fwmark) and so the issue is probably moot now, and we're not going to ask the netfilter people to restore the original TP functionality for 2.4 kernels.

It's possible to patch the code for LVS, but this would require someone to keep track of the netfilter code for each version of the kernel. RedHat has patched its kernels to restore the TP functionality and Ratz maintains patches for the standard kernel (see below).

Most of the writeup for 2.4 kernels in this section was my efforts to find out what was happening with 2.4 TP. This section of the HOWTO will have to be rewritten when people start using the 2.4 TP patches.

Take-home lesson: TP only works for LVS (directors and realservers) on 2.0 and 2.2 kernels. For 2.4 (and higher) TP only works on realservers for LVS to handle the Arp problem.

Note: web caches (proxies) can operate in transparent mode, when they cache all IPs on the internet. In this mode, requests are received and transmitted without changing the port numbers (i.e. port 80 in and port 80 out). In a normal web cache, the clients are asked to reconfigure their browsers to use the proxy, some_IP:3128. It is difficult to get clients to do this, and the solution is transparent caching. This is more difficult to set up, but all clients will then use the cache.

In the web caching world, transparent caching is often called "transparent proxy" because it is implemented with transparent proxy. In the future, it is conceivable that transparent web caching will be implemented by another feature of the tcpip layer and it would be nice if functionality of transparent web caching had a name separate from the command that is used to implement it.

19.1. setting up routing and packet delivery to the director

To use TP in an LVS, packets from the client have to be delivered to a machine which does not have the IP of the dst_addr of the client's packets (i.e. the VIP). Read the part of the section on routing and delivery concerned with routing packets to machines without the dst_addr.

19.2. General

This is Horms' (horms (at) vergenet (dot) net) method (also called the transparent proxy or TP method). It uses the transparent proxy feature of ipchains to let the host (director or realserver) accept packets with dst=VIP when it doesn't have the IP (e.g. the VIP) on a device. It can be used on the realservers (where it handles The Arp Problem) or on the director to accept packets for the VIP. When used on the director, TP allows the director to be the default gw for LVS-DR (see martian modification).

Unfortunately the 2.2 and 2.4 versions of transparent proxy are as different as chalk and cheese in an LVS. Presumably the functionality has been maintained for transparent web caching but the effect on LVS has not been considered.

You can use transparent proxy for

  • 2.2.x, director and realservers
  • 2.4.x, realservers only (where it handles the Arp problem)

(Historical note from Horms:) From memory I was getting a cluster ready for a demo at Internet World, New York which was held in October 1999. The cluster was to demo all sorts of services that Linux could run that were relevant to ISPs. Apache, Sendmail, Squid, Bind and Radius I believe. As part of this I was playing with LVS-DR and spotted that the realservers couldn't accept traffic for the VIP. I had used Transparent Proxying in the past so I tried it and it worked. That cluster was pretty cool, it took me a week to put it together and it was an ISP in an albeit very large box.

Transparent proxy is only implemented in Linux.

  • 2.2.x you need IP masquerading, transparent proxying and IP firewalls turned on.
  • 2.4.x, TP is a standard part of the kernel build, there is no separate TP option. In the netfilter options, there are the options under "Full NAT (NEW)" MASQUERADE, REDIRECT. I suspect you need all these.

Julian

Transparent proxy support calls ip_local_deliver from where the LVS code is reached. One of the advantages of this method is that it is easy for a director and realserver to exchange roles in a failover setup.

19.3. How you use TP

This is a demonstration of TP using 2 machines: a realserver (which will accept packets by TP) and a client (i.e. this is not an LVS).

On the realserver: ipv4 forwarding must be on.

echo "1" > /proc/sys/net/ipv4/ip_forward

You want your realserver to accept telnet requests on an IP that is not on the network (say 192.168.1.111). Here's the result of commands run at the server console before running the TP code, confirming that you can't ping or telnet to the IP.

realserver:# ping 192.168.1.111
PING 192.168.1.111 (192.168.1.111) from 192.168.1.11 : 56(84) bytes of data.
From realserver.mack.net (192.168.1.11): Destination Host Unreachable

realserver:# telnet 192.168.1.111
Trying 192.168.1.111...
telnet: Unable to connect to remote host: No route to host

so add a route and try again (lo works here, eth0 doesn't)

realserver:# route add -host 192.168.1.111 lo
realserver:# telnet 192.168.1.111
Trying 192.168.1.111...
Connected to 192.168.1.111.
Escape character is '^]'.

Welcome to Linux 2.2.16.
realserver login:

This shows that you can connect to the new IP from the localhost. No transparent proxy involved yet.

Now go to another machine on the same network and add a route to the new IP.

client:# route add -host 192.168.1.111 gw 192.168.1.11
client:# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.111   192.168.1.11    255.255.255.255 UGH       0 0          0 eth0
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo

raw sockets work between the client and server -

client:# traceroute 192.168.1.111
traceroute to 192.168.1.111 (192.168.1.111), 30 hops max, 40 byte packets
 1  server.mack.net (192.168.1.11)  0.634 ms  0.433 ms  0.561 ms

however you can't ping (i.e. icmp doesn't work) or telnet to that IP from the other machine.

client:# ping 192.168.1.111
PING 192.168.1.111 (192.168.1.111) from 192.168.1.9 : 56(84) bytes of data.
From realserver.mack.net (192.168.1.11): Time to live exceeded

client:# telnet 192.168.1.111
Trying 192.168.1.111...
telnet: Unable to connect to remote host: No route to host

Here's the output of tcpdump running on the target host

14:09:09.789132 client.mack.net.1101 > tip.mack.net.telnet: S 1088013012:1088013012(0) win 32120 <mss 1460,sackOK,timestamp 7632700[|tcp]> (DF) [tos 0x10]
14:09:09.791205 realserver.mack.net > client.mack.net: icmp: time exceeded in-transit [tos 0xd0]

(Anyone have an explanation for this, apart from the fact that icmp is not working? Is the lack of icmp the only thing stopping the telnet connect?)

The route to 192.168.1.111 is not needed for the next part.

realserver:# route del -host 192.168.1.111

Now add transparent proxy to the server to allow the realserver to accept connects to 192.168.1.111:telnet

This is the command for 2.2.x kernels

realserver:# ipchains -A input -j REDIRECT telnet -d 192.168.1.111 telnet -p tcp
realserver:# ipchains -L
Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
REDIRECT   tcp  ------  anywhere             192.168.1.111          any ->   telnet => telnet
Chain forward (policy ACCEPT):
Chain output (policy ACCEPT):

19.3.1. redirecting any port at all

In the normal functioning of an LVS, once the packet has been redirected, the director steps in and sends it to the realservers and the reply comes from the realservers. However you can use the REDIRECT to connect with a socket on a different port independently of the LVS function.

Joe, 4 Jun 2001

If I have 2 boxes (not part of an LVS) and on the server box I run

$ipchains -A input -j REDIRECT telnet -d serverIP 81 -p tcp

then I can telnet to port 81 on the realserver box and have a normal telnet session. I watched with tcpdump on the server and all I see is a normal exchange of packets with dest-port=81.

I thought with REDIRECT that the packet with dest-port=81 was delivered to the listener on realserverIP:telnet. How does the telnetd know to return a packet with source-port=telnet?

Julian

This is handled from the protocol, TCP in this case:

grep redirport net/ipv4/*.c

The higher layer (telnet in this case) can obtain the two dest addr/ports by using getsockname(). In 2.4 this is handled additionally by using getsockopt(...SO_ORIGINAL_DST...)

The netfilter mailing list contains examples on this issue. You can search for "getsockname"

19.3.2. For 2.4.x kernels

server:# iptables -t nat -A PREROUTING -p tcp -d 192.168.1.111 --dport telnet -j REDIRECT
server:# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
REDIRECT   tcp  --  anywhere             192.168.1.111       tcp dpt:telnet

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

You still can't ping the transparent proxy IP on the server from the client

client:# ping 192.168.1.111
PING 192.168.1.111 (192.168.1.111) from 192.168.1.9 : 56(84) bytes of data.
From server.mack.net (192.168.1.11): Time to live exceeded

The transparent proxy IP on the server will accept telnet connects

client:# telnet 192.168.1.111
Trying 192.168.1.111...
Connected to 192.168.1.111.
Escape character is '^]'.

Welcome to Linux 2.2.16.
server login:

but not requests to other services

client:# ftp 192.168.1.111
ftp: connect: No route to host
ftp>

Conclusion: The new IP will only accept packets for the specified service. It won't ping and it won't accept packets for other services.

19.4. The original 2.2 TP setup method

                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                        (router)
                           |
                 VIP=192.168.1.110 (eth0, arps)
                      __________
                     |          |
                     | director |
                     |__________|
                     DIP=192.168.1.1 (eth1, arps)
                           |
                           |
          -------------------------------------
          |                |                  |
  RIP1=192.168.1.2  RIP2=192.168.1.3   RIP3=192.168.1.4 (eth0)
   _____________     _____________      _____________
  |             |   |             |    |             |
  | realserver  |   | realserver  |    | realserver  |
  |_____________|   |_____________|    |_____________|
          |                |                  |
      (router)          (router)           (router)
          |                |                  |
          ----------------------------------------------> to client

Here's a script to run on 2.2.x realservers/directors to set up Horms' method. This is incorporated into the configure script.

#!/bin/sh
#rc.horms
#script by Joseph Mack and Horms (C) 1999, released under GPL.
#Joseph Mack jmack (at) wm7d (dot) net, Horms horms (at) vergenet (dot) net
#This code is part of the Linux Virtual Server project
#http://www.linuxvirtualserver.org
#
#
#Horms' method for solving the LVS arp problem for a LVS-DR LVS.
#Uses ipchains to redirect a packet destined for an external
#machine (in this case the VIP) to the local device.

#-----------------------------------------------------
#Instructions:
#
#1. Director: Setup normally (eg turn on LVS services there with ipvsadm).
#2. Realservers: Must be running 2.2.x kernel.
# 2.1 recompile the kernel (and reboot) after turning on the following under "Networking options"
#       Network firewalls
#       IP: firewalling
#       IP: transparent proxy support
#       IP: masquerading
# 2.2 Setup the realserver as if it were a regular leaf node on the network,
#      i.e. with the same gateway and IP as if it were in the LVS, but DO NOT
#      put the VIP on the realserver. The realserver will only have its regular IP
#      (called the RIP in the HOWTO).
#3. Edit "user configurable" stuff below
#4. Run this script
#-----------------------------------------------------
#user configurable stuff

IPCHAINS="/sbin/ipchains"
VIP="192.168.1.110"

#services can be represented by their name (in /etc/services) or a number
#SERVICES is a quoted list of space separated strings
# eg SERVICES="telnet"
#    SERVICES="telnet 80"
#    SERVICES="telnet http"
#Since the service is redirected to the local device,
#make sure you have SERVICE listening on 127.0.0.1
#
SERVICES="telnet http"
#
#----------------------------------------------------
#main:

#turn on IP forwarding (off by default in 2.2.x kernels)
echo "1" > /proc/sys/net/ipv4/ip_forward

#flush ipchains table
$IPCHAINS -F input

#install SERVICES
for SERVICE in $SERVICES
do
        {
        echo "redirecting ${VIP}:${SERVICE} to local:${SERVICE}"
        $IPCHAINS -A input -j REDIRECT $SERVICE -d $VIP $SERVICE -p tcp
        }
done

#list ipchain rules
$IPCHAINS -L input

#rc.horms----------------------------------------------

Here's the conf file for a LVS-DR LVS using TP on both the director and the realservers. This is for a 2.2.x kernel director. (For a 2.4.x director, the VIP device can't be TP - TP doesn't work on a 2.4.x director).

#-------------------------------------
#lvs_dr.conf for TP on director and realserver
#you will have to add a host route or equivalent on the client/router
#so that packets for the VIP are routed to the director
LVS_TYPE=VS_DR
INITIAL_STATE=on
#note director VIP device is TP
VIP=TP lvs 255.255.255.255 lvs
DIP=eth0 dip 192.168.1.0 255.255.255.0 192.168.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr realserver1 realserver2
#note realserver VIP device is TP
SERVER_VIP_DEVICE=TP
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=client
#----------end lvs_dr.conf------------------------------------

Here's the output from ipchains -L showing the redirects for just the 2.2.x director

Chain input (policy ACCEPT):
target     prot opt     source                destination           ports
REDIRECT   tcp  ------  anywhere             lvs2.mack.net         any ->   telnet => telnet
REDIRECT   tcp  ------  anywhere             lvs2.mack.net         any ->   telnet => telnet
REDIRECT   tcp  ------  anywhere             lvs2.mack.net         any ->   www => www
REDIRECT   tcp  ------  anywhere             lvs2.mack.net         any ->   www => www
Chain forward (policy ACCEPT):
Chain output (policy ACCEPT):

19.5. Transparent proxy for 2.4.x (and presumably 2.6.x)

For 2.4.x kernels transparent proxy is built on netfilter and is installed with ip_tables (not ipchains as with 2.2.x kernels).

Note

You need ip_tables support in the kernel and the ip_tables module must be loaded. The ip_tables module is incompatible with the ipchains module (which in 2.4.x is available for compatibility with scripts written for 2.2.x kernels). If present, the ipchains module must be unloaded. You shouldn't be running ipchains on 2.4.x kernels anymore and you should have changed over to ip_tables.
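The module requirement in this note can be checked before installing any rules. The following is only a sketch: the function name check_tp_modules is made up for illustration, and it assumes the module names appear in /proc/modules as "ipchains" and "ip_tables" (code built into the kernel rather than loaded as a module will not be listed there).

```shell
#!/bin/sh
# Sketch: warn if the ipchains compatibility module is loaded, and report
# whether ip_tables is listed. The optional argument (a file in the format of
# /proc/modules) exists only to make the function easy to try out.
check_tp_modules() {
    modules_file="${1:-/proc/modules}"
    if grep -q '^ipchains ' "$modules_file"; then
        echo "ipchains is loaded: unload it (rmmod ipchains) before using iptables"
        return 1
    fi
    if grep -q '^ip_tables ' "$modules_file"; then
        echo "ip_tables is loaded"
    else
        echo "ip_tables is not listed: try modprobe ip_tables (it may also be built in)"
    fi
}

# only consult the live system if /proc/modules is available
if [ -r /proc/modules ]; then
    check_tp_modules || true
fi
```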

Unfortunately the transparent proxy that comes with 2.4 kernels does not work for LVS. The packet arrives locally with the IP of the NIC which accepts the packet, rather than with an unchanged IP (the VIP). This still allows a squid to work, but is useless for LVS. The netfilter people didn't realise that someone (i.e. LVS) had found a use for the original behaviour and it was dropped from the 2.4 code.

Balazs Scheidler bazsi (at) balabit (dot) hu has written a netfilter patch which restores the original functionality of tproxy, for the firewall Zorp (note: no-one has tested it with LVS yet). Here is Balazs' 2.4 transparent proxy patches README. (In previous HOWTO's, I incorrectly attributed the patch to Ratz. My apologies to Balazs. Ratz has written a tproxy patch for LVS as part of his job, but he is not allowed to release the code - it seems I confused the two patches.)

Mike McLean mikem (at) redhat (dot) com 04 Dec 2002

The patch for 2.4 kernels should be shipped by RedHat. If not please file a bug at bugzilla.redhat.com.

If RedHat is patched with Balazs' code, then it is possible that it has been tested with LVS (RedHat doesn't necessarily test their released code).

(Dec 2002). Nearly all the following section is me figuring out that TP for 2.4 doesn't work for LVS. It will have to be rewritten as Balazs's patches are incorporated into LVS. (Mar 2006, seems like no-one is using them.)

The command for installing transparent proxy with iptables for 2.4.x came from looking in Daniel Kiracofe's drk (at) unxsoft (dot) com Transparent Proxy with Squid mini-HOWTO and guessing the likely command. It turns out to be

director:# iptables -t nat -A PREROUTING [-i $SERVER_NET_DEVICE] -d $VIP -p tcp \
	--dport $SERVICE -j REDIRECT

(where $SERVICE = telnet, $SERVER_NET_DEVICE = eth0).

Here's the result of installing the VIP by transparent proxy on one of the realservers.

realserver:~# iptables -L -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
REDIRECT   tcp  --  anywhere             lvs2.mack.net      tcp dpt:telnet
REDIRECT   tcp  --  anywhere             lvs2.mack.net      tcp dpt:http

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

This works fine for the realserver allowing it to accept packets for the VIP, without having the VIP on an ethernet device (eg lo, eth0).
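The rc.horms script earlier in this section is written for ipchains. A minimal sketch of an iptables counterpart for a 2.4.x realserver might look like the following. It is not part of the configure script; VIP, SERVICES and DEVICE are example values, and RUN is a hypothetical switch: when it is empty the rules are only printed, when it is set (and you are root) they are installed.

```shell
#!/bin/sh
# Sketch of an iptables (2.4.x) counterpart to rc.horms for a realserver.
# All variable names and defaults here are illustrative assumptions.
VIP="${VIP:-192.168.1.110}"
SERVICES="${SERVICES:-telnet http}"
DEVICE="${DEVICE:-}"                    # optional incoming device, e.g. eth0
IPTABLES="${IPTABLES:-/sbin/iptables}"

tp_rule() {
    # print one REDIRECT rule for the service named in $1
    if [ -n "$DEVICE" ]; then
        echo "$IPTABLES -t nat -A PREROUTING -i $DEVICE -d $VIP -p tcp --dport $1 -j REDIRECT"
    else
        echo "$IPTABLES -t nat -A PREROUTING -d $VIP -p tcp --dport $1 -j REDIRECT"
    fi
}

for SERVICE in $SERVICES
do
    rule=$(tp_rule "$SERVICE")
    echo "$rule"
    if [ -n "${RUN:-}" ]; then
        # install the rule (requires root and the ip_tables module)
        $rule
    fi
done
```

Printing the rules first makes it easy to compare them with the output of iptables -L -t nat shown above before committing to them.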

With the problems of 2.4 kernel TP for the VIP on the director, people seem to have forgotten that TP will still allow the realserver to accept packets for the VIP, solving the arp problem. Bill Omer rediscovered this a few years later

Bill Omer bill (dot) omer (at) gmail (dot) com 2 Mar 2006

Here's my setup with all the nitty gritty. I'm using rhel3as, I have all of the lvs portions of the kernel compiled as modules.

the director part is straightforward:

ifconfig eth0:0 cvg1-lvs-vip netmask 255.255.255.255 broadcast cvg1-lvs-vip up
ipvsadm -A -t cvg1-lvs-vip:0 -s wlc -p
ipvsadm -a -t cvg1-lvs-vip:0 -r cvg1-app-101 -g
ipvsadm -a -t cvg1-lvs-vip:0 -r cvg1-app-102 -g
ipvsadm -a -t cvg1-lvs-vip:0 -r cvg1-app-103 -g
ipvsadm -a -t cvg1-lvs-vip:0 -r cvg1-app-104 -g
ipvsadm -a -t cvg1-lvs-vip:0 -r cvg1-app-105 -g

The realserver(s)

RIP:
iptables -t nat -F
iptables -t nat -A PREROUTING -d cvg1-lvs-vip -p tcp --dport 0:65535 \
	-j REDIRECT
echo 1 > /proc/sys/net/ipv4/ip_forward

As far as ports 0:65535 goes, I know it's a security risk. It's as secure as the RIPs themselves. I plan on having about 30-40 thin clients boot up over the network (PXE, which I'd like in time to be lvs'd) to an xdm. After I get some stress testing done and pinpoint some more bugs here and there, I'll narrow down the port range to be a little more compliant with rudimentary security measures. However, everything is being run over a local lan and nothing is exposed to the wild wild web.

If you do the same with TP on the director, set up for an LVS with (say) telnet forwarded in the ipvsadm tables, then the telnet connect request from the client is accepted by the director, rather than forwarded by ipvs to the realservers (tcpdump sees a normal telnet login to the director). Apparently iptables is sending the packets to a place where ipvs can't get at them.

Joe

I have got TP to work on a LVS-DR telnet 2.4 realserver with the command

#iptables -t nat -A PREROUTING -p tcp -d $VIP --dport telnet -j REDIRECT

When I put the VIP onto the director this way, the LVS doesn't work. I connect to the director instead of the realservers. ipvsadm doesn't show any connections (active or otherwise)

If I run the same command on the director, with ipvsadm blank (ie no LVS configured), then I connect to the director from the client (as expected) getting the director's telnet login.

I presume that I'm coming in at the wrong place in the input chain of the director and ipvsadm is not seeing the packets?

Julian

I haven't tried tproxy in 2.4 but in theory it can't work. The problem is that netfilter implements tproxy by mangling the destination address in the prerouting. LVS requires from the tproxy implementation only to deliver the packet locally and not to alter the header. So, I assume LVS detects the packets with daddr=local_addr and refuses to work.

Netfilter maintains a sockopt SO_ORIGINAL_DST that can be used from the user processes to obtain the original dest addr/port before they are mangled in the pre routing nat place. This can be used from the squids, for example, to obtain these original values.

If LVS wants to support this broken tproxy in netfilter we must make a lookup in netfilter to receive the original dst and then again to mangle (for 2nd time) the dst addr/port. IMO, this is very bad and requires LVS always to require netfilter nat because it will always depend on netfilter: LVS will be compiled to call netfilter functions from its modules.

So, the only alternative remains to receive packets with advanced routing with fwmark rules. There is one problem in 2.2 and 2.4 when the tproxy setups must return ICMP to the clients (they are internal in such setup), for example, when there is no realserver LVS returns ICMP DEST_UNREACH:PORT_UNREACH. In this case both kernels mute and don't return the ICMP. icmp_send() drops it. I contacted Alexey Kuznetsov, the net maintainer, but he claims there are more such places that must be fixed and "ip route add table 100 local 0/0 dev lo" is not a good command to use. But in my tests I don't have any problems, only the problem with dropped ICMP replies from the director.

So, for TP, I'm not sure if we can support it in the director. Maybe it can work for the realservers, and even when the packet is mangled I don't expect performance problems, but who knows.
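Julian's alternative of receiving packets "with advanced routing with fwmark rules" can be sketched as a pair of iproute2 commands. The sketch below only prints them. The route command is the one quoted in Julian's reply; the mark value 1, the matching "ip rule" line and the function name are assumptions added for illustration, and the packets must already have been marked (e.g. in the mangle table) for the rule to match.

```shell
#!/bin/sh
# Dry-run sketch: deliver fwmark'ed packets locally without a VIP on the host.
# MARK, TABLE and the function name are example assumptions.
MARK=1
TABLE=100

local_delivery_cmds() {
    # send packets carrying the mark to a separate routing table
    echo "ip rule add fwmark $MARK table $TABLE"
    # in that table every destination is treated as local
    echo "ip route add table $TABLE local 0/0 dev lo"
}

local_delivery_cmds
```

Note the caveat in Julian's reply: with this kind of setup, ICMP replies from the director to the clients were dropped in his tests.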

19.6. Experiments showing that 2.4TP is different to 2.2TP

These experiments were conducted with 2.2 or 2.4 kernel realservers accepting packets for the VIP by TP. I initially noticed that the connection to 2.4 realservers was not delayed by identd (which is running on my realservers). What was happening was that the realserver was accepting the packet at the RIP and generating the reply from the RIP, rather than the VIP. On my setup, the RIP is routable to the client and the client probably received the identd request directly from the realserver (I didn't figure out what was going on for a while after I did this. I originally thought this had something to do with identd).

Here's the data showing that TP behaves differently for 2.2 and 2.4 kernels. If you want to skip ahead, the piece of information you need is that the IP of the packet when it arrives on the target machine by TP, is different for 2.2 and 2.4 TP.

As we shall see, for 2.2.x the TP'ed packets arrive on the VIP, while for 2.4.x, the TP'ed packets arrive on the RIP.

19.6.1. Realserver, Linux 2.4.2 kernel accepting packets for VIP on lo:110, is delayed

Here's the tcpdump on the realserver (RS2) for a telnet request delayed by authd (the normal result for LVS). Realserver 2.4.2 with Julian's hidden patch, director 0.2.5-2.4.1. The VIP on the realserver is on lo:110.

Note: all packets on the realserver are originating and arriving on the VIP (lvs2) as expected for a LVS-DR LVS.

initial telnet request

21:04:46.602568 client2.1174 > lvs2.telnet: S 461063207:461063207(0) win 32120 <mss 1460,sackOK,timestamp 17832675[|tcp]> (DF) [tos 0x10]
21:04:46.611841 lvs2.telnet > client2.1174: S 3724125196:3724125196(0) ack 461063208 win 5792 <mss 1460,sackOK,timestamp 514409[|tcp]> (DF)
21:04:46.612272 client2.1174 > lvs2.telnet: . ack 1 win 32120 <nop,nop,timestamp 17832676 514409> (DF) [tos 0x10]
21:04:46.613965 client2.1174 > lvs2.telnet: P 1:28(27) ack 1 win 32120 <nop,nop,timestamp 17832676 514409> (DF) [tos 0x10]
21:04:46.614225 lvs2.telnet > client2.1174: . ack 28 win 5792 <nop,nop,timestamp 514409 17832676> (DF)

realserver makes authd request to client

21:04:46.651500 lvs2.1061 > client2.auth: S 3738365114:3738365114(0) win 5840 <mss 1460,sackOK,timestamp 514413[|tcp]> (DF)
21:04:49.651162 lvs2.1061 > client2.auth: S 3738365114:3738365114(0) win 5840 <mss 1460,sackOK,timestamp 514713[|tcp]> (DF)
21:04:55.651924 lvs2.1061 > client2.auth: S 3738365114:3738365114(0) win 5840 <mss 1460,sackOK,timestamp 515313[|tcp]> (DF)

after delay of 10secs, telnet request continues

21:04:56.687334 lvs2.telnet > client2.1174: P 1:13(12) ack 28 win 5792 <nop,nop,timestamp 515416 17832676> (DF)
21:04:56.687796 client2.1174 > lvs2.telnet: . ack 13 win 32120 <nop,nop,timestamp 17833684 515416> (DF) [tos 0x10]

19.6.2. Realserver, Linux 2.4.2, accepting packets for VIP by TP, is not delayed

Here's the tcpdump on the realserver (RS2) for a telnet request which connects immediately. This is not the normal result for LVS. Realserver 2.4.2 with Julian's hidden patch (not used), director 0.2.5-2.4.1. Packets on the VIP are being accepted by TP rather than on lo:0 (the only difference).

Note: some packets on the realserver (RS2) are arriving and originating on the VIP (lvs2) and some on the RIP (RS2). In particular all telnet packets from the CIP are arriving on the RIP, while all telnet packets from the realserver are originating on the VIP. For authd, all packets to and from the realserver are using the RIP.

initial telnet request

20:56:43.638602 client2.1169 > RS2.telnet: S 4245054245:4245054245(0) win 32120 <mss 1460,sackOK,timestamp 17784379[|tcp]> (DF) [tos 0x10]
20:56:43.639209 lvs2.telnet > client2.1169: S 3234171121:3234171121(0) ack 4245054246 win 5792 <mss 1460,sackOK,timestamp 466118[|tcp]> (DF)
20:56:43.639654 client2.1169 > RS2.telnet: . ack 3234171122 win 32120 <nop,nop,timestamp 17784380 466118> (DF) [tos 0x10]
20:56:43.641370 client2.1169 > RS2.telnet: P 0:27(27) ack 1 win 32120 <nop,nop,timestamp 17784380 466118> (DF) [tos 0x10]
20:56:43.641740 lvs2.telnet > client2.1169: . ack 28 win 5792 <nop,nop,timestamp 466118 17784380> (DF)

realserver makes authd request to client

20:56:43.690523 RS2.1057 > client2.auth: S 3231319041:3231319041(0) win 5840 <mss 1460,sackOK,timestamp 466123[|tcp]> (DF)
20:56:43.690785 client2.auth > RS2.1057: S 4243940839:4243940839(0) ack 3231319042 win 32120 <mss 1460,sackOK,timestamp 17784385[|tcp]> (DF)
20:56:43.691125 RS2.1057 > client2.auth: . ack 1 win 5840 <nop,nop,timestamp 466123 17784385> (DF)
20:56:43.692638 RS2.1057 > client2.auth: P 1:10(9) ack 1 win 5840 <nop,nop,timestamp 466123 17784385> (DF)
20:56:43.692904 client2.auth > RS2.1057: . ack 10 win 32120 <nop,nop,timestamp 17784385 466123> (DF)
20:56:43.797085 client2.auth > RS2.1057: P 1:30(29) ack 10 win 32120 <nop,nop,timestamp 17784395 466123> (DF)
20:56:43.797453 client2.auth > RS2.1057: F 30:30(0) ack 10 win 32120 <nop,nop,timestamp 17784395 466123> (DF)
20:56:43.798336 RS2.1057 > client2.auth: . ack 30 win 5840 <nop,nop,timestamp 466134 17784395> (DF)
20:56:43.799519 RS2.1057 > client2.auth: F 10:10(0) ack 31 win 5840 <nop,nop,timestamp 466134 17784395> (DF)
20:56:43.799738 client2.auth > RS2.1057: . ack 11 win 32120 <nop,nop,timestamp 17784396 466134> (DF)

telnet connect continues, no delay

20:56:43.835153 lvs2.telnet > client2.1169: P 1:13(12) ack 28 win 5792 <nop,nop,timestamp 466137 17784380> (DF)
20:56:43.835587 client2.1169 > RS2.telnet: . ack 13 win 32120 <nop,nop,timestamp 17784399 466137> (DF) [tos 0x10]

Evidently TP on the realserver is making the realserver think that the packets arrived on the RIP, hence the authd call is made from the RIP.
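When comparing traces like the ones above, it can help to tally which name each packet was addressed to. The following is a sketch only: the function name count_dst is made up, and it assumes the older tcpdump text format shown in this section ("time src.port > dst.port: ..."), so lines without a port suffix (e.g. icmp) will not be tallied correctly.

```shell
#!/bin/sh
# Sketch: count packets per destination host in a saved tcpdump trace.
# Reads the files named as arguments, or stdin if none are given.
count_dst() {
    awk '$3 == ">" {
        dst = $4
        sub(/:$/, "", dst)          # drop the trailing colon
        sub(/\.[^.]*$/, "", dst)    # drop the port or service suffix
        count[dst]++
    }
    END { for (h in count) print h, count[h] }' "$@" | LC_ALL=C sort
}

# usage (trace.txt is a hypothetical file holding saved tcpdump output):
#   count_dst trace.txt
```

For the 2.4 TP trace above, the telnet packets from the client would be counted under RS2 (the RIP) rather than lvs2 (the VIP), which is the difference this section is demonstrating.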

As it happens in my test setup, the client can connect directly to the RIP. (In a LVS-DR LVS, the client doesn't exchange packets with the RIP, so I haven't blocked this connection. In production, the router would not allow these packets to pass). Since the authd packets are between the RIP and CIP, the authd exchange can proceed to completion.

19.6.3. Realserver, Linux 2.2.14, accepting packets for VIP by TP, is delayed

Here's the tcpdump on the realserver (RS2) for a telnet request delayed by authd (the normal result for LVS). Realserver 2.2.14, director 0.2.5-2.4.1. Packets on the VIP are being accepted by TP rather than on lo:0.

Note: TP is different in 2.2 and 2.4 kernels. Unlike the case for the 2.4.2 realserver using TP, the packets all arrive at (and originate from) the VIP (lvs2).

initial telnet request

22:16:23.407607 client2.1177 > lvs2.telnet: S 707028448:707028448(0) win 32120 <mss 1460,sackOK,timestamp 18262396[|tcp]> (DF) [tos 0x10]
22:16:23.407955 lvs2.telnet > client2.1177: S 3961823491:3961823491(0) ack 707028449 win 32120 <mss 1460,sackOK,timestamp 21648[|tcp]> (DF)
22:16:23.408385 client2.1177 > lvs2.telnet: . ack 1 win 32120 <nop,nop,timestamp 18262396 21648> (DF) [tos 0x10]
22:16:23.410096 client2.1177 > lvs2.telnet: P 1:28(27) ack 1 win 32120 <nop,nop,timestamp 18262396 21648> (DF) [tos 0x10]
22:16:23.410343 lvs2.telnet > client2.1177: . ack 28 win 32120 <nop,nop,timestamp 21648 18262396> (DF)

authd request from realserver

22:16:23.446286 lvs2.1028 > client2.auth: S 3966896438:3966896438(0) win 32120 <mss 1460,sackOK,timestamp 21652[|tcp]> (DF)
22:16:26.445701 lvs2.1028 > client2.auth: S 3966896438:3966896438(0) win 32120 <mss 1460,sackOK,timestamp 21952[|tcp]> (DF)
22:16:32.446212 lvs2.1028 > client2.auth: S 3966896438:3966896438(0) win 32120 <mss 1460,sackOK,timestamp 22552[|tcp]> (DF)

after delay of 10secs, telnet proceeds

22:16:33.481936 lvs2.telnet > client2.1177: P 1:13(12) ack 28 win 32120 <nop,nop,timestamp 22655 18262396> (DF)
22:16:33.482414 client2.1177 > lvs2.telnet: . ack 13 win 32120 <nop,nop,timestamp 18263404 22655> (DF) [tos 0x10]
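
The timing in this trace can be checked with a little arithmetic. The following is a minimal sketch (not part of the original test procedure) that takes the timestamps straight from the dump above and computes the gaps between the three authd SYNs and the total stall before telnet resumes. The function names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the tcpdump above (2.2.14 realserver, VIP accepted by TP).
AUTH_SYN_TIMES = ["22:16:23.446286", "22:16:26.445701", "22:16:32.446212"]
LAST_ACK_TO_CLIENT = "22:16:23.410343"   # realserver acks the client's first data
TELNET_RESUMES = "22:16:33.481936"       # realserver finally sends more telnet data


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two tcpdump timestamps taken on the same day."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


def syn_gaps(times):
    """Gaps between consecutive SYNs; growing gaps indicate retransmissions."""
    return [seconds_between(a, b) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print([round(g, 1) for g in syn_gaps(AUTH_SYN_TIMES)])               # [3.0, 6.0]
    print(round(seconds_between(LAST_ACK_TO_CLIENT, TELNET_RESUMES), 1))  # 10.1
```

The gaps of about 3 and 6 seconds are consistent with the same SYN (identical sequence number) being retransmitted with a doubling timeout because nothing answers it, and the stall of about 10 seconds matches the delay noted in the trace.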

19.7. What IP are TP packets arriving on?

Note: for TP, there is no VIP on the realservers as seen by ifconfig.

Since telnetd on the realservers listens on 0.0.0.0, we can't tell which IP the packets have on the realserver after being TP'ed. tcpdump only tells you the src_addr after the packets have left the sending host.

Here's the setup for the test.

The IP of the packets after arriving by TP was tested by varying the IP (localhost, RIP or VIP) that the httpd listens to on the realservers. At the same time the base address of the web page was changed to be the same as the IP that the httpd was listening to. The nodes on each network link can route to and ping each other (eg 192.168.1.254 and 192.168.1.12).

        ____________
       |            |192.168.1.254 (eth1)
       |  client    |----------------------
       |____________|                     |
     CIP=192.168.2.254 (eth0)             |
              |                           |
              |                           |
     VIP=192.168.2.110 (eth0)             |
        ____________                      |
       |            |                     |
       |  director  |                     |
       |____________|                     |
     DIP=192.168.1.9 (eth1, arps)         |
              |                           |
           (switch)------------------------
              |
     RIP=192.168.1.12 (eth0)
     VIP=192.168.2.110 (LVS-DR, lo:0, hidden)
        _____________
       |             |
       | realserver  |
       |_____________|

The results (LVS-DR LVS) are

For 2.2.x realservers

  • the httpd can bind to the VIP, RIP and localhost.
  • LVS client gets webpage if realserver is listening to RIP or VIP.
  • LVS client does not get webpage if realserver is listening to localhost.

For 2.4.x realservers

  • httpd can bind to the RIP and localhost.
  • httpd cannot bind to the VIP.
  • LVS client gets webpage if realserver is listening to RIP.
  • LVS client does not get webpage if realserver is listening to localhost.
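
The "can bind / cannot bind" results above were obtained by changing the address that httpd listens on. As a rough stand-in, the following sketch (an assumption about how one might probe this, not the method used for the tests) tries to bind a TCP socket to each candidate address on the realserver. The addresses are the ones in the diagram; `can_bind` is a hypothetical helper name. Port 0 asks the kernel for any free port, so root is not needed.

```python
import socket

# Addresses from the test setup diagram; adjust for your own realserver.
CANDIDATES = {
    "localhost": "127.0.0.1",
    "RIP": "192.168.1.12",
    "VIP": "192.168.2.110",
}


def can_bind(ip: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to ip on this host."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((ip, port))
        return True
    except OSError:
        # Typically "Cannot assign requested address" when ip is not local.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    for name, ip in CANDIDATES.items():
        print(f"{name:9s} {ip:15s} bind ok: {can_bind(ip)}")
```

A successful bind only shows that the kernel regards the address as local; whether the LVS client then gets the webpage still has to be tested through the director.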

During tests, the browser says "connecting to VIP", then says "transferring from..."

  • LVS-DR, VIP on TP, kernel 2.4.2, "transferring data from RIP"
  • LVS-DR, VIP on TP, kernel 2.2.14, "transferring data from VIP" (or RIP)
  • LVS-DR VIP on lo:0, httpd listening to VIP, "transferring data from VIP"
  • LVS-Tun VIP on tunl0:0, httpd listening on VIP, "transferring from VIP"
  • LVS-NAT, httpd listening on RIP, "transferring data from realserver1" (or realserver2)

Some of these connections are problematic. The client in a LVS-DR LVS isn't supposed to be getting packets from the RIP. What is happening is

  • the httpd on the realserver is listening on the RIP
  • the base address of the webpage is the RIP
  • an incoming request from the client to the VIP will retrieve a webpage with references to gif etc that are at the RIP
  • the client will then ask for the gifs from the RIP.
  • in a properly configured LVS-DR LVS, the client cannot reach the RIP, so these requests would fail.
  • in the above setup, the client can connect to the RIP directly (this will not be allowed in a production server, either the router will prevent the connection, or the RIP will be a non-routable IP).
  • the client retrieves the gifs, and the rest of the page, directly from the realserver

The way to prevent this is to remove the route on the client to the RIP network (eg see removing routes not needed for LVS-DR). Doing so when the httpd is listening to the RIP and the base address is the RIP causes the browser on the client to hang. This shows that the client is really retrieving packets directly from the RIP. Changing the base address of the webpage back to the VIP allows the webpage to be delivered to the client, showing that the client is now retrieving packets by making requests to the VIP via the director.

It would seem then that with 2.4 TP, the realserver is receiving packets on the RIP, rather than on the VIP as it does with 2.2 TP. For a service listening on only 1 port (eg httpd), this means

  • the httpd has to listen on the RIP
  • the addresses on the webpage have to be for the VIP

The client will then ask for the webpage at the VIP. The realserver will accept this request on the RIP and return a webpage full of references to the VIP (eg gifs). The client will then ask for the gifs from the VIP. The realserver will accept the requests on the RIP and return the gifs.
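
In the tests the base address of the webpage was changed by hand. If pages already contain absolute references to the RIP, something like the following sketch could be used to point them at the VIP instead. It is an illustration only: `point_links_at_vip` is a hypothetical helper, and the addresses are the ones from the diagram above.

```python
import re

RIP = "192.168.1.12"
VIP = "192.168.2.110"


def point_links_at_vip(html: str, rip: str = RIP, vip: str = VIP) -> str:
    """Replace references to the RIP with the VIP in a page.

    The lookbehind/lookahead stop 192.168.1.12 from matching inside
    a longer address such as 192.168.1.123.
    """
    pattern = re.compile(r"(?<![\d.])" + re.escape(rip) + r"(?!\d)")
    return pattern.sub(vip, html)


if __name__ == "__main__":
    page = '<base href="http://192.168.1.12/"><img src="http://192.168.1.12/logo.gif">'
    print(point_links_at_vip(page))
```

With the references rewritten, the client's follow-up requests (eg for gifs) go to the VIP and through the director, even though the httpd itself is listening on the RIP.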

19.8. Take home lesson for setting up TP on realservers

19.8.1. 2.2.x

Let httpd listen on VIP or RIP, return pages with references to VIP.

19.8.2. 2.4.x

Let httpd listen on RIP, return pages with references to VIP.

19.9. Handling identd requests from 2.4.x LVS-DR realservers using TP

Since the identd request is coming from the RIP (rather than the VIP) on the realserver, you can use Julian's method for NAT'ing client requests from realservers.

19.10. Performance of Transparent Proxy

Using transparent proxy instead of a regular ethernet device has slightly higher latency, but the same maximum throughput.

For performance of transparent proxy compared to accepting packets on an ethernet device see the performance page.

Transparent proxy requires reprocessing of incoming packets, and could have a similar speed penalty as LVS-NAT. However only the incoming packets are reprocessed. Initial results (before the performance tests above) were not encouraging.

Doug Bagley doug (at) deja (dot) com

Subject: [lvs-users] chosen arp problem solution can apparently affect performance

I was interested in seeing if the linux/ipchains workaround for the arp problem would perform just as well as the arp_invisible kernel patch. It is apparently much worse.

I ran a test with one client running ab ("apache benchmark"), one director, and one realserver running Apache. They are all various levels of pentium desktop machines running 2.2.13.

Using the arp_invisible patch/dummy0 interface, I get 226 HTTP requests/second. Using the ipchains redirect method, I get 70 requests per second. All other things remained the same during the test.
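
To put the two figures side by side, here is a small worked calculation (a sketch using only the numbers quoted above). Since the ab concurrency level is not given, the "ms per request" values are average times between completed requests, not measured per-request latencies.

```python
# Figures quoted in the posting above (requests per second).
ARP_INVISIBLE_RPS = 226
REDIRECT_RPS = 70


def ms_per_request(requests_per_second: float) -> float:
    """Average wall-clock milliseconds per completed request."""
    return 1000.0 / requests_per_second


def slowdown(fast_rps: float, slow_rps: float) -> float:
    """How many times fewer requests/second the slower method handles."""
    return fast_rps / slow_rps


if __name__ == "__main__":
    print(f"arp_invisible: {ms_per_request(ARP_INVISIBLE_RPS):.1f} ms/request")  # 4.4
    print(f"redirect:      {ms_per_request(REDIRECT_RPS):.1f} ms/request")       # 14.3
    print(f"slowdown:      {slowdown(ARP_INVISIBLE_RPS, REDIRECT_RPS):.1f}x")    # 3.2x
```

In this one test the redirect method handled roughly a third of the request rate of the arp_invisible method.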

See the performance page for discussion and sample graphs of hits/sec for http servers. Hits/sec can increase to high levels as the payload decreases in size. While large numbers for hits/sec may be impressive, they only indicate one aspect of a web server's performance. If large (> 1 packet) files are transferred/hit or computation is involved, then hits/sec is not a useful measure of web performance.

Here's the current explanation for the increased latency of transparent proxy.

Kyle Sparger ksparger (at) dialtoneinternet (dot) net

Logically, it's just a function of the way the redirect code operates.

Without redirect:
Ethernet -> TCP/IP -> Application -> TCP/IP -> Ethernet

With redirect:
Ethernet -> TCP/IP -> Firewall/Redirect Code -> TCP/IP -> Application -> TCP/IP -> Ethernet

That would definitely explain the slowdown, since _every single packet_ received is going to go through these extra steps.

Other people are happy with TP

Jerry Glomph Black black (at) real (dot) com Nov 99 (or thereabouts)

The revival of Horms' posting, which I overlooked a month ago, was a lifesaver for us. We had a monster load distribution problem, and spread 4 virtual IP numbers across 10 'real' boxes (running Roxen, a fantastic web platform). The ipchains-REDIRECT feature works perfectly, without any of that arp aggravation! A PII_450 held up just fine at 20 megabits/s of HTTP -REQUEST- TRAFFIC!

Here's Jerry 18 months later.

Jerry Glomph Black black (at) prognet (dot) com 06 Jul 2001

The ipchains/iptables REDIRECT method (introduced to this list by Mr Horms a long time ago) works fine, we've used it in production in the past.

However, at -very- high packet loads it is far less CPU-efficient than getting the ARP settings correctly working. The REDIRECT method was bogging down our LVS boxes during peak traffic, something which does not happen with doing it the 'right way' with LVS-DR and silent arp-less interfaces on the real servers.

19.11. The difference between REDIRECT and TPROXY

Horms

REDIRECT works by changing the destination IP address to a local address so that it ends up in the LOCAL_IN chain.

Note
REDIRECT with 2.2 kernels was the original basis for "Horms' method"
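
Horms' description of REDIRECT can be illustrated with a toy model. This is not kernel code and does not call netfilter; it only mimics the idea that the destination address is rewritten to a local address so that the packet is delivered locally. The class and function names are made up for the illustration, and the addresses come from the test setup earlier in this chapter.

```python
from dataclasses import dataclass, replace

CIP = "192.168.2.254"
VIP = "192.168.2.110"
RIP = "192.168.1.12"
LOCAL_ADDRS = {"127.0.0.1", RIP}   # no VIP configured on the realserver


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def delivered_locally(pkt: Packet, local_addrs=LOCAL_ADDRS) -> bool:
    """A packet is handed to local sockets only if dst is a local address."""
    return pkt.dst in local_addrs


def redirect(pkt: Packet, local_addr: str = RIP) -> Packet:
    """Toy REDIRECT: rewrite the destination to a local address."""
    return replace(pkt, dst=local_addr)


if __name__ == "__main__":
    incoming = Packet(src=CIP, dst=VIP)
    print(delivered_locally(incoming))            # False: VIP is not on this host
    redirected = redirect(incoming)
    print(redirected, delivered_locally(redirected))
```

In this model the application sees the packet addressed to the RIP rather than the VIP, which is the behaviour observed with 2.4 realservers in the tests above.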

Joe Oct 03, 2003

Is the original 2.4.x REDIRECT disaster (see TP_2.4_problems) fixed now?

TPROXY looks like it would work because it is completely different from REDIRECT and uses its own connection tracking. REDIRECT uses netfilter's internal connection tracking routines. Because of the way that LVS is implemented, these do not work for packets that are handled by LVS. Thus the connection tracking for REDIRECT does not work. Thus the return packets from the realservers are not modified and the connection fails. From my reading TPROXY uses its own connection tracking routines (though for what reason I am not sure). These routines probably aren't affected by LVS and thus TPROXY should work.

N.B: I have not verified this.