LVS-Tun is an LVS original. It is based on LVS-DR. The LVS code encapsulates the original packet (CIP->VIP) inside an ipip packet of DIP->RIP, which is then put into the OUTPUT chain, where it is routed to the realserver. (There is no tunl0 device on the director; ip_vs() does its own encapsulation and doesn't use the standard kernel ipip code. This possibly is the reason why PMTU on the director does not work for LVS-Tun - see MTU.) The realserver receives the packet on a tunl0 device (see need tunl0 device) and decapsulates the ipip packet, revealing the original CIP->VIP packet.
Initially only Linux could decapsulate IPIP packets, but FreeBSD and w2k can now do it too (2005: Microsoft has since dropped support for IPIP).
If you want to try a test LVS-Tun setup on the bench, take a standard LVS-DR setup (see the LVS-DR example), change lo on the realservers to tunl0 (and handle the ARP problem on tunl0), and change the ipvsadm switch from -g to -i. If your clients are going to be sending large packets, you need to set the MTU for the ipip packet DIP->RIP (see MTU). This can be done on the realserver with iptables (see tunl MTU solved) or iproute2 (see setting the MTU by route).
As with LVS-DR, the director doesn't know about the VIP on the realserver (it only knows about the RIP). Health checking of a service listening on the VIP on the realserver must therefore use a connection between the DIP and the RIP (if the daemon is listening on both the RIP and the VIP, the service listening on the RIP can act as a proxy for the service listening on the VIP).
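A minimal sketch of such a check, run on the director. It assumes the service also accepts connections on the RIP and that bash and the coreutils timeout command are installed. The function names are hypothetical, and the ipvsadm command is only printed, never executed.

```shell
#!/bin/sh
# Sketch of a director-side health check for an LVS-Tun realserver.
# Assumption: the daemon listens on the RIP as well as the VIP, so a TCP
# connect from the director to RIP:port stands in for the service on the VIP.

# check_service HOST PORT
# Succeeds if a TCP connection to HOST:PORT opens within 3 seconds.
# Relies on bash's /dev/tcp redirection, hence the explicit "bash -c".
check_service() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# report_realserver VIP RIP PORT
# Prints the state of the realserver and, if the check failed, the ipvsadm
# command that would remove it from the virtual service.
report_realserver() {
    if check_service "$2" "$3"; then
        echo "realserver $2:$3 is up"
    else
        echo "realserver $2:$3 is down; consider: ipvsadm -d -t $1:$3 -r $2"
    fi
}

# Example with the addresses of the test setup:
# report_realserver 192.168.1.110 192.168.1.2 23
```

In practice a monitoring tool such as mon would run a check like this periodically and adjust the ipvsadm table itself.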
LVS-Tun allows the realservers to be geographically remote from the director (this is the main point of LVS-Tun). If your realservers cannot do ipip decapsulation, you can still have geographically remote realservers using other techniques (see non tunnelling realservers).
(see also Julian's LVS-Tun write up and postings to the mailing list).
Here's an example set of IPs for a LVS-Tun setup. For (my) convenience the servers are on the same network as the client. The only restrictions for LVS-Tun with remote hosts are that the client must be able to route to the director and that the realservers must be able to route to the client (the return packets to the client come directly from the realservers and do not go back through the director).
Normally for LVS-Tun, the client is on a different network from the director/server(s), and each server has its own route to the outside world. In the simple test case below, where all machines are on the 192.168.1.0 network, there would be no default route for the servers, and packets from the servers to the client would be routed through the device on the 192.168.1.0 network (presumably eth0). In real life, the realservers would have their own router/connection to the internet, and packets returning to the client would go through this router. In any case, reply packets do not go back through the director.
    Machine          IP
    client           CIP=192.168.1.254
    director         DIP=192.168.1.1
                     VIP=192.168.1.110 (arps, IP clients connect to)
    realserver-1     RIP1=192.168.1.2, VIP (tunl0, non-arping, 192.168.1.110)
    realserver-2     RIP2=192.168.1.3, VIP (tunl0, non-arping, 192.168.1.110)
    realserver-3     RIP3=192.168.1.4, VIP (tunl0, non-arping, 192.168.1.110)
    .
    .
    realserver-n     RIPn=192.168.1.n+1, VIP (tunl0, non-arping, 192.168.1.110)
    #lvs_tun.conf
    LVS_TYPE=VS_TUN
    INITIAL_STATE=on
    VIP=eth0:110 192.168.1.110 255.255.255.255 192.168.1.110
    DIP=eth0 192.168.1.9 192.168.1.0 255.255.255.0 192.168.1.255
    DIRECTOR_DEFAULT_GW=client
    SERVICE=t telnet rr realserver1 realserver2
    SERVER_VIP_DEVICE=tunl0
    SERVER_NET_DEVICE=eth0
    SERVER_DEFAULT_GW=client
    #----------end lvs_tun.conf------------------------------------
                          ________
                         |        |
                         | client |
                         |________|
                      CIP=192.168.1.254
                              |
                   CIP->VIP | | ^
                            v | | VIP->CIP
                              |
            VIP=192.168.1.110 |
               (eth0:1, arps) |
               __________     |
              |          |    |
              | director |-------
              |__________|    |
            DIP=192.168.1.1   |
                 (eth0)       |
                              |
         DIP->RIP(CIP->VIP) | |
                            v |
           -------------------------------------
           |                 |                 |
           |                 |                 |
    RIP1=192.168.1.2  RIP2=192.168.1.3  RIP3=192.168.1.4 (eth0)
    VIP=192.168.1.110 VIP=192.168.1.110 VIP=192.168.1.110 (all tunl0,non-arping)
     _____________     _____________     _____________
    |             |   |             |   |             |
    | realserver  |   | realserver  |   | realserver  |
    |_____________|   |_____________|   |_____________|
Here's a likely production setup (I haven't done this one myself). It assumes the realservers are on a different network to the DIP. Here x.x.x.? and y.y.y.? are public IPs. The 176 and 10 addresses are for communication between the different locations and will be assigned by the ISP.
                        ________
                       |        |
                       | client |
                       |________|
                       CIP=x.x.x.1
                            |
                 CIP->VIP | |---------------------------------
                          v |                                 |
                       __________                             |
                      |          |                            |
                      | D-router |                            |
                      |__________|                            |
                            |                                 |
                 CIP->VIP | |                                 |
                          v |                                 |
                            |                                 |
                VIP=y.y.y.110(eth0, arps)                     |
                       __________                             |
                      |          |                            |
                      | director |                            |
                      |__________|                            |
                  DIP=176.0.0.1 (eth1)                        |
                            |                                 | ^
      DIP->RIP1(CIP->VIP) | |                                 | | VIP->CIP
                          v |                                 |
                       __________                        __________
                      |          |                      |          |
                      | R-router |  R,C-Router do not   | C-Router |
                      |__________|    advertise VIP     |__________|
                            |                                 |
                            |                                 | ^
      DIP->RIP1(CIP->VIP) | |                                 | | VIP->CIP
                          v |                                 |
                            |                                 |
              ----------------------------------------------------
                        |                             |
                        |                             |
               RIP1=10.0.0.1(eth0)           RIP2=10.0.0.2(eth0)
              VIP=y.y.y.110(tunl0)          VIP=y.y.y.110(tunl0)
                        |                             |
                _________________            ___________________
               |                 |          |                   |
               |   realserver    |          |    realserver     |
               | tunl0: CIP->VIP |          |                   |
               | eth0: VIP->CIP  |          |                   |
               |_________________|          |___________________|
Note | |
---|---|
tunl0 is a networking device like eth0, lo, and dummy0. |
In LVS-Tun, the tunl0 device holds the VIP, just as the lo device holds the VIP for LVS-DR. You need to build the tunl0 device into the Linux kernel (in networking options - IP:tunneling) - it is turned off by default. The tunnelling (ipip) can be built as a module, in which case you'll have to insmod ipip before you can use it, or you can build ipip directly into the kernel. With a kernel enabled for ipip, you should be able to see the unconfigured tunl0 device with ifconfig or with ip addr show. (Feb 2004 - my ifconfig used to see the unconfigured tunl0, but it doesn't anymore.)
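Then configure the tunl0 device (even if ifconfig can't see it), either with ifconfig

    ifconfig tunl0 192.168.1.110 netmask 255.255.255.255 broadcast 192.168.1.110

(after which the tunl0 device becomes visible to ifconfig), or with iproute2. Unlike ifconfig, ip addr add does not bring the device up, so do that explicitly.

    ip addr add dev tunl0 192.168.1.110/32 brd 192.168.1.110
    ip link set dev tunl0 up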
Note | |
---|---|
the VIP is a /32 addr, so the brd addr is the VIP, not x.x.x.255. |
If the realservers and director are on different networks (e.g. the realservers are geographically remote), then the router in front of the realservers will not be advertising routes to the VIP and you won't need to handle the ARP problem on the realservers. In effect you are using Lars' method without having to do anything special.
If the realservers are using the same router as the director you need to handle the ARP problem for the realservers (set tunl0 to not reply to arp queries). This networking is the same as for LVS-DR and you'd only do this to test LVS-Tun. (there's no other reason to use LVS-Tun with the LVS-DR network). However all my LVS-Tun test cases used the same networking as for LVS-DR, i.e. the DIP and RIPs were on the same network and only one router (actually none, the client with 1 or 2 NICs, faced directly onto the director and realservers). In this case I had to handle the ARP problem for the realservers.
Unlike LVS-DR, with LVS-Tun the realservers can be in a different location (and on a network remote from the director), where the director and realservers will be on different networks and the realservers will be on a network that does NOT contain the VIP. If this is the case, the realservers will be generating reply packets with VIP:port->CIP (where port is the LVS'ed service). Not being on the VIP network, the routers for the realservers will have to be programmed to accept outgoing packets with src_addr=VIP:port. Routers normally drop these packets as an anti-spoofing measure. If you aren't in control of the routers, you'll just have to inform the people who are, that packets from VIP:port are valid for your business. If they don't want to help you with your business, then you should find another provider who will.
Here's part of the rc.lvs_tun script which configures the realserver with RIP=192.168.1.8
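    #setup servers for telnet
    #-r names the realserver; the original script used the older spelling -R,
    #which current ipvsadm treats as --restore
    /sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
    /sbin/ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.1 -i -w 1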
There's no forwarding in the conventional sense for LVS-Tun. (You can have ip_forward set to ON if you need it for something else, but LVS-Tun doesn't need it ON. If you don't have a good reason to have it ON, then for security turn it OFF.) For more explanation see design of ipvs for netfilter.
    #set ip_forward OFF for lvs-tun director (1 on, 0 off)
    cat /proc/sys/net/ipv4/ip_forward
    echo "0" >/proc/sys/net/ipv4/ip_forward
As with LVS-DR, for LVS-Tun, the target port numbers of incoming packets cannot be remapped. A request to port 23 on the VIP will be forwarded to port 23 on a realserver, thus no port number is used for setting up the IP of the realserver. However, you can still re-map ports for LVS-Tun outside LVS, using iptables (see Re-mapping ports with LVS-Tun).
Here's the packet headers as the request is processed by LVS-Tun.
packet | source | dest | data |
---|---|---|---|
1. request from client | CIP:3456 | VIP:23 | - |
2. ipvsadm table: director chooses server=RIP1, encapsulates into IPIP packet | DIP | RIP1 | IP datagram source=CIP:3456, dest=VIP:23, data= - |
3. realserver recovers IP datagram | CIP:3456 | VIP:23 | - |
4. realserver looks up routing table, finds VIP is local, processes request locally, generates reply | VIP:23 | CIP:3456 | "login: " |
5. packet leaves realserver via default gw, not via DIP. | | | |
For the verbally oriented...
A packet arrives at the director for the VIP. The director looks up its tables and decides to send the connection to realserver_1. The director encapsulates the request packet in an IPIP datagram with header DIP->RIP_1. The packet arrives at realserver_1, the realserver recovers the original IP datagram, looks up its routing table, finds that the VIP (on the non-arping tunl0) is local and processes the packet locally. A reply packet is generated with VIP:23->CIP:3456. The realserver looks up its routing table and finds that a packet to CIP goes out its default gw (not to the DIP).
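One way to watch these steps on a realserver is with tcpdump. This is only a sketch: it assumes the addresses of the test setup (DIP=192.168.1.1, RIP1=192.168.1.2, VIP=192.168.1.110, CIP=192.168.1.254), that eth0 carries the traffic, and that you run each command as root in its own terminal.

```shell
# Terminal 1, on the realserver: encapsulated requests from the director.
# IP protocol number 4 is IP-in-IP, so this matches the outer DIP->RIP header.
tcpdump -n -i eth0 'ip proto 4 and src host 192.168.1.1 and dst host 192.168.1.2'

# Terminal 2, on the realserver: telnet replies going straight to the client
# (VIP:23->CIP); these never pass through the director.
tcpdump -n -i eth0 'src host 192.168.1.110 and src port 23 and dst host 192.168.1.254'
```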
The tunl0 device does not arp with 2.0.36 kernels, but does with 2.2.x (and later) kernels. Look up the section on The Arp Problem to see if you need to patch the kernel on the realserver. (Joe: since kernels 2.6.4 and 2.4.26, arp_ignore/arp_announce are the preferred way of handling the arp problem.)
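For kernels with arp_ignore/arp_announce, the settings usually look like the following sketch. It assumes the VIP is on tunl0, that the ipip module is already loaded (otherwise the tunl0 entries do not exist yet), and that the commands are run as root on each realserver.

```shell
# arp_ignore=1: answer an ARP request only if the target IP is configured
# on the interface the request arrived on (so eth0 stays silent about the VIP).
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.tunl0.arp_ignore=1

# arp_announce=2: pick the best local source address for ARP requests, so
# the VIP is not announced as the source of requests sent out of eth0.
sysctl -w net.ipv4.conf.all.arp_announce=2
sysctl -w net.ipv4.conf.tunl0.arp_announce=2
```

To make the settings survive a reboot, the same keys can be placed in /etc/sysctl.conf.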
Joe
How does a packet get to a tunl device, which doesn't have a MAC address, from a remote machine?
Julian
tunl, lo and dummy are used just to configure the VIP. We don't send any packets through these devices. The requests are delivered to the realservers using their RIP. The director asks only about their RIP from ipvsadm. Only the router/gateway asks about VIP, but only the director must reply. When the packet is received in the realserver it is delivered locally (not forwarded or dropped) due to configured VIP. This is the only role of these "dummy" interfaces: the kernel to treat the received packet as it is destined to our host (the realserver). Nothing more. No IPIP encapsulations (for tunl), no MAC address definitions, nothing more. When we answer the request we use eth0. The tunl/lo/dummy is not selected as device for the outgoing packets. We have routes for eth0 (default gateway) which we use for the outgoing traffic. This is for DROUTE and TUNNEL mode.
If two linux boxes (not in an LVS) are joined by an IPIP tunnel and there is no MAC address associated with the tunl0 devices at each end of the link, then how do the packets get from one machine to the other?
Julian
The packets are encapsulated via IPIP and sent to the tunnel end's real IP, where they are decapsulated again and appear on the tunl interface. You don't need a MAC address for point-to-point links, or logical interfaces like tunnels.
Edit the template lvs_tun.conf and run the configure script
    $ ./configure_lvs.pl lvs_tun.conf

Load the parameters into the director and then the realservers with the command

    $ . ./etc/rc.d/rc.lvs_tun
(the script knows whether it is running on a realserver or the director).
(later put rc.lvs_tun in /etc/rc.d or /etc/init.d and put mon_xxx.cf in /etc/mon)
Check the output from ipvsadm, ifconfig -a and netstat -rn to confirm that the services and IPs are correct.
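A sketch of those checks, assuming the test setup above and a root shell:

```shell
# On the director: list the virtual service and its realservers numerically.
# For LVS-Tun the Forward column of each realserver line should read "Tunnel".
ipvsadm -L -n

# On each realserver: the VIP should be on tunl0 with a /32 netmask.
ifconfig -a
ip addr show dev tunl0

# On all machines: check the routes, in particular the default route that
# the realservers use for replies to the client.
netstat -rn
```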
(The material that used to be here is now in the section "turn off rp_filter".)
Here's how to set up ipip encapsulation in FreeBSD.
carla quiblat carlaq (at) asti (dot) dost (dot) gov (dot) ph 20 Jun 2002
First, gifs must be supported in your kernel (enable "pseudo-device gif" in your kernel config).
src_addr is the address of your NIC's interface while dest_addr is the remote side or the other end of the tunnel IP address. For example, if pc1 is one end of your tunnel and pc2 is the other end, then:
    if in pc1, you have:
        xl0: 1.1.1.1
        gif0:
    if in pc2, you have:
        de0: 1.1.2.1
        gif0:
    on pc1, do the following:
        pc1# ifconfig gif0 1.1.1.1 1.1.2.1
    on pc2, do the following:
        pc2# ifconfig gif0 1.1.2.1 1.1.1.1

You can also man gifconfig.
I haven't tried using gif interfaces for IP-in-IP tunneling. I've only used them for IPv6 in IPv4 tunneling, but you can test it.
carla quiblat carlaq (at) asti (dot) dost (dot) gov (dot) ph 30 Jun 2002
I'd just like to report that I got LVS-Tun working for a Linux(as director)-OpenBSD(as realserver). I am currently testing LVS so we could use it to loadbalance web service requests (http) over different sites (different IPs/different blocks) therefore LVS-Tun is required.
I know FreeBSD implements tunneling but I've only used it for IPv6-in-IPv4 tunneling and I didn't quite understand how tunneling in Linux worked. For example, in linux to create a tunnel, you did this:
on the director: no tunnel is created because ipvs does the encapsulation
on the realserver:
    ifconfig tunl0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110
    route add -host 172.26.20.110 dev tunl0

Basically, I understand that the tunl0 is identified with the remote tunnel end (VIP) but I don't understand the "route add" part since LVS-Tun only implements a one-way tunnel. That is, from the director to the realserver; tunneling from realserver to director is not required and seems useless. The realserver routes following its default router path direct to the client. So that's where I got stuck. "How do you say this in *BSD using the gif0 interface, the one I'm familiar with?" In the end, this is the topology we'd like to implement:
     --------
    | client |
     --------
        |
        |
     Internet
        |
        |
    LVS director, Linux
        |
        |                       ______________
        -------...tunnel.....-->(one-way-tunnel)realserver, *BSD
        |                       --------------
        |
    realserver(local-NAT), *BSD

with the tunneled packet routed normally through its routers/gateways (edge routers or other) down to the realserver.
My test setup looks like this:
    [ client with a live IP ] -------gw------eth0(10.10.8.98, DIP) [director]
                                       |     eth0:110 (VIP)
                                       |
                                       |___fxp0(10.10.8.199,RIP)[realserver]

So what I did on the OpenBSD realserver is this,

    ifconfig fxp0 10.10.8.199 netmask 255.255.255.0 up
    route add default 10.10.8.1
    ifconfig gif0 tunnel 10.10.8.199 10.10.8.98
    ifconfig lo0 _VIP netmask 255.255.255.255

10.10.8.1 is the default gateway for the private network. Notice that the tunnel endpoint is the DIP (not VIP like in Linux). This is because as I understand, the packet that arrives at the realserver (encapsulated by ipvs) has this format:

    [D|R|C|V|...payload....]

where, D - director address, R - realserver address, C - client address, and V - VIP address. Decapsulation is done by the gif0 tunnel, after that it sees that the packet is destined to itself (VIP defined at its lo0 interface) and processes it normally with source IP= client IP.
When I do "telnet VIP" from the client, I successfully enter 10.10.8.199 after the login.
Note | |
---|---|
support for ipip was removed from M$ after w2k. Paolo has a solution for non tunnelling realservers using a spanned layer2 network. |
Johan Ronkainen jr (at) mpoli (dot) fi 10 Feb 2003
It's possible with w2k Server. You'll find necessary settings under Routing and Remote Access snap-in. First create new IP Tunnel under "Routing Interfaces", then select "New Interface" under IP Routing/General and put necessary settings there.
You'll also need Loopback interface so w2k will handle packets itself and won't try to route them. Open Control Panel, click Add New Hardware, navigate thru dialogs and finally select Microsoft Loopback Adapter. If you want /32 network for loopback adapter you need to change it with regedit since GUI allows only /31. Network code itself is fine with /32 subnets.
It's been a while since I did this. We load-balanced w2k Terminal Server clients to three servers. Two were on same building as clients and third one was in different city on separate subnet. Clients connected to LVS that forwarded 2/3 of connections to local servers using LVS/DR and 1/3 to remote location using LVS/Tun via IP-tunnel. Replies were routed directly to clients.
This never went to full production and servers have been re-installed since so I can't check exact configs. It's not that hard. Just like LVS/Tun with Linux on LVS end. w2k part required bit trial and error but it's doable.
Adam Hammouda AdamMH (at) aol (dot) com 02 September 2003
I'm wondering if anyone can help me with some lvs/ipvs configuration issues regarding Windows' Real Server's and LVS-Tunneling. I have been able to setup LVS-Tun when all realservers are Linux-based, however when Windows is thrown into the equation things start to get messy. I have
- Created a new General IP Routing Tunnel (Interface) and set its local and remote addresses to the VIP and Director IP, respectively.
- configured the Microsoft Loopback Adapter to use the VIP, and set its subnet mask to 255.0.0.0 as was recommended.
Chris Chris (at) baonline (dot) co (dot) uk 03 Sep 2003
We run lvs using tunneling (ipip) with 3 windows 2000 realservers. The steps are something like:
Paolo Penzo paolo.penzo (at) bancatoscana (dot) it 26 Sep 2003
I'm using LVS on a geographical basis (DR and TUN) with both Linux and Windows 2k as realservers. Unfortunately we started to migrate Win 2k servers to Win 2003 and we discovered that IP-IP encapsulation is not supported anymore by MS servers (see http://support.microsoft.com/?id=280484), so LVS TUN configurations don't work anymore if you use win 2k3 as realserver. I'm thinking about how to overcome this problem by manually configuring IPSec tunnels or something similar... Help is welcome.
Joe: there was no answer
ipip encapsulation is used when the realservers are at a remote site. Methods of tunneling other than ipip exist (e.g. a VPN) if you need geographically remote realservers.
Richard Seabrook
Since Windows 2003 doesn't support IP-in-IP like 2000 did, what other alternatives are people using when real servers are remote from the directors?
Paolo Penzo paolo (dot) penzo (at) bancatoscana (dot) it 06 Dec 2006
We made a layer2 network spanned across geographical sites and moved to DR balancing: everything is much easier to manage!
An ipip header is added when sending packets through a tunnel. Since the MTU is fixed (1500), the extra 20-byte header reduces the allowed packet payload size. This will require fragmenting of packets larger than 1480 bytes sent from the director to the realserver in LVS-Tun. LVS (and Linux) doesn't have any special code to handle ipip fragmentation, so we should have expected LVS-Tun to fail when the client sent packets large enough to require fragmentation in the DIP->RIP hop. Either few people were using LVS-Tun in production, or clients were only sending small packets (e.g. HTTP GET) and we didn't realise for a long time that we had a problem lurking. Further below is Casey Zacek's solution for both w2k and linux. Here is Julian's description of the problem.
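The arithmetic can be made concrete with a small sketch. It assumes 20-byte IP headers and a 20-byte TCP header (no IP or TCP options); the function names are illustrative only.

```shell
#!/bin/sh
# Header sizes in bytes, assuming no IP or TCP options.
OUTER_IP_HDR=20   # IPIP header prepended by the director (DIP->RIP)
INNER_IP_HDR=20   # header of the original CIP->VIP packet
TCP_HDR=20

# max_inner_packet PATH_MTU
# Largest client packet that still fits in one DIP->RIP packet.
max_inner_packet() {
    echo $(( $1 - OUTER_IP_HDR ))
}

# max_tunnel_mss PATH_MTU
# Largest TCP payload per segment that avoids fragmentation in the tunnel.
max_tunnel_mss() {
    echo $(( $1 - OUTER_IP_HDR - INNER_IP_HDR - TCP_HDR ))
}

echo "path MTU 1500: inner packet <= $(max_inner_packet 1500), MSS <= $(max_tunnel_mss 1500)"
```

A client on ordinary ethernet normally sends segments with an MSS of 1460 (1500-byte packets), which is 20 bytes more than the tunnel can carry unfragmented.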
Julian Feb 12 2007
The client will (Joe: should?) see a "fragmentation required" icmp packet from the director, if the packet is bigger than our PMTU to RS.
Note | |
---|---|
this problem is still present and it is hard to fix (it's a bug): |
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=107757685230840&w=2 |
Without handling ICMP errors for our IPIP packets, we will not lower the PMTU just by generating IPIP traffic. But other (non ipip) protocols (packets to RS) can learn lower PMTU and update the cache. Then we can see this lower value in the routing cache and generate reply ICMPs when large packets come from the client. At least that's what I remember from before; I'm not sure if things have changed in 2.6 now.
IPIP packets are between the DIP and RIP. These packets can hit the MTU limit in all hops between director and RS. The reply packets from RS to CLIENT are another path. If a big packet from the RS to CLIENT hits a MTU limit, then our director will receive ICMP/FRAG_NEEDED from xxxHOP to VIP, which we tunnel in IPIP to RIP. Here is a simple picture showing the MTU for each hop:
                                        VIP<-CIP
                             -------------------------------
                             |                             |
           1500      1400    v     1300      1200    1100  ^   1000
    CLIENT ---> DHOP ---> DIRECTOR ---> RHOP ---> RS ---> CHOP ---> same CLIENT
               CIP->VIP            DIP->RIP              VIP->CIP

    CLIENT   - knows about DHOP and uses MTU=1500
    DHOP     - hop/router to director, knows MTU=1400 to director
    director - sees MTU=1300 to RS, knows (or doesn't know) about RS (MTU=1200)
    RHOP     - hop/router to realserver, knows about RS and uses MTU=1200
    RS       - connects to CHOP with MTU 1100
    CHOP     - hop/router to CLIENT, uses MTU 1000
The steps:
If instead the director knows about the 1200-byte limit, then any IPIP packet from director will reach RS without any ICMP replies. One way of doing this would be by setting the mtu for the route DIP->RIP. (This command sets a lower MTU for all packets, not just ipip packets.)
    director# ip route add RIP via RHOP dev DEV src DIP mtu 1200
If the RS generates a 1500-byte TCP reply packet (VIP->CIP), then CHOP will generate ICMP reply to the VIP, that should come in director, if routed properly (this packet will likely traverse the internet using a path separate to the client-director-RS-client path). On arrival at the director, the director will use icmp.c:icmp_unreach() to learn the PMTU. ip_rt_frag_needed() will save the value in the routing cache. Since the director doesn't send packets to CHOP, the problem then is how this information is used. The kernel's ICMP protocol receiver parses the information, updates PMTU in cache, but fails to deliver it to the upper layers as happens when delivering errors to sockets. This time IPVS was the sender. That is why the LOCAL_IN hook exists, where IPVS can listen for these errors, but as I said, it is difficult to generate ICMP error to send to the CIP.
Another problem is what MTU to use between RS and clients, but IPVS should properly forward (tunnel) any ICMP errors from hops between RS and (before) client to the RS (Joe: the director?). The client will never trigger an ICMP reply, which is generated only by routers. CHOP replies to the packet VIP->CIP, so this ICMP packet comes to the VIP (director) and IPVS will select the appropriate connection in the ipvsadm table, and forward the ICMP information in an IPIP packet to the RIP (as happens for the regular TCP packets from the CIP). ip_vs() used to have forwarding of ICMP from the non-error class icmp packets (e.g. ICMP ECHO), but someone dropped it from 2.6 as an unused feature.
I hope that is how IPIP setups work.
awysock (at) absoftware (dot) com 28 Nov 2003
I've set up two UM load balancers running LVS 1.0.10 and have them up and running. I'm using LVS-Tun since I rent my servers and my IP addresses are all over the place. My site deals with lots of photos, so my users are doing large POSTS along with large POSTS of text data. It seems that when the ethernet packet goes over the 1460 byte mark, some of the users fail while others (my own machines) work just fine. I have tried it on my Windows machine and my Mac and I have no problem, but when somebody elsewhere on the net does the same function they fail with a 404 or timeout error on their end. It's only some of the people; others are not having the problems. If they go directly to the server it works. So I'm guessing it's something between the LVS and the Real Servers.
I have changed the MTU value for eth0 on the director to 1400. All that does for me is make more machines (all that I have tested) suffer from the same problem. Should the MTU value be changed at different places? i.e. both ends of the tunnel? I knew that our choice to use Windows 2000, would haunt me! Does anyone know how to change the MTU for an IP tunnel in Windows 2000?
Enable PMTUDiscovery in w2k (http://insight.zdnet.co.uk/communications/networks/0,39020427,2123537-2,00.htm) and DrTCP (http://www.dslreports.com/drtcp) (Joe: presumably you want DRTCP019.exe, support for MTU set in w2k).
The MTU was originally set to 1500 on all machines. Most machines worked but some would not when posting large amounts of data.
- When I set the MTU for all interfaces on the director to 1400 and leave the MTU for the tunnel untouched at 1500, all machines would fail.
- When I set the MTU for all interfaces on the director to 1400 and set the MTU for the tunnel at 1400, all machines would fail.
- With the MTU for the tunnel set to 1400. I can set the MTU for the director to anywhere between 1420 - 1500 before it fails with all machines.
- The largest packet I can transmit on the ISP's network without it fragmenting is 1472 although they claim their MTU is 1500. (ping -f -l 1472 www.linux.org works but anything bigger does not)
This makes no sense to me. The only way I can think this is correct is if: Maximum packet size (without a tunnel) between director and realservers is 1500. If the header for IPIP tunnel is about 20 bytes, then the maximum packet size for packets within the tunnel is 1480. Therefore, the MTU for the director must be at least 20 more than the MTU for the tunnel. So why does using 1400 everywhere make it all fail, but 1500 everywhere only fail on some machines?
What can I set the MTU values to in order to guarantee it working with all clients? Most of our clients have no technical knowledge and this is becoming a nightmare!
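A note on the ping figure in the list above: the size given to ping is the ICMP data size, and the packet on the wire is 28 bytes larger (20-byte IP header plus 8-byte ICMP header). A 1472-byte ping that passes unfragmented is therefore what an MTU of 1500 predicts. The sketch below shows the conversion; the function names are illustrative only.

```shell
#!/bin/sh
IP_HDR=20     # IP header without options
ICMP_HDR=8    # ICMP echo header

# mtu_for_ping_payload SIZE
# Size of the IP packet produced by a ping with SIZE bytes of data.
mtu_for_ping_payload() {
    echo $(( $1 + IP_HDR + ICMP_HDR ))
}

# ping_payload_for_mtu MTU
# Largest ping data size that fits in one packet of the given MTU.
ping_payload_for_mtu() {
    echo $(( $1 - IP_HDR - ICMP_HDR ))
}

echo "ping data 1472 -> packet of $(mtu_for_ping_payload 1472) bytes"
echo "largest ping data through an MTU of 1480: $(ping_payload_for_mtu 1480)"

# Probing by hand with the don't-fragment bit set:
#   Windows:          ping -f -l 1472 HOST
#   Linux (iputils):  ping -M do -s 1472 HOST
```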
Horms 30 Nov 2003
typically the MTU used is 1500 bytes. But when tunnels come into play then this becomes slightly smaller because of the overhead for the tunnel. This should not be an issue but in practice it often makes sense to manually set the MTU to the smaller value on applicable interfaces.
... or the mtu of the tunnel's routing entity for that matter. This is faster and less intrusive than adjusting down the whole physical interface's mtu. I use it for boxes where I have dozens of VPN tunnels over a physical interface, but also non-tunneled traffic.
Joe
Note | |
---|---|
Ratz is saying to change the MTU not for the interface (which will affect all routes through that interface), but only for the route. Presumably the route is DIP->RIP (the packet on arrival at the RIP is decapsulated to the packet with dest_addr=VIP). (Feb 2007 - Ratz posted that he got the idea from off-line discussions with Julian. But Ratz gets the credit for telling us about it.) |
Roberto Nibali ratz (at) drugphish (dot) ch 01 Jun 2004
You can set the mtu for a route to/from the VIP. You must of course pay attention to route selection which can be investigated with ip rule/ip route or the shell tools I've written to display routing tables. So you might need to put the VIP route into a special routing table which gets parsed before the other routes. Also don't forget to flush the routing cache.
Joe: in principle this is easy to do, but no-one has done it yet. The ipip packet from the director to the realserver is DIP->RIP. Ideally you would only want to change the mtu for the ipip packets to the RIP (or to the RIP network), so that other packets to the RIP (e.g. logging, administration) have standard MTUs. As well we aren't sure yet whether PMTU works, even if we do change the mtu for the DIP->RIP (someone could look in the code). Here's how Ratz changes the MTU for the default route.
Ratz 05 Feb 2007
Here I add a default route to a new table and change the default mtu. Basically you can use the "change" keyword in conjunction with the "mtu" selector on the specific route.
    root@laphish2:~# ip route help
    Usage: ip route { list | flush } SELECTOR
           ip route get ADDRESS [ from ADDRESS iif STRING ]
                                [ oif STRING ] [ tos TOS ]
           ip route { add | del | change | append | replace | monitor } ROUTE
    SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]
                [ table TABLE_ID ] [ proto RTPROTO ]
                [ type TYPE ] [ scope SCOPE ]
    ROUTE := NODE_SPEC [ INFO_SPEC ]
    NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]
                 [ table TABLE_ID ] [ proto RTPROTO ]
                 [ scope SCOPE ] [ metric METRIC ]
                 [ mpath MP_ALGO ]
    INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...
    NH := [ via ADDRESS ] [ dev STRING ] [ weight NUMBER ] NHFLAGS
    OPTIONS := FLAGS [ mtu NUMBER ] [ advmss NUMBER ]
               [ rtt NUMBER ] [ rttvar NUMBER ]
               [ window NUMBER] [ cwnd NUMBER ] [ ssthresh NUMBER ]
               [ realms REALM ]
    TYPE := [ unicast | local | broadcast | multicast | throw |
              unreachable | prohibit | blackhole | nat ]
    TABLE_ID := [ local | main | default | all | NUMBER ]
    SCOPE := [ host | link | global | NUMBER ]
    FLAGS := [ equalize ]
    MP_ALGO := { rr | drr | random | wrandom }
    NHFLAGS := [ onlink | pervasive ]
    RTPROTO := [ kernel | boot | static | NUMBER ]

    root@laphish2:~# ip route show
    192.168.1.0/24 dev eth1 proto kernel scope link src 192.168.1.32
    default via 192.168.1.1 dev eth1
    root@laphish2:~# ip rule show
    0: from all lookup local
    32766: from all lookup main
    32767: from all lookup default
    root@laphish2:~# ip rule add from 10.0.0.0/16 table 33 prio 100
    root@laphish2:~# ip rule show
    0: from all lookup local
    100: from 10.0.0.0/16 lookup 33
    32766: from all lookup main
    32767: from all lookup default
    root@laphish2:~# ip route add default via 192.168.1.1 dev eth1 table 33
    root@laphish2:~# ip route show table 33
    default via 192.168.1.1 dev eth1
    root@laphish2:~# ip route change default via 192.168.1.1 mtu 1000 table 33
    root@laphish2:~# ip route show table 33
    default via 192.168.1.1 dev eth1 mtu 1000

    Cleanup the stuff:

    root@laphish2:~# ip route flush table 33
    root@laphish2:~# ip rule del prio 100
    root@laphish2:~# ip rule show
    0: from all lookup local
    32766: from all lookup main
    32767: from all lookup default
Jacob Coby jcoby (at) listingbook (dot) com 01 Dec 2003
Decreasing the MTU with this bug only causes more problems; it causes the packets to fragment MORE often. When I had the issue, I could decrease the MTU to 200 bytes, and the connection would fail at a payload of ~160 (20b for the IP header, 20b for the IPIP header), even with non-tcp data, like ping.
Julian 28 Nov 2003
try LVS with 2.4.23 as it contains a fix for packets longer than mtu.
(and later) Julian Anastasov 24 Feb 2004, 29 May 2004
There is only one remaining problem related to LVS-TUN: there is no handling of ICMP errors being received on a local IP after being returned from somewhere in the path (DIP->RIP) coming back to the DIP and containing the reply to tunneled packet (e.g. a frag_needed message and carrying the first few bytes of the packet). We do not relay these messages, generated between the director and the realserver, back to the client. The correct target for the ICMP message depends: the director is sending 20 bytes more (the ipip overhead), and if this is causing the ICMP message, then the client need not receive the ICMP message in all cases. The client should only receive an ICMP message if the director detects a lower PMTU. While TCP and UDP handle ICMP errors, IPIP does not handle them well. The LVS-DR and LVS-NAT forwarding preserve the sender's IP in which case ICMP traffic from realservers (or hosts before realservers) is always returned to the client. But if LVS-Tun is used, the ICMP packets are not returned to the client.
If the only traffic from the director to the LVS-Tun realservers is IPVS traffic, then the routing cache does not receive the PMTU info from ipip_err() and we don't learn the correct path MTU to the realserver. Then, on forwarding packets, the IPVS code cannot detect that the path has lower PMTU. But this is theory, not really tested. Maybe we can update the PMTU in the routing cache by listening to these ICMP errors in LOCAL_IN? Needs experiments and time for fixing, patches are welcome.
There is no such thing as an MTU for ipip with IPVS. IPVS extends the packet with 20 bytes by prepending IPIP header and ignores the mtu. IPVS has its own encapsulation and uses the route to the RIP (you do not need to configure a tunl0 device on the director).
I would love to upgrade the Kernel (currently 2.4.20) but that is not an option as a quick fix at the moment. - Live environment and the like.
This time the fix is not in the IPVS code (see the kernel bug list http://linux.bkbits.net:8080/linux-2.4/hist/net/ipv4/ip_output.c?nav=index.html|src/.|src/net|src/net/ipv4). The problem is that skb->nfcache is not copied on [re]fragmentation. Here's a posting and patch by Julian to the linux-netdev mailing list (http://marc.theaimsgroup.com/?l=linux-netdev&m=106589293316918&w=2).
But we need to see your tcpdump output first because the PMTUD (path MTU discovery) is usually enabled.
Joe
IPIP is a one-way channel (packets don't come back?) and PMTUD doesn't work?
Julian
The director still can receive ICMP errors with the source somewhere between the LVS-Tun realserver and the director.
Chris Paul
The problem is I can not reproduce the error. We only have a small number of non technical customers who are having trouble, but I can only go so far when it comes to asking them to debug our services.
Julian 01 Dec 2003
Then your problem is related to the client-director PMTU. I understand that it can be difficult to trace an unknown client, but do you have some kind of ICMP filtering between clients and the director?
The problem came up again in May 2004 (when the current kernel was 2.4.26).
Casey Zacek cz (at) neospire (dot) net 26 May 2004
The problem, as described by one of my customers, is this (the customer is running phpBB on 3 Linux/Apache servers with an LVS-Tun setup):
For very few users, when they post long posts (anything over a few lines) and hit submit, the browser appears to hang and finally it times out. Similar effects if they try and update their profiles. I even experienced this on my home computer. I use a proxy server sometimes and it showed the request being transmitted from my computer but ultimately no response was received from the site. Now, in most instances of this, we have found that the affected users are on broadband using a router of some type. I myself use a cable modem connected through a Linksys Router. When I experienced the issue, I was able to post from work, but not from home. I fiddled with my setup, thinking it was cookies or caching of some type and ultimately performed a firmware upgrade on my router. Suddenly the problem went away.
At the time, I was running kernel 2.4.25 (IPVS 1.0.10), but since upgraded to 2.4.26 (IPVS 1.0.11), then 2.6.6 (IPVS 1.2.0). I have asked the customer to retest it, but he'll have to talk to some of his users, from the sound of things, since he upgraded his router firmware. I'd love to chalk it up to "client router problems," but that probably won't be good enough for this customer. The customer's setup worked using a Riverstone smartswitch router running what equates to LVS-NAT, but it does not work with this LVS-Tun setup.
With all three versions, I get a lot of these messages:
IPVS: ip_vs_tunnel_xmit(): frag needed
Julian
This message means that the IPVS director is generating ICMP errors to request that the client reduce the packet size. Maybe these ICMP messages are filtered somewhere and do not reach the client.
I have a step-by-step howto for TUN setups: http://www.ssi.bg/~ja/TUN-HOWTO.txt
Note | |
---|---|
Joe: This URL doesn't directly address the mtu problem. It checks the encapsulation and routing. |
Joe
Why is the default MTU for ipip packets 1480, rather than 1500+overhead_for_ipip=1520? Is 1500 a hardware buffer size limit in the NICs? (i.e. hardware buffer=1500?)
Julian
I don't know the origin of the 1500 limit. Maybe it is a balance between link sharing and protocol header overhead.
There is a convention in IPv4 to reply with an ICMP error if a packet with DF flag set reaches a smaller pipe (i.e. packet length > PMTU). If the DF flag is not set, the packet is fragmented into MTU-sized fragments.
Note | |
---|---|
For an explanation of PMTU and the DF flag, see PMTU - Path MTU Discovery (http://www.netheaven.com/pmtu.html). |
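The DF convention Julian describes can be written down as a small decision function. This is a minimal sketch, not router code: `forward_action` is a made-up name, and the fragment count assumes a 20-byte IP header with no options.

```shell
#!/bin/sh
# What happens to an IPv4 packet that meets a link with a smaller MTU.
# Assumption: 20-byte IP header, no options.
IP_HDR=20

forward_action() (
    # $1 = total packet length, $2 = MTU of the outgoing link, $3 = DF flag (1 or 0)
    len=$1; mtu=$2; df=$3
    if [ "$len" -le "$mtu" ]; then
        echo "forward"
    elif [ "$df" -eq 1 ]; then
        echo "drop, send ICMP frag-needed (mtu $mtu)"
    else
        # Every fragment except the last carries a payload that is a multiple of 8 bytes.
        per_frag=$(( (mtu - IP_HDR) / 8 * 8 ))
        payload=$(( len - IP_HDR ))
        count=$(( (payload + per_frag - 1) / per_frag ))
        echo "fragment into $count packets"
    fi
)

forward_action 1492 1500 1   # fits, forwarded unchanged
forward_action 1512 1500 1   # too big with DF set
forward_action 1512 1500 0   # too big without DF
```

The 1512-byte case is a 1492-byte client packet after the 20-byte IPIP header has been prepended.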
There can be many problems related to MTU:
Chris Paul Chris (at) baonline (dot) co (dot) uk 27 May 2004
The problem is caused by the linux kernel not taking into account the size of the ipip tunnel headers when sending traffic over an ipip tunnel.
Basically, the MTU (the largest size packet that can be sent over a network) is normally 1500 bytes. With the IP header information this drops to 1492, so the largest size of packet that can be sent over an IP link before the packet gets split into multiple packets is 1492 bytes. When you use ipip tunneling, there is an additional header that takes the maximum transmission size through the link to something like 1480. Linux kernel 2.4.??? does not take into account this additional header and sets the mtu for the ipip tunnel to 1492. So if you send a packet that is between 1480 and 1492 bytes, it gets truncated rather than split into multiple packets. The ipip tunnel destination then waits to receive the rest of the packet, which never arrives. The result is that the server never responds.
When I was having this problem, it was a nightmare because you can not guarantee it will fail. It only fails when the packet size is very specific and the size of the header is also large. To fix this you can either.
I solved it by changing the MTU values, but it was nearly a year ago now and I can't remember exactly which ones I changed, i.e., the RIP on the director, the tunnel from the director, or the tunnel from the realserver.
Chris Paul Chris (at) baonline (dot) co (dot) uk 27 May 2004
You have to change the mtu value on the end of the IP tunnel that initiates the tunnel, i.e. the realserver (in this instance, a w2k box). This value should be close to the mtu value of the physical interface it is going through, but small enough to ensure there is enough space left for the ipip header. We use 1400 and have never had any reports of it failing. To do this, go to the registry and add a dword entry called MTU with the decimal value 1400 (safe) into
hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of ip tunnel} |
reboot
Note | |
---|---|
with w2k, XP, you can "Restart Networking" |
Note | |
---|---|
Joe - see tunl MTU solved for Casey Zacek's modification of this method. |
If the mtu is not set, you get lots of IPVS: ip_vs_tunnel_xmit(): frag needed messages logged to the console and connections hang.
Joe
I would have thought you'd set the mtu at the director end. Presumably if any end of the segment has a reduced mtu, then both ends of the segment should be notified about it.
Julian
This message means "fragmentation needed but DF flag set". ip_vs_tunnel_xmit() tries to prepend IPIP header, but notices that the resulting packet with DF flag set will exceed the PMTU(director->RS) limit, so it generates an ICMP error instead of xmit-ing the packet to the RS.
Joe
What messages would you get if the icmp problem was about the link between the tunnel realserver and the director?
Messages of the same type, but we haven't handled this case yet. In the meantime, setting the proper PMTU in the director for the route to the realserver is a good idea.
Jacob Coby jcoby (at) listingbook (dot) com 27 May 2004
I could test for the problem reliably by using ping with packet_size>934 (934 and lower worked fine). Once I bumped it up over 934, I'd see Must Fragment (MF) ICMP messages being sent, and the ping request would have no response. As I lowered the MTU, the size of the ping that would cause the problem lowered in direct proportion. A 1500 MTU would cause a 935 byte ping to fail, a 1400 MTU would cause a 835 byte ping to fail, and so on. Any HTTP GET or POST over that 934 byte payload would cause the site to not respond.
Chris Paul
Where are you setting your mtu of 1400? You have to make sure that it is the mtu for data inside the tunnel. When I changed the mtu values, the only way I could reliably get it to change the size inside the tunnel rather than the whole tunnel packets was from the realserver not the director.
Jacob Coby
I have no idea which MTU I was setting. I could get the problem to go away for one or two times, and then it would come back. It's been over a year since I messed with LVS-TUN, and I'm now running LVS-DR.
Peter Mueller pmueller (at) sidestep (dot) com 27 May 2004
I've heard people in poptop use this hack. Maybe you can modify it for your use in this situation. If it works, I like this solution better than a change to the MTU on the interface.
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300 |
Note | |
---|---|
Joe: MSS is maximum segment size, i.e. the payload in the packet, rather than the packet size (which is set by mtu). |
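The relation between MTU and MSS in Joe's note can be illustrated with a little arithmetic. This is only a sketch: the function names are made up, and the sizes assume 20-byte IP and TCP headers with no options.

```shell
#!/bin/sh
# Relation between link MTU and TCP MSS, with and without IPIP encapsulation.
# Assumption: 20-byte IP and TCP headers (no options).
IP_HDR=20
TCP_HDR=20
IPIP_HDR=20

mss_for_mtu() {
    # MSS a host would normally advertise for a link of MTU $1
    echo $(( $1 - $IP_HDR - $TCP_HDR ))
}

tunnel_safe_mss() {
    # Largest MSS whose full-sized packets still fit in MTU $1 after the
    # director has prepended the 20-byte IPIP header
    echo $(( $1 - $IPIP_HDR - $IP_HDR - $TCP_HDR ))
}

echo "plain MSS for mtu 1500:       $(mss_for_mtu 1500)"
echo "tunnel-safe MSS for mtu 1500: $(tunnel_safe_mss 1500)"
```

With a 1500-byte path this gives 1460 and 1440. Any `--set-mss` value at or below the second figure, such as the 1300 in the rule above, leaves room for the IPIP header.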
Julian 3 Jun 2004
Ratz's work around should work, or you can hope that other traffic between director and RS will update the PMTU in the routing cache.
Joe
if you put a tunl0 device on the director, would it receive the PMTU packets back from the realserver?
Yes, it can look into the ICMP errors that include the IPIP header, but the current version of ipip.c does not update the RIP's PMTU. Another option is for IPVS to do it in LOCAL_IN. The VIP plays no part here. The forwarded traffic in the director is routed to daddr=RIP (as for the other forwarding methods). Only the clients need a route to the VIP.
Joe
I was thinking to reduce the MTU for the CIP-VIP segment, then there would be no problem in the DIP-RIP segment. Is this a way of handling it?
This is another solution. Just keep PMTU(CIP->VIP) <= PMTU(DIP->RIP) + 20. I'm not sure you can do it for every client. Maybe it can be in the default route :)
To use Ratz's work-around, you set the PMTU for packets going to the RIP (via eth0 on the director, there being no tunl0 devices on the director). If it is set to 1500 then you do not need such a route, as IPVS reports the PMTU reduced by 20 (here 1480) when generating the ICMP error to the client. So, if the PMTU to the RIP is X, or the RS sends an ICMP error to the director notifying it of PMTU=X, then IPVS will report PMTU=(X-20) to the client.
OTOH, maybe it is not so difficult to check in LOCAL_IN for any FRAG_NEEDED errors, and if they reduce the PMTU for the RIP we can update the routing cache. We need to investigate whether we can easily find that such an error is for one of our TUN RIPs.
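The behaviour Julian describes for the director can be modelled in a few lines. This is a toy model of the arithmetic only, not the kernel's ip_vs_tunnel_xmit() code. The function name and output strings are made up, and what happens to an oversized packet without DF is an assumption.

```shell
#!/bin/sh
# Toy model of the director's decision when forwarding to an LVS-Tun realserver.
# Only the arithmetic from the discussion is modelled, not the kernel code.
IPIP_HDR=20

tunnel_xmit() (
    # $1 = length of the client packet, $2 = DF flag (1 or 0),
    # $3 = PMTU of the route to the RIP
    len=$1; df=$2; pmtu=$3
    outer=$(( len + IPIP_HDR ))
    if [ "$outer" -le "$pmtu" ]; then
        echo "send $outer byte ipip packet to realserver"
    elif [ "$df" -eq 1 ]; then
        # The client is told the PMTU minus the IPIP overhead.
        echo "icmp frag-needed to client, mtu $(( pmtu - IPIP_HDR ))"
    else
        # Assumption: without DF the oversized ipip packet gets fragmented.
        echo "send $outer byte ipip packet, fragmentation needed"
    fi
)

tunnel_xmit 1480 1 1500
tunnel_xmit 1492 1 1500
```

The second call prints "icmp frag-needed to client, mtu 1480", i.e. X-20 for X=1500.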
What you can do on the director when using LVS-TUN:
run tcpdump and check for any received or generated ICMP errors
The PMTU is not updated in the routing cache if the director receives ICMP_FRAG_NEEDED. This is easy to detect and to solve. The good news is that you can detect it from any client: send a large file and tcpdump for ICMP errors coming from the realservers to the director. If this is the case (PMTU to RIP is lower than outdev's MTU) then you can try to specify the pmtu in a special route to the RIP. Once the director knows the right PMTU to the RIP, it will report it to every client that violates it. There is no need for IPVS to relay the ICMP error coming from the RIP to the client; we just know how to generate it on each request from the client. The only benefit would come if ipip.c were patched to update the PMTU in the routing cache, avoiding the need for a special route to the RIP.
Joe
what is the MTU doing in the output of ip addr show dev tunl0 when you have a tunl device on a machine? I can set it (can't I?). Is the mtu meaningless, ignored, what?
It is ignored for IPVS traffic, IPVS has its own encapsulation and uses the route to RIP (you do not need to configure tunl0 in director). The tunl0 device is usually needed to receive IPIP packets, so in normal cases you do not need such interface in director even when using TUN realservers. The PMTU setting must be for the route to RIP. Such setting (and special route to daddr=RIP) can be needed only if PMTU to RIP is less than the outdev MTU.
So with regular ipip tunneling (not ipvs) you only need the tunl0 device on the receiving end? The only reason you need a tunl0 device on the transmitting end is to handle the packets that reply?
For regular ipip purposes tunl0 can be used both for send and for receive. IPVS simply knows how to create ipip packets without using the ipip code.
Note | |
---|---|
Casey's solution is run on the realserver. Presumably a similar solution could be found for the director. Ratz's method of setting the mtu for the route rather than the interface runs on the director. |
Casey Zacek cz (at) neospire (dot) net 2005/03/11
I've emailed about this before, and nothing we ever came up with really worked. The real problem I've always had is that I've never had a means for duplicating it (possibly because I didn't fully understand the problem -- I can probably duplicate it at will now), and my customers have eventually just either accepted it and moved on or changed to an LVS-NAT environment. I finally came across someone whose home network was set up in such a way as to experience the "problem", so I decided to figure it out once and for all and hopefully end all the confusion. Attached is a piece of PHP (lvs-tun-test.php) that'll duplicate the problem. The "submit" query will time out if you are experiencing the problem.
Matthew Boehm matthew (at) matthewboehm (dot) com 6 Jan 2007 (and Casey).
Note | |
---|---|
With IE6/7: When you submit the POST, the page just reloads (Matthew) or hangs/times out with no data posted (Casey). With Firefox/Netscape: You get a "Bad Request" page. |
Cut here --- lvs-tun-test.php -------------------------------------- <html> <head> <title>big POST test</title> </head> <body> <?php echo $HTTP_POST_VARS['test']; ?> <form action="lvs-tun-test.php" method="POST"> <textarea name="test" cols="100" rows="10"> alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf 
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf </textarea> <input type="submit"> </form> </body> </html> Cut here ----------------------------------------------------------- |
In order to force yourself to experience the problem, you need to forcefully ignore icmp fragmentation-needed packets. I am able to do that on my home network with a simple iptables rule on my firewall:
iptables -I FORWARD -p icmp --icmp-type fragmentation-needed -j DROP |
Now, I browse the lvs-tun-test.php through LVS-Tun, and click submit, and it just hangs and times out. tcpdump shows the expected results. Then I change the MTU on the loopback interface on the realserver (It's a w2k box) using regedit, then disable and re-enable the loopback adapter via the network properties, then click submit again. Poof, it works.
tcpdump is my friend. I started out running tcpdump on the director:
23:13:52.804610 IP (tos 0x0, ttl 116, id 26413, offset 0, flags [DF], length: 48) CIP.60964 > VIP.80: S [tcp sum ok] 3288780265:3288780265(0) win 65535 <mss 1452,nop,nop,sackOK>
23:13:52.810423 IP (tos 0x0, ttl 116, id 26415, offset 0, flags [DF], length: 40) CIP.60964 > VIP.80: . [tcp sum ok] 3288780266:3288780266(0) ack 2303765635 win 65535
23:13:52.813943 IP (tos 0x0, ttl 116, id 26416, offset 0, flags [DF], length: 602) CIP.60964 > VIP.80: P [tcp sum ok] 0:562(562) ack 1 win 65535
23:13:52.820802 IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 562:2014(1452) ack 1 win 65535
23:13:52.820887 IP (tos 0xc0, ttl 64, id 25185, offset 0, flags [none], length: 576) VIP > CIP: icmp 556: VIP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . 562:2014(1452) ack 1 win 65535
23:13:52.827175 IP (tos 0x0, ttl 116, id 26419, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 2014:3466(1452) ack 90 win 65446
23:13:52.827251 IP (tos 0xc0, ttl 64, id 25186, offset 0, flags [none], length: 576) VIP > CIP: icmp 556: VIP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26419, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . 2014:3466(1452) ack 90 win 65446
23:13:52.833420 IP (tos 0x0, ttl 116, id 26420, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 3466:4918(1452) ack 90 win 65446 |
The pattern of a TCP [DF] packet from CIP->VIP (packet length 1492, which is too big) followed by IPVS's ICMP response repeats until the request eventually times out. This message is generated every time one of the ICMP responses is sent:
IPVS: ip_vs_tunnel_xmit(): frag needed |
The problem comes when the ICMP host-unreachable (change MTU) packets are ignored/dropped and not acted-upon by the client. This is a more common situation than I thought would be the case.
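When hunting for this in a long capture, it can help to pull just the advertised MTU out of the "need to frag" lines. The following is a minimal sketch. `frag_needed_mtus` is a made-up helper, the sample line is shortened from the trace above, and other tcpdump versions may format the message differently.

```shell
#!/bin/sh
# Pull the advertised MTU out of tcpdump's "need to frag" lines.
# Assumption: the message format matches the trace shown in this section.
frag_needed_mtus() {
    # Reads tcpdump text on stdin, prints one MTU per frag-needed message.
    sed -n 's/.*need to frag (mtu \([0-9][0-9]*\)).*/\1/p'
}

sample='23:13:52.820887 IP (tos 0xc0, ttl 64, id 25185, offset 0, flags [none], length: 576) VIP > CIP: icmp 556: VIP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492)'

printf '%s\n' "$sample" | frag_needed_mtus
```

On a live director the same filter could be fed from something like `tcpdump -l -n -v -i eth0 icmp`, where the interface name is an assumption.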
A few hours of debugging later, I realized that the SYN+ACK packet, the response from the real server to continue the connection handshake, is missing. Duh. I moved my tcpdumping to a tap in the network that I knew would get all of the traffic. The SYN+ACK packet establishes the MSS (max segment size -- the data segment size for the packets for this connection) to 1452, just as the client machine requests (the first packet in the earlier trace).
Duh! I had read all the stuff on the URL above, and the posting by Chris Paul comes closest to describing the solution:
In reality, it's not "the end of the IP tunnel that initiates the tunnel" because the tunnel interface on the w2k box doesn't initiate anything -- it only receives forwarded traffic from the director. What he really means is "the interface on the real server that is handshaking the TCP connection with the client." The goal is to get the client to send smaller packets so that they'll make it on to the realserver.
CLIENT sends SYN to DIRECTOR
DIRECTOR encapsulates SYN packet in IPIP tunnel; sends to REALSERVER
REALSERVER receives SYN packet on LOOPBACK interface
REALSERVER sends SYNACK to CLIENT from LOOPBACK interface w/ MSS=1452
CLIENT sends ACK to DIRECTOR, on to REALSERVER
REALSERVER responds to CLIENT from LOOPBACK
repeat until dead |
So, we have to change that MSS that gets sent back from realserver to client. That is, set the MTU on the loopback interface on the w2k box. The solution is to do exactly what Chris Paul said, except change from:
hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of ip tunnel} |
to:
hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of MS Loopback Adapter} |
After all, if you set an MTU in the IP tunnel interface this way, it won't be there after you reboot, I've found. Oh, and 1480 is the magic number. 1400 is safe, but 1480 works. Any higher than that, and it doesn't work as desired.
So I went to investigate how to do the same thing on my Linux real servers, only to find that the tunl0 interface, which is the connection endpoint for Linux realservers, already has an MTU of 1480. I don't know when that got fixed, but I guess I won't worry about it.
(later) I was wrong; here's the fix for Linux realservers:
iptables -A OUTPUT -s VIRTUAL-IP -p tcp -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j TCPMSS --set-mss 1440 |
Tested, tcpdumped, works. Now I have no more 'IPVS: ip_vs_tunnel_xmit(): frag needed' messages. (At least for now. We'll see if I'm wrong tomorrow.)
Chris Paul, 11 Mar 2005
Isn't this fixed in Kernel 2.6 anyway?
Casey Zacek cz (at) neospire (dot) net
I really don't think it's possible to fix this on the director (and my directors are running 2.6.11 anyway -- and it's not fixed there). The closest way I could think of was to ignore the DF flag in the incoming TCP packets and just fragment them anyway.
Casey Zacek cz (at) neospire (dot) net 2005/04/12
It's not fixed in 2.6; I still need the iptables rule to set the mss
# iptables -A OUTPUT -s VIRTUAL-IP -p tcp -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j TCPMSS --set-mss 1440 |
Note | |
---|---|
Joe: we don't know why this works |
Julian Feb 2007
Huh, I don't know why, maybe because there is such a limit somewhere in the path from RS to client. The path from RS to client is no different between realservers in DR or TUN mode; they both send a normal reply from VIP to CIP, and no IPIP is involved there. Maybe there is a problem with a CHOP that cannot route ICMP to the VIP properly.
J (dot) Libak (at) sh (dot) cvut (dot) cz 07 Dec 2006
Today I ran into an MTU problem with LVS-Tun. Small packets were forwarded to the realservers without problems, but the bigger ones weren't, and TCP retransmissions occurred. I noticed the problem disappeared when I switched to LVS-DR, which gave me a hint as to where the problem might be. An MTU of 1480 had to be set on the outgoing interface of the realservers, with tunl0 having the standard 1500. The directors have 1500 on all interfaces. This way the TCP SYN-ACK contained the correct MTU and the client no longer sent big packets that were discarded on the director. The IP header is 20 bytes long, so 1480 is the maximum value that works.
Note | |
---|---|
This works on the realserver, but not on the director. We don't know why it doesn't work on the director and we're not really sure why it works on the realserver either. |
With Casey having a suitable test setup, we asked him to test setting the MTU by route using Julian's suggestion of
director# ip route add RIP via RHOP dev DEV src DIP mtu 1440 |
Casey Zacek cz (at) neospire (dot) net 14 Feb 2007
Nope. Doesn't work. Here's tcpdump running on the realserver showing the first packet back to the client, which negotiates the MSS for the connection.
21:52:37.819770 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], length: 48) \
    VIP.80 > ENDUSER.1276: S [tcp sum ok] 2051800163:2051800163(0) \
    ack 1809535240 win 5840 <mss 1460,nop,nop,sackOK> |
That "mss 1460" needs to be "mss 1440". That's the secret magic key to the universe.
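A quick way to check a captured SYN or SYN-ACK line against that target is sketched below. `advertised_mss` and `check_mss` are made-up helpers, and the limit of 1440 is simply the value Casey reports as working.

```shell
#!/bin/sh
# Check the MSS advertised in a tcpdump SYN/SYN-ACK line against a limit.
# Assumption: 1440 is the largest MSS that works through the tunnel.
MSS_LIMIT=1440

advertised_mss() {
    # Reads tcpdump text on stdin, prints the value of each "mss N" option.
    sed -n 's/.*<mss \([0-9][0-9]*\)[,>].*/\1/p'
}

check_mss() (
    # $1 = one line of tcpdump output
    mss=$(printf '%s\n' "$1" | advertised_mss)
    if [ -z "$mss" ]; then
        echo "no mss option found"
    elif [ "$mss" -gt "$MSS_LIMIT" ]; then
        echo "mss $mss is too large for the tunnel"
    else
        echo "mss $mss is ok"
    fi
)

check_mss 'VIP.80 > ENDUSER.1276: S [tcp sum ok] 2051800163:2051800163(0) ack 1809535240 win 5840 <mss 1460,nop,nop,sackOK>'
```

For the SYN-ACK in the trace above this prints "mss 1460 is too large for the tunnel".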
I got some of these when I blocked icmp-type fragmentation-needed to my workstation, with logging:
IN=eth0 OUT= MAC=00:18:8b:74:d1:98:00:06:5b:3a:9f:0b:08:00 \
    SRC=66.111.105.216 DST=10.3.3.10 LEN=576 TOS=0x00 PREC=0x00 TTL=62 ID=32755 \
    PROTO=ICMP TYPE=3 CODE=4 [SRC=10.3.3.10 DST=66.111.105.216 LEN=1500 TOS=0x00 \
    PREC=0x00 TTL=62 ID=42460 DF PROTO=TCP SPT=45445 DPT=80 WINDOW=114 RES=0x00 ACK URGP=0 ] MTU=1420 |
And my page request just waited and waited (Firefox 2.0). When I flushed the icmp-type fragmentation-needed DROP rules and submitted the page again, it went through instantly. I also tried with
director# ip route add RIP via RHOP dev DEV src DIP mtu lock 1440
                                                        ^^^^ |
This also did not work.
Julian
To tell if this is a PMTU problem (rather than we haven't figured out the correct ip route command), one should check all steps with tcpdump in all boxes, icmp, tcp.
Now, I can make it work if I do this on the real server:
So, at least it doesn't require iptables. Also, this doesn't cover any client machine that is not reached via the default route. Instead you'd need something more like this:
RS# ip route add table 42 to LOCALNET/xx dev LOCALDEV advmss 1440
RS# ip route add table 42 to default via DEFAULTGW advmss 1440
.. more entries for any static or other routes ..
RS# ip rule add from VIP table 42 priority 42 |
In most cases, though, these two routes and one rule will cover it. I think I prefer using iproute to using iptables, as iptables tends to be more volatile in my environments.