37. LVS: Weird hardware (and software)

37.1. Arp caching defeats Heartbeat switchover

Claudio Di-Martino claudio (at) claudio (dot) csita (dot) unige (dot) it

I've set up an LVS using direct routing, composed of two linux-2.2.9 boxes with the 0.4 patch applied. The load balancer acts as a local node too. I configured mon to monitor the state of the services and update the redirect table accordingly. I also configured heartbeat so that when the load balancer fails, the second machine takes over the virtual ip, sets up the redirect table and starts mon. When the load balancer restarts, the backup reconfigures itself as a realserver: it drops the interface alias that carries the virtual ip, stops mon and clears the redirect table. Although both machines are configured correctly, control fails to return to the load balancer because of arp caching problems.

It seems that the local gateway keeps routing requests for the virtual ip to the backup load balancer. Sending gratuitous arp packets from the load balancer has no effect, since the interface of the backup is still alive and responding.

Has anyone encountered a similar problem and is there a hack or a proper solution to take back control of the virtual ip?

Antony Lee AntonyL (at) hwl (dot) com (dot) hk

I am new to LVS and I have a problem setting up two LVSes for failover. The problem is related to the caching of the primary LVS's MAC address by the realservers and by the router connected to the Internet. It leaves all Internet connections stalled until the ARP cache entries in the web servers and the router expire. Can anyone help solve the problem by making some changes in LVS? (I am not able to change the router's ARP cache time: the router is owned by the web hosting company, not by me.)

In each LVS there are two network cards installed. eth0 is connected to a router which is connected to the Internet. eth1 is connected to a private network on the same segment as the two NT IIS4 realservers.

The eth0 of the primary LVS is assigned an IP address 202.53.128.56
The eth0 of the backup LVS is assigned an IP address 202.53.128.57
The eth1 of the primary LVS is assigned an IP address 192.128.1.9
The eth1 of the backup LVS is assigned an IP address 192.128.1.10

In addition, both the primary and backup LVS have IPV4 FORWARD and IPV4 DEFRAG enabled. The following command was also added to /etc/rc.d/rc.local:

ipchains -A forward -s 192.168.1.0/24 -d 0.0.0.0/0 -j MASQ
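(For reference, on a 2.2 kernel those two settings are normally turned on through /proc from the same rc.local; a minimal sketch, assuming the 2.2 proc paths - ip_always_defrag went away in 2.4:)

# enable forwarding and always-defragment (2.2 kernels), e.g. from /etc/rc.d/rc.local
echo 1 > /proc/sys/net/ipv4/ip_forward
echo 1 > /proc/sys/net/ipv4/ip_always_defrag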

I use piranha to configure the LVS so that the two LVSes share a common IP address, 202.53.128.58, on eth0 as eth0:1, and a common IP address, 192.128.1.1, on eth1 as eth1:1.

The pulse daemon is also started automatically when the two LVSes boot.

In my configuration, Internet clients can still access our web servers when one of the NT machines is disconnected from the LVS. The backup LVS can automatically take over the role of the primary LVS when the primary LVS is shut down or disconnected from the backup LVS. However, I found that the NT web servers cannot reach the backup LVS through the common IP address 192.128.1.1, and all Internet clients stall when connecting to our web servers.

Later, I found that the problem may be due to the ARP caching in the web servers and the router. I tried limiting the ARP cache time to 5 seconds on the NT servers and half of the problem was solved, i.e. the NT web servers can reach the backup LVS through the common IP address 192.128.1.1 when the primary LVS is down. However, Internet clients still cannot connect when the LVS failover occurs.

Wensong

I just tried two LVS boxes with piranha 0.3.15. When the primary LVS stops or fails, the backup will take over and send out 5 gratuitous arp packets each for the VIP and the NAT router IP, which should flush the ARP caches in both the web servers and the external router.

After the LVS failover occurs, the established connections from the clients will be lost in the current version, and the clients need to reconnect to the LVS.

.. 5 ARP packets for each IP address, i.e. 10 for the VIP and the NAT router IP together. The log file looks like this:

Mar  3 11:12:14 PDL-Linux2 pulse[4910]: running command "/sbin/ifconfig" "eth0:5" "192.168.10.1" "up"
Mar  3 11:12:14 PDL-Linux2 pulse[4908]: running command "/usr/sbin/send_arp" "-i" "eth0" "192.168.10.1" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:14 PDL-Linux2 pulse[4913]: running command  "/sbin/ifconfig" "eth0:1" "172.26.20.118" "up"
Mar  3 11:12:14 PDL-Linux2 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
Mar  3 11:12:14 PDL-Linux2 pulse[4909]: running command "/usr/sbin/send_arp" "-i" "eth0" "172.26.20.118" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:17 PDL-Linux2 nanny[4911]: making 192.168.10.2:80 available

I don't know if the target addresses of the two send_arp commands are set correctly. I am not sure whether it makes a difference if the broadcast address or the source IP is used as the target address, or whether any target address is OK.

Horms

Are there just 5 ARPs, or 5 to start with and then more gratuitous ARPs at regular intervals? If the gratuitous ARPs only occur at failover, then once the ARP caches on hosts expire there is a chance that a failed host - whose kernel is still functional - could reply to an ARP request.

wanger (at) redhat (dot) com

When we put this together, I talked to Alan Cox about this. His opinion was to send 5 ARPs out, 2 seconds apart. If there is something out there listening that cares, it will pick them up.

The way piranha works, as long as the kernel is alive, the backup (or failed node) will not maintain any interfaces that are Piranha managed. In other words, it removes any of those IPs/interfaces from its routing table upon failure recovery.
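If you want to push gratuitous ARPs out by hand, for instance to test whether the upstream router honours them at all, a loop like the following sketch does roughly what piranha's send_arp does. It assumes the iputils arping tool; the VIP, interface and timing are placeholders taken from the discussion above.

# send 5 unsolicited (gratuitous) ARPs for the VIP, 2 seconds apart
VIP=192.168.10.1      # placeholder: the address that has just moved to this box
DEV=eth0              # placeholder: the interface now carrying the VIP
for i in 1 2 3 4 5; do
    arping -U -c 1 -I "$DEV" -s "$VIP" "$VIP"
    sleep 2
done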

37.2. Weird Software I: IE client

Note
We haven't heard back from this poster, so we don't know what the problem may be. This is in the HOWTO just in case someone else comes up with the same problem. As is noted elsewhere in this HOWTO (but I can't find where), IE assumes it's negotiating with an M$ server and doesn't do the tcpip handshaking in the IETF-approved manner.

Sebastiaan Tesink maillist-lvs (at) virtualconcepts (dot) nl 14 Jul 2006

On one of our clusters we have problems with ipvs at the moment. The cluster is built with 2 front-end failover ipvs nodes (managed with ldirectord) and 3 Apache back-end nodes, handling both http and https. So all the traffic to a virtual ip on port 80 or 443 of the front-end servers is redirected to the back-end webservers.

Two weeks ago, we were running a 2.6.8-2-686-smp Debian stable kernel, containing ipvs 1.2.0. We experienced weekly (every 6 to 8 days) server crashes, which caused the machines to hang completely without any log information whatsoever. These crashes seemed to be related to IPVS, since all our servers have exactly the same configuration, except for the additional ipvs modules on the front-end servers. The same Dell SC1425 hardware is used for all servers as well.

For this reason we upgraded the kernel to 2.6.16-2-686-smp (containing ipvs 1.2.1) on Debian stable, which we installed from backports (http://www.backports.org). There haven't been any crashes on these machines since. However, we have noticed two strange things since the upgrade. First of all, the number of active connections has increased dramatically, from 1,200 with the 2.6.8-2-686-smp kernel to well over 30,000 with the new kernel, while we are handling the same amount of traffic.

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  XXX.net wlc persistent 120
  -> apache1:https                  Route   10     2          0
  -> apache2:https                  Route   10     25         0
  -> apache3:https                  Route   10     14         0
TCP  XXX.net wlc persistent 120
  -> apache1:www                    Route   10     10928       13
  -> apache2:www                    Route   10     11433      6
  -> apache3:www                    Route   10     11764      10

We are using the following IPVS modules: ip_vs ip_vs_rr ip_vs_wlc
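When chasing a jump in ActiveConn like the one above, it helps to look at the director's connection table and tcp timeouts directly; a sketch, assuming an ipvsadm new enough to support the -c and --timeout options:

# show the tcp / tcpfin / udp timeouts currently in force
ipvsadm -L --timeout
# count connection entries per state (ESTABLISHED, FIN_WAIT, ...)
ipvsadm -L -n -c | awk 'NR>2 {print $3}' | sort | uniq -c | sort -rn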

Secondly, Internet Explorer users have been experiencing problems since exactly the time of the upgrade to the new ipvs version. With Internet Explorer, an enormous number of tcp connections is opened when visiting a website. Users are experiencing high loads on their local machines and crashing Internet Explorers. With any version of FireFox this works fine, by the way. Nevertheless, it started exactly with our IPVS upgrade.

I've added two tcpdumps at the bottom of this e-mail. The first one is from FireFox 1.0; the second is a tcpdump taken from the same machine, using IE6. I'm sorry for the length of the attachments, but it seems rather relevant.

37.3. Weird Hardware I: cisco catalyst routers gratuitously cache arp data (failover is slow)

Some hardware manufacturers release equipment with broken tcpip implementations.

Sean Roe May 06, 2004

I was looking for some info on cisco catalyst switches to help speed up the failover between my two director boxes. I have the following LVS-NAT setup:

                   |--------|-----|WebServer1|
       -- |LVS01|--|Cisco   |-----|WebServer2|
       |           |        |-----|WebServer3|
 ------|           |Catalyst|-----|WebServer4|
       |           |        |-----|WebServer5|
       -- |LVS02|--|Switch  |-----|WebServer6|
                   |--------|
  Virt     LVS                    Real
  IP       Servers                Servers

My problem is that if lvs01 fails, lvs02 takes over the load, but it takes forever (5-6 minutes) for the realservers to start using the new director. It also seems to work faster if I actually restart the httpd on each webserver. This is an LVS-NAT setup with multiple virtual IPs going to different ports on the webservers.

John Reuning john (at) metalab (dot) unc (dot) edu 23 Apr 2004

I've seen a similar delay in failover when using cisco routers. They don't update the internal MAC address table after receiving gratuitous arp packets during an LVS director failover event. I don't know if the heartbeat package uses arps to fail over, but keepalived does. Cisco routers seem to need icmp packets before they'll update the MAC address table. For LVS, the problem here is that the router continues to send traffic to the VIP at the master's hw address instead of shifting to the backup's hw address.

However, this wouldn't explain why your realservers route to the wrong address. The realservers and the LVS directors are on the same network segment, right?

The problem isn't with the layer-2 switches, it's with the next-hop router (the external default gateway for the LVS directors). It's common behavior with Cisco routers to update their arp cache table in response to source-generated packets but not in response to gratuitous arp packets.
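One workaround that follows from this is to source a little ordinary traffic from the VIP towards the router immediately after takeover, so the router relearns the MAC from a normal packet rather than from a gratuitous arp. A sketch with placeholder addresses, assuming iputils ping (whose -I option accepts a source address as well as an interface name):

VIP=192.0.2.100      # placeholder: the VIP that has just moved to this director
GW=192.0.2.1         # placeholder: the upstream cisco router
ping -c 3 -I "$VIP" "$GW"    # a few icmp packets sourced from the VIP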

Peter Mueller

I've seen a similar delay in failover when using cisco routers.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 24 Apr 2004

Me too. ISPs often configure managed routers not to respond to arp requests. You tend to have to ask them to flush the routing table if you change any of your router-facing IPs. I'm sure the routers can be configured to respond to ARPs.

Horms 07 May 2004

It sounds like there could be a problem with the gratuitous arps that are supposed to effect failover. I have used catalyst switches quite a lot; in fact both my test rack and the main switch for the network here at VA Japan use catalyst switches. I have found that they are quite aggressive about caching ARP information, and in some cases seem to affect proxy arp. But the current send_arp code in heartbeat seems to work just fine. Actually, I sometimes run that command manually after rearranging IP addresses on machines.
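When debugging this on a linux box attached to the switch, it's also worth looking at (and, if need be, flushing) that host's own arp cache before blaming the catalyst; a quick sketch:

# what this host currently has cached for its neighbours
ip neigh show dev eth0        # or: arp -an
# throw the entries away and let them repopulate
ip neigh flush dev eth0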

37.4. Weird Hardware II: autonegotiation failure on cisco CSS 11050

Ed Fisher efisher (at) mrskin (dot) com 08 Feb 2005

We're trying to set up LVS as a drop-in replacement for a pair of Cisco CSS 11050s. We aren't doing any fancy layer 7 stuff on the CSS, like passing certain directories to other servers, or anything like that. I got it all set up and working, and I was able to drop it in for the CSSes pretty smoothly. Our traffic spikes on the CSS reach 90Mbit/s - not huge by a lot of standards, but still sizable. The CSS was pushing out about 50Mbit/s when we cut over to the LVS-NAT box, and traffic immediately dropped to about 20Mbit/s, never breaking 30. A test download from a box on another network, with a 100Mbit connection to the Internet, was able to pull a single file at well over 40Mbit/s through the CSS. Through the LVS, it peaked at 1Mbit/s at the beginning, quickly fell to about 300kbit/s after a few seconds, and stayed there. The hardware for the LVS machine: P4 2.26GHz, 2GB of memory, two e1000 NICs - but both are hooked up to 100Mbit switches, since we haven't done our gigabit upgrade yet.

The problem turned out to be that I was plugging into an Extreme 24e2 switch, which was uplinked to an Extreme 1i router. It was the connection between the 24e2 and the 1i that was bad: the 1i was set to force full duplex, the switch was set to autonegotiate and so was, for annoying reasons, defaulting to half duplex. I plugged the LVS machine directly into the 1i, set the port to auto-negotiate on both the 1i and the LVS machine, linked up at 1000baseT FD, and performance increased dramatically. Testing without the CSS in the mix showed I was indeed able to saturate the link.
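To spot this kind of duplex mismatch from the director itself, ethtool (or mii-tool on older installs) reports the negotiated speed and duplex; a quick sketch:

# negotiated link settings for the NIC facing the switch
ethtool eth0 | grep -E 'Speed|Duplex|Auto-negotiation'
# on older boxes without ethtool
mii-tool -v eth0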

37.5. Weird Hardware III: Watchguard firewall at client site

Jacob Coby jcoby (at) listingbook (dot) com 29 Jul 2005

I've got a client IP addr that, on occasion, takes up a mass of connections and leaves them in an ESTABLISHED state. The IP addr belongs to a business that uses our website, but it's causing a DOS of sorts.

Software:

ipvsadm v1.21 2002/07/09 (compiled with popt and IPVS v1.0.4)
kernel-2.4.20-28.7.um.3
redhat 7.3

Symptoms:

  • Connections jump from 50-100 up to 300-600.
  • A single IP address takes up 80-90% of those connections.
  • All of the connections from that ip address are in the ESTABLISHED state.
  • Very few of them are actually sending/receiving data (when using tcpdump -xX -s 1024 "host bad.ip.addr"). I see a few packets with the F and S flags set.

I have disabled KeepAlive on the realservers (KeepAlive on allows them to close connections by themselves); it's too expensive to keep enabled for our site. Any ideas? snort logs don't show anything malicious from the ip. Because these are all ESTABLISHED connections to our website, each one ties up an apache process, eventually locking everyone else out.
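A quick way to see how badly a single client is hogging slots is to count ESTABLISHED connections per peer address, on the director and on a realserver; a sketch, assuming ipvsadm's -c option and the usual netstat column layout:

# on the director: ESTABLISHED connection entries per client address
ipvsadm -L -n -c | awk 'NR>2 && $3=="ESTABLISHED" {split($4,a,":"); print a[1]}' | sort | uniq -c | sort -rn | head
# on a realserver: established tcp connections per remote address
netstat -tn | awk '$6=="ESTABLISHED" {split($5,a,":"); print a[1]}' | sort | uniq -c | sort -rn | head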

Graeme Fowler

Since it's an application-level problem your LVS is doing exactly what it should :)

Jacob

Yeah, that's what I was thinking. I didn't know if LVS was accidentally not FIN'ing connections or whatever.

Graeme

If you switch on the extended status and server-status handlers in Apache, you can check what Apache thinks is happening, at the very least. If it's always the same source IP, I'd consider tracking down what the machine is and seeing if (for example) it's a broken proxy, or whether you can actually route back to it - if the latter then it could be a simple network problem (or even a complex one!).
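If mod_status is compiled in, the handler Graeme mentions can be queried from the shell; a sketch, assuming ExtendedStatus On and a /server-status location have already been allowed in httpd.conf:

# machine-readable summary: scoreboard, busy/idle workers, requests/sec
curl -s 'http://localhost/server-status?auto'

The full HTML page at /server-status also shows which client each apache process is serving.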

Jacob

I'd forgotten about the server-status modules - I've actually got them enabled. As for trying to route back to the ip addr, good idea; I'll try that next time the issue happens. I know the ip addr is a NAT firewall, so it could be the firewall, it could be an office computer, or it could be a personal notebook causing the problem. They've scanned all of their computers for viruses. My only thought at this point is that some spyware is screwing up the tcp stack and isn't closing connections.

later...

As an update, we traced this back to a problem with the firewall the company is using. Both of their firewalls are made by Watchguard (http://www.watchguard.com). They are different models - one is a Firebox and I'm not sure what the other is. The Firebox can apparently operate in two NAT modes: direct (?) and proxy. In direct mode it leaves hundreds of connections in the ESTABLISHED state; in proxy mode it works correctly. The other firewall can only act in direct mode, so it is still causing a problem.

37.6. Weird Hardware IV: wrong device gets MAC address

Note
An HP ProCurve switch is used in this installation, but Horms doesn't think it's the problem.

Troy Hakala troy (at) recipezaar (dot) com

In an LVS-NAT setup, on rare occasions the ARP cache of one of the realservers gets the wrong MAC address for the director, I assume after a re-ARP: it gets the MAC of eth0 instead of eth1. It's easy to fix with an arp -s, but I'd like to understand why this happens.

Horms 18 Nov 2005

Taking a wild guess, I would say that eth1 is handling a connection that has the VIP as the local address. And during the course of that connection, the local arp cache expires. The director sends an arp-request to refresh its cache. However, the source address of the ARP requests is the VIP, as it is a connection to the VIP that caused the ARP requests. ARP requests actually act as ARP announcements. And thus the MAC of eth0 is advertised as the MAC of the VIP.

Just a guess. If it's correct, then this is exactly the problem that the arp_announce proc entry is designed to address. Alternatively you can use arptables.

This is the second half of the ARP problem that has to be solved when doing LVS-DR, and I have some limited explanation of it at http://www.ultramonkey.org/3/topologies/hc-ha-lb-eg.html#real-servers Just ignore the bits that aren't about either arp_announce or /sbin/arptables -A OUT -j mangle...
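A sketch of the two fixes Horms mentions, for a director whose realserver-facing interface is eth1; the addresses are placeholders, and the arptables rule follows the mangle target referred to above:

# either: always prefer the interface's primary address when sending ARP requests
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
echo 2 > /proc/sys/net/ipv4/conf/eth1/arp_announce
# or: rewrite the source IP of outgoing ARPs that would otherwise carry the VIP
VIP=192.0.2.100      # placeholder: the VIP
DIP=10.0.0.9         # placeholder: eth1's primary address
arptables -A OUT -o eth1 -s "$VIP" -j mangle --mangle-ip-s "$DIP"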

It could also be that, for some completely perverse reason, eth1 is receiving an ARP request for the VIP. In that case you are really in the same boat as with LVS-DR.

37.7. Weird Hardware V: SonicWAll firewall rewriting sequence numbers

G. Allen Morris III gam3-lvs-users (at) gam3 (dot) net 6 Mar 2006

It seems that the firewall changes the sequence numbers of packets coming in (changing them back again on the way out). This of course breaks LVS-Tun, as the sequence number does not get restored when the packet leaves the network wrapped in the IPIP packet.

I cannot find any SonicWALL documentation and would like to know if anyone here knows whether there is a fix for this.
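One way to confirm this sort of rewriting is to capture the same connection on both sides of the firewall with absolute sequence numbers and compare them; a sketch, with a placeholder client address:

# run one capture outside the SonicWALL and one inside, then compare the tcp seq fields
tcpdump -n -S -i eth0 'tcp and host 192.0.2.50'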

37.8. Weird Hardware VI: cisco 2924XL switch

Tony Spencer tony (at) games-master (dot) co (dot) uk 10 Mar 2006

We have a couple of LVSes running on Centos 4.2. They are working fine and fail over as they should: the backup server takes the VIP when the primary server is taken offline. However I've noticed an issue when the failover occurs.

From inside our network we can still get to the web sites and the radius server using the VIP after the failover, but from outside our network we can't: the connection fails. Both servers are plugged into a Cisco 2924XL switch, which sees the IP move from one port to another when it fails over. Our upstream link is plugged into the same switch as well.

Could it be an arp issue upstream, because the MAC address behind the VIP has changed? I thought this at first, but even when I brought the primary server back up, so that the VIP was on the same MAC address as before, it still wouldn't work.
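Before blaming the upstream, it's worth confirming on the wire that gratuitous ARPs for the VIP actually go out at failover, and which MAC they advertise; a quick sketch run on either director (or a monitor port on the 2924XL):

# watch ARP traffic with link-level headers during a test failover
tcpdump -n -e -i eth0 arp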

37.9. Weird Hardware VII: unknown switches don't defragment

Andreas Lundqvist lvs (at) rsv (dot) se 1 Nov 2006

I have an LVS running on SuSE SLES9 with three linux realservers and direct routing. This is part of our intranet, which spans the country, and upon launch this Monday I had not run into any problems whatsoever. That isn't the case anymore. My problem is that connections from clients at two sites, in different cities, end up in the FIN_WAIT state and never time out, so I now have 15000 FIN_WAITs per realserver, still rising each day. Other sites, including the one I'm in, do not have this problem; the FIN_WAITs from my PC time out just fine.

I'm told by our network guys that just these two sites are indeed running different network hardware from our other sites (cisco).

Later - we seem to have found a workaround. The problem is that our WAN only allows packets with a maximum size of 1518 bytes, and with encryption headers we exceed that. Our other sites fix this in their switches before sending packets out onto the WAN, but this is apparently not supported by the switches at my two problem sites. So the fix was to lower the MTU on my three realservers to 1400. I had to reboot my LVS server to free the hanging FIN_WAITs - I tried to just unload the modules, but that hung the rmmod command.
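For reference, the MTU change described above is a one-liner per realserver (persist it in the distribution's interface configuration so it survives a reboot); a sketch for eth0:

# keep encrypted/encapsulated frames under the WAN's 1518 byte limit
ip link set dev eth0 mtu 1400      # or: ifconfig eth0 mtu 1400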