29. LVS: High Availability, Failover protection

29.1. Introduction

In a production system you want to be able to do planned maintenance: remove, upgrade, add or replace nodes, without interruption of service to the client. Machines may crash, so a mechanism for automatically handling this is required too. Redundancy of services on the realservers is one of the useful features of LVS. One machine/service can be removed from the functioning virtual server for upgrade or moving of the machine and can be brought back on line later without interruption of service to the client.

The most common problem found is loss of network access or extreme slowdown (or DoS). Hardware failure or an OS crash (on unix) is less likely. Spinning media (disks) fail near the end of the warranty period (in my experience) - you should replace your disks preemptively. The director(s) don't need hard disks. I've run my director from 30M of files (including perl and full glibc) pulled from my Slackware distribution. Presumably a mini Linux distribution would be even smaller. You should be able to boot off a floppy/cdrom/flash disk and load all files onto a small ramdisk. Logging information (e.g. for security) can be mailed/scp'ed at intervals to a remote machine via a NIC used for monitoring (note: not one of the NICs used to connect to the outside world or to the realservers). Reconfiguring services on the fly with ipvsadm will not interrupt current sessions. You can reasonably expect your director to stay up for a long time without crashing; it will not need to be brought down for servicing any more than any other diskless router.
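If you do run a diskless (or nearly diskless) director, a cron job is enough to ship the logs off the box. A minimal sketch, assuming a log host reachable only over the monitoring NIC and key-based scp already set up; the host, paths and schedule are illustrative, not part of any LVS package.

#!/bin/sh
# /usr/local/sbin/ship_logs.sh - run from cron, e.g. "*/15 * * * * /usr/local/sbin/ship_logs.sh"
# copies the director's logs to a remote log host over the monitoring NIC
LOGS="/var/log/messages /var/log/secure"
LOGHOST="logger@192.168.100.10"              # only reachable via the monitoring network
scp -q $LOGS "$LOGHOST:/var/log/director/" || logger "log shipping to $LOGHOST failed"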

An alternative to flash memory is a cdrom

"Matthew S. Crocker" matthew (at) crocker (dot) com 14 May 2002

My LVS servers are currently EXT2 but I'm either going to go with a diskless server using netboot or a CD based server. Our LVS is becoming our firewall (using NAT) and I'd rather have it stay bullet proof. If it's CD based and it gets compromised, I just reboot it.

The LVS code itself does not provide high availability. Other software is used in conjunction with LVS to provide high availability (i.e. to switch out a failed realserver/service or a failed director). Several families of tools are available to handle failout for LVS automatically. Conceptually they are a separate layer to LVS. Some set up LVS and the monitoring layer separately; others will set up LVS for you, and administratively the two layers are not separable.

Here's an article on the high cost of delivering high uptime computer service by Steve Levin. The author says that NASA runs on three 9's (99.9%) reliability. For this level of reliability, the system has to handle all faults without human intervention.

There are two types of failures with an LVS.

  • director failure

    This is handled by having a redundant director available. Director failover is handled in the Ultra Monkey Project [12] by heartbeat. Other code used for failover is vrrpd in keepalived.

    The director maintains session information (client IP, realserver IP, realserver port), and on failover this information must be available on the new director. On simple failover, where a new director is just swapped in, in place of the old one, the session information is not transferred to the new director and the client will lose their session. Transferring this information is handled by the Server State Sync Demon.

    The keepalived project [13] by Alexandre Cassen works with both Linux-HA and LVS. keepalived watches the health of services. It also controls failover of directors using vrrpd.

  • realserver failure, or failure of a service on a realserver

    This is relatively simple to handle (compared to director failover).

    An agent running on the director monitors the services on the realservers. If a service goes down, that service is removed from the ipvsadm table. When the service comes back up, the service is added back to the ipvsadm table. There is no separate handling of realserver failure. If the server catches on fire (a concern of Mattieu Marc marc (dot) mathieu (at) metcelo (dot) com), the agent on the director will just remove that realserver's services from the ipvsadm table as they go down.

    For LVS-DR, you cannot monitor a service running on the VIP on the realserver from the director (since the director also has the VIP). Instead you arrange for the service to bind to both the VIP and the RIP (or to 0.0.0.0) and test the health of the service bound to the RIP, as a proxy for the service running on the VIP.

    You can monitor a tcp service by connecting to the ip:port. Testing of udp services (e.g. DNS) is a little more problematic (there is a sketch of both kinds of check after this list).

    Note
    The DNS monitor that comes with Mon does a functional test on the realserver, asking it to retrieve a known DNS entry.

    Tim Hasson tim (at) aidasystems (dot) com 27 Jan 2004

    The attached patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/hasson_dns.patch) gets around the problem of ldirectord not doing any udp checks, by using Net::DNS to test whether the DNS server is resolving. You cannot simply do a udp connect check, as udp is connectionless :) That is why ldirectord will always keep all realservers on any udp service, regardless of the service status. So, you basically install Net::DNS from cpan, and apply the attached patch to /usr/sbin/ldirectord. You can change www.test.com in the patch (or in ldirectord after you have applied the patch) if you need to specify an internal domain or something else. The patch applied cleanly to several ldirectord versions, including the latest from ultramonkey heartbeat-ldirectord-1.0.4.rpm (I believe it was ldirectord 1.76).

    The configure script monitors services with Mon. Setting up mon is covered in Failover. The configure script will set up mon for you. Mon was the first tool used with LVS to handle failover. It does not handle director failover.

    In the Ultra Monkey Project, service failure is monitored by ldirectord.
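To make the above concrete, here is a rough sketch of the sort of check an agent on the director performs; the addresses, port and test hostname (www.test.com, as in the patch above) are illustrative, and in practice you would let mon, ldirectord or keepalived do this rather than a hand-rolled script.

#!/bin/sh
# minimal health check run on the director (illustrative addresses)
VIP=10.0.0.1            # virtual service address
RIP=192.168.1.2         # realserver address - for LVS-DR, check the RIP, not the VIP
PORT=80

# tcp check: can we complete a connect() to RIP:PORT within 5 seconds?
if nc -z -w 5 $RIP $PORT; then
    # service is up: (re)add the realserver (-g = LVS-DR; error ignored if already present)
    ipvsadm -a -t $VIP:$PORT -r $RIP -g -w 1 2>/dev/null
else
    # service is down: remove it from the ipvsadm table
    ipvsadm -d -t $VIP:$PORT -r $RIP
fi

# udp (DNS) needs a functional test rather than a connect test: ask the
# realserver to resolve a name it is known to serve.
# dig @$RIP www.test.com +time=2 +tries=1 > /dev/null || echo "DNS on $RIP failed"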

For service failure on the realserver or director failure (without the Server State Sync Demon), the client's session with the realserver will be lost. This is no different to what would happen if you were using a single server instead of an LVS. With LVS and failover however, the client will be presented with a new connection when they initiate a reconnect. Since only one of several realservers failed, only some of the clients will experience loss of connection, unlike the single server case where all clients lose their connection. In the case of http, the client will not even realise that the server/service has failed, since they get a new connection when clicking on a link. For session oriented connections (e.g. https, telnet) all unsaved data and session information will be lost.

If you have a separate firewall, it doesn't have to be Linux

Clint Byrum cbyrum (at) spamaps (dot) org 2005/22/05

Honestly, as good as LVS is for real server load balancing, for firewalls I like OpenBSD with CARP and pfsync. CARP+pfsync provides easy, scalable load balancing and HA for firewalls. pf, the OpenBSD firewall, is very well written and nicely designed. Give it a look, www.openbsd.com.

Note

Carp is available for Linux too.

anon

The part I do not understand is how to have an LVS cluster failover without using HA, since HA is limited to two nodes?

There are several packages available to do failover for LVS. Some of them overlap in functionality and some of them are for different purposes.

The LVS can have any number of realservers. Failover of realservers occurs by changing the ipvsadm table on the director.

Director failover occurs by transferring the VIP to the backup director, bringing down the primary director, and by using the backup copy of the connection table (put there by the synch demon) on the backup director. Once you've moved the VIP, the network needs to know that the VIP is associated with a new MAC address. To handle this, you can use Yuri Volobuev's send_arp distributed with the Linux-HA package (make sure you understand how arp works: see vip devices).
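The mechanics of the VIP move are simple enough to show by hand. A sketch of what heartbeat/keepalived do for you, with illustrative addresses; send_arp from the Linux-HA package or arping from iputils can send the gratuitous arp.

# on the backup director, at takeover: bring up the VIP ...
ip addr add 10.0.0.1/24 brd 10.0.0.255 dev eth0

# ... and send gratuitous arps so the router and switches learn the VIP's new MAC
arping -U -I eth0 -c 3 10.0.0.1

# the ipvsadm table on the backup must already be configured identically to the
# master's; if the sync demon has been running, the connection table is already here
ipvsadm -L -n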

Director failover and realserver failover are logically separate, occur independently and are done by different pieces of code e.g. MON only handles realserver failover.

Since both functionalities are required in a production LVS, some packages have them both. When configuring these packages you must remember that the director failover parts are logically separate from the realserver failover parts.

Both keepalived and Linux-HA handle director failover and monitor the state of service(s) on the realservers. Keepalived has both functionalities in the same piece of code and uses one configure script. Linux-HA uses ldirectord to handle realserver failover. I think that you now set up Linux-HA/ldirectord with one configure script (not sure).

29.2. Single Point of Failure (SPOF) - you can't protect against everything

Redundancy is a method to handle failure in unreliable components. As a way of checking for unreliable components the concept of a "single point of failure" (spof) is used. However some components are much more reliable than others (e.g. a piece of multicored ethernet cable). You can safely not replicate them. Other components are much more expensive than others: it's expensive to replicate them. You are not looking for a fail-proof setup: you are looking for a setup which has a failure rate and cost that the customer can live with.

Mark Junk

I want to set up an LVS cluster firewall but I have only one ethernet cable from my ISP... So my question is: how can I achieve this without introducing a single point of failure? Essentially I need to plug one cable into two boxes, splitting at x.

Joe

a hub/switch. They have low failure rates.

Yeah that would be a single point of failure though

Clint Byrum cbyrum (at) spamaps (dot) org 27 Oct 2004

You're already dealing with the cable from your ISP failing, the ultra redundant power going down, a meteor hitting the building, and their NOC tech setting fire to the routers. IMHO, if you really want to eliminate all SPOFs, you have to go multisite. At some point while dealing with the problems of going multisite, it just becomes ridiculous, and you have to ask yourself what your clientele really need in terms of uptime.

Sebastien BRIZE sebastien (dot) brize (at) libertysurf (dot) fr

A simple but expensive way is to use a couple of Routing Switches (L3/L4) and double-attachment switches, using RSTP (Rapid Spanning Tree Protocol) for the switch attachment and MRP (Metro Ring Protocol) (or even RSTP) between both Routing Switches.

  RS1 ------ RS2
   |          |
  S1          S2
   |          |
 FW1          FW2

RS1 and RS2 may be in different sites, and each piece of equipment may have two power supplies. This is much more expensive than a cable though.

Dana Price d (dot) price (at) rutgers (dot) edu

I've got an Ultramonkey 3.0 LB-DR setup, with two directors. I have heartbeat running over eth0 and a crossover on eth1. Since both heartbeat links have to fail for a failover to occur, I'm concerned that something like a bad nic, cable, or switch will bring my web service down (say eth0 fails but the crossover eth1 is still up). Is there any way to define two heartbeat links in ha.cf but to have it failover if a designated one dies? That way the directors can still maintain state over the second link and I'd avoid the split-brained cluster that comes with only 1 HB link.

Joe

this may be possible and someone else can give you the answer, but I'll talk about something else...

There's only so many things you can worry about, so you pick the ones that are most likely to go.

The most likely problem is your network connection will go down - this is usually out of your control.

Next is mechanical things like disks and fans, or connectors not making good contact. This is the problem you have to deal with (see below). Make sure you have ready-to-go copies of your disks, just sitting on the shelf next to the machine. You can update them by putting them in an external USB case and plugging them in somewhere, whenever you change your machine. Disks are really cheap compared to the cost of the labor of replacing them, or the cost of downtime. As well, pre-emptively swap out disks at their warranty date.

Possibly you have unreliable power. Where I live in the US, I get a 1 sec power bump once a week, when the power company must be changing the power feed with a mechanical switch. You need a UPS. Such things are unheard of in more advanced parts of the world, like Europe, where you can have a machine up for 400 days on the regular power without any interruptions and UPS are not needed at all.

I've never had a NIC just fail. I (accidentally) kicked the BNC connector on one and it died. I killed another with electrostatic shock by _not_ touching the computer case before putting my fingers near the empty RJ-45 socket. That's it - NICs generally don't die and neither do switches. The tcpip stack never locks up, unless the whole OS is hosed and that doesn't happen a whole lot with Linux and if it does, then heartbeat is gone too.

The connectors/cables to a NIC are another thing. Make sure your cables are multistranded and not a single strand for each wire. Flexing of single strand wire at the connector leads to cracks that show up as intermittent connections. Single strand has become the default since the .com boom, but it's only tolerated in the commodity market where people would rather save 1% cost than have a reliable connection. Nowhere else in the electronics industry is it used. There's probably not too much problem if the cables are just laid out and plugged in and left there without movement till the computer is junked, but if you're rearranging your cables frequently, use multicored cables.

Heartbeat has been used with LVS for years and we haven't had anyone come up with a split brain yet. (Maybe it happens and people don't think it worth mentioning.)

I would say that a pair of NICs with a single crossover cable is probably the most reliable part of your set up. I wouldn't bother making it redundant.

29.3. Stateful Failover

Anywhere that state information is required for continued LVS functioning, failover will have to transfer the state information to the backup machine. An LVS can have (some or all of) the following state information

  • director: ip_vs connection table (displayed with ipvsadm): i.e. which client is connected to which realserver.

    The Server State Sync Demon will transfer this information to a backup director. If this information is not transferred, the client will lose their virtual service. For http, this is not a problem, as the client will get a new connection by hitting "refresh" on the browser.

  • realserver: ssl session keys

    When setting up https as a service under LVS, https is set up with persistence, so that the multiple tcp connections required for an ssl session will all go to the same realserver.

    On realserver failover, these session keys are lost and the client has to renegotiate the ssl connection. Presumably other persistent information, which is much more important (e.g. shopping cart or database), is being stored on the LVS in a failover safe manner. Compared to loss of the customer data, loss of session keys is not a big deal and we are not working on a solution for this.

  • realserver: persistent data e.g. shopping cart on e-commerce sites.

    To allow customers to make purchases over an arbitrarily long period, and for their session to survive failover of the realserver to which they are connected, their database information needs to be preserved in a place where any realserver can get to it. Originally this was done with cookies (see the section on Section 13.9.1), but these are intrusive. Cookies can be stolen or poisoned and many people turn them off (clients shouldn't be allowing non-trusted machines to write anything on their computer). All customer state information should instead be stored at the LVS site (see the section on persistent connection).

    If you store persistent data on the virtual server, you must write your application to survive failover of the realserver and long timeouts. (The customer should be able to bring up information about vacations and leave it on the screen for the spouse to inspect when they come home. The spouse should be able to click to the next piece of information without the application crashing.)

  • tcp state: filter rules on the director and/or realserver

    This information is one level lower in the OSI network model than the ipvsadm connection information. Any particular client can make many tcp connections to a realserver.

    The director is a router (admittedly with slightly different rules than the normal routers) and as such just forwards packets. On failover, a director configured with no filter rules can be replaced with an identically configured backup with no interruption of service to the client. There will be a time in the middle of the changeover where no packets are being transmitted (and possibly icmp packets are being generated), but in general once the new director is online, the connection between client and realserver should continue with no break in established tcp connections.

    If the director has only stateless filter rules, then the director still appears as a stateless router and director failover will occur without interruption of service.

    With iptables, a router (e.g. an LVS director) can monitor the tcp state of a connection, (e.g. NEW, RELATED, ESTABLISHED). If stateful filter rules are in place (e.g. only accept packets from ESTABLISHED connections) then after failover, the new director will be presented packets from tcp connections that are ESTABLISHED, but of which it has no record. The new director will REJECT/DROP these packets.

    Harald Welte (of netfilter) is in the process of writing code for stateful failover of netfilter.

    Ratz, 01 Jun 2004

    He has actually done it (http://cvs.netfilter.org/netfilter-ha/ link dead Feb 2005) and we can expect it to surface in the users' world in a couple of months for beta testing.

    Each new rule has to find its place in the existing rule set, resulting in an n^2 loading of rules. It can take seconds for 50,000 rules to load. This is also being worked on with the new pkttables, ct/nf-netlink and whatever else the nf guys come up with. For large rule sets, use Hipac (http://www.hipac.org/).

    Even though a highly available form of stateful netfilter is now available, it really doesn't affect LVS because

    • LVS controlled packets do not traverse the netfilter framework in the normal manner and iptables is not aware of all of the transfers of packets.
    • LVS does its own connection tracking. Until early 2004, this was not particularly complete, but Julian has beta grade code out now which should satisfy most users (see Running a firewall on the director). This code allows stateful tracking of LVS controlled packets.
    • Failover of tcp connection state is already handled by Server State Sync Demon. Note that the synch demon only cares whether a connection is ESTABLISHED and only copies connections which are ESTABLISHED to the backup director. Connections in FIN_WAIT etc will timeout on their own and the backup director doesn't need to know about these states on becoming the master director.

    The situation on the realserver is a little different. If a realserver fails, for most services, there is no way to transfer the connection to a backup realserver and the connection to the client is lost anyway. In this case stateful filter rules on the director will not cause any extra problems with failover.

Summary: stateful filter rules are allowed on the realservers anytime you like. Stateful rules are allowed on the director only if you use Julian's nfct patches.
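To make the summary concrete, here is a hedged sketch of the difference on a director acting as a router; the addresses and the single http service are assumptions. The stateless rules survive director failover; the stateful rules drop the mid-stream packets of connections the new director has never seen, unless the tracking state is also replicated.

# stateless: each packet is judged on addresses/ports alone, so a freshly
# failed-over director passes packets of already-established connections
iptables -A FORWARD -p tcp -d 10.0.0.1 --dport 80 -j ACCEPT
iptables -A FORWARD -p tcp -s 10.0.0.1 --sport 80 -j ACCEPT

# stateful: only packets belonging to connections this box has tracked are
# accepted, so after failover the ESTABLISHED packets of old connections
# fall through to the final DROP
iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -p tcp -d 10.0.0.1 --dport 80 -m state --state NEW -j ACCEPT
iptables -A FORWARD -j DROP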

octane indice octane (at) alinto (dot) com 13 Apr 2006

Do you know if you can do something like carp+pfsync with linux+ipvs? My goal is to have two director/firewall machines, a master and a backup, both sharing the same IP (the VIP). I can handle the LVS part easily with keepalived and a VRRP method and the same ruleset, but it means that all the connections tracked by the firewall rules are lost when the master comes down. I first want a firewall with failover. _Then_ if it works, I would add the director on top of it. I want to use it under linux, so the carp/pfsync solution is not available. The question is: is the sync daemon helpful to synchronize firewall state or not?

I read then http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.server_state_sync_demon.html

but I saw:"Note that the feature of connection synchronization is under experiment now, and there is some performance penalty when connection synchronization, because a highly loaded load balancer may need to multicast a lot of connection information. If the daemon is not started, the performance will not be affected. "

and from: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.failover.html

"Honestly, as good as LVS is for real server load balancing, for firewalls I like OpenBSD with CARP and pfsync. CARP+pfsync provides easy, scalable load balancing and HA for firewalls. pf, the OpenBSD firewall, is very well written and nicely designed. Give it a look, www.openbsd.com. Note: Carp is available for Linux too." Yes, carp is available for linux, but not pfsync, which is what I need.

I asked Julian about http://www.ssi.bg/~ja/nfct/: "Does it mean that the master firewall will update the backup firewall with its conntrack state?". His answer was "No". It seems that there is no way to use a cluster firewall with conntrack replication under linux.

Joe

no-one has posted that they've done it. Any protocol that updates state information onto a backup machine is going to have overhead. pfsync updates the firewall state (I believe) on the backup, but not the ipvs connection table. Even with carp, you still have to transfer the ipvs table. The ipvs synch state demon only keeps track of the ipvs controlled connections, not the firewall state - it won't help you.

Ratz 20 Apr 2006

IPVS has not much to do with firewalling, you can achieve CARP+pfsync like setups using VRRP+ctsync under Linux.

Does ctsync not work? I know that you've also asked in the nf-failover ml. It's sort of maintained (there have been a couple of patches to ct_sync this year already) and it sort of works for the handful of people that actually use it. It had problems with tcp window tracking the last time I tried it but Krisztian and Harald are certainly more than happy to fix a couple of issues related to ctsync problems. People send in patches to ct_sync regularly to netfilter-devel and some even maintain out of tree kernel patches: http://vvv.barbarossa.name/files/ct_sync/ Please try out the available software and if this does not work, complain at netfilter-dev ml ;).

29.4. Director failure

What happens if the director dies? The usual solution is duplicate director(s) with one active and one inactive. If the active director fails, then it is switched out. Although everyone seems to want reliable service, in most cases people are using the redundant directors in order to maintain service through periods of planned maintenance rather than to handle boxes which just fail at random times. At least no-one has admitted to a real director failure in a production system.

Matthew Crocker matthew (at) crocker (dot) com 23 May 2002

I have a production LVS server running with 3 realservers handling SMTP, POP3, IMAP for our QMAIL server. We process about a million inbound connections a day. I've never had the primary LVS server crash but I have shut it down on purpose (yanked the power cord) to test the fail over. Everything worked perfectly.

We use QMAIL-LDAP for our mail server and Courier-IMAP for the IMAP server. QMAIL saves mail in Maildir format on our NFS server (Network Appliance F720) as a single qmailu user. All aliases, passwords, mail quota information is stored in LDAP (openldap.org). The cluster is load balanced using LVS currently with Direct Routing but I'm going to switch to NAT very soon.

Bradley McLean bradlist (at) bradm (dot) net 23 May 2002

We run a pair of load balancers in front of 5 real http/https webservers, using keepalived. In earlier versions of LVS, a memory leak problem caused a failover to occur about once every three days (might have been 0.9.8 with keepalived 0.4.9 + local patches). We're on 1.0.2 and 0.5.6 now, with no problems, except that we don't quite have an auto failback mechanism that works correctly.

We preserve connections quite nicely during the failover from the master to the backup, however once in that state, if the master comes back up, it takes over without capturing the connection states from the backup. I believe that Alexandre is close to solving this if he hasn't already; frankly we've been concentrating on other pieces of our infrastructure, and since we've had no failures since we upgraded versions, we haven't been keeping up.

We're relatively small, serving up between .5 and 2.5 T1's worth of traffic. The balancers are built from Dell 2350s with 600Mhz PIII and 128MB, with DE570TX quad tulip cards in each.

We run NAT, with an external interface that provides a non-routable IP address (there's a separate firewall up front, before the web cluster), an internal interface to our web servers, an internal interface to our admin / backup network, and an interface on a crossover cable to the other balancer used for connection sync data. We could consolidate some of these, but since NICs are cheap, it keeps everything conceptually simple and easy to sniff to prove it's clean.

Magnus Nordseth magnun (at) stud (dot) ntnu (dot) no 23 May 2002

We have been running lvs in a production site for about 8 months now, with functional failover for the last 3.

We use keepalived for failover and healthchecking. The setup consists of 4 realservers (dell 2550, dual pIII 933, 1Gb RAM) and 2 directors (pII 400). The main director has only been down for maintenance or demonstration purposes. The site has about 2 million hits per day, and the servers are pushing between 20 and 70 Gb of data each day.

Automatic detection of failure in unreliable devices by other unreliable devices is not a simple problem. Currently LVS director failure in an LVS is handled by code from the Linux HA (High Availability) project. (Alexandre Cassen is working on code based on vrrpd, which will also handle director failover). The Linux HA solution is to have two directors and to run a heartbeat between them. One director defaults to being the operational director and the other takes over when heartbeat detects that the default director has died.

                        ________
                       |        |
                       | client |
                       |________|
			   |
                           |
                        (router)
                           |
			   |
          ___________      |       ___________
         |           |     |  DIP |           |
         | director1 |-----|------| director2 |
         |___________|     |  VIP |___________|
               |     <- heartbeat->    |
               |---------- | ----------|
                           |
         ------------------------------------
         |                 |                |
         |                 |                |
     RIP1, VIP         RIP2, VIP        RIP3, VIP
   ______________    ______________    ______________
  |              |  |              |  |              |
  | realserver1  |  | realserver2  |  | realserver3  |
  |______________|  |______________|  |______________|

LVS is one of the major uses for the Linux HA code and several of the Linux HA developers monitor the LVS mailing list. Setup problems can be answered on the LVS mailing list. For more detailed issues on the working of Linux HA, you'll be directed to join the Linux HA mailing list.

Fake, heartbeat and mon are available at the Linux High Availability site.

There are several overlapping families of code being developed by the Linux HA project and the developers seem to contribute to each other's code. The two main branches of Linux HA used for LVS are UltraMonkey and vrrpd/keepalived. Both of these have their own documentation and are not covered in this HOWTO.

29.5. UltraMonkey and Linux-HA

The UltraMonkey project [14] is a packaged version of LVS combined with Linux HA to handle director failover, written by Horms. It uses LVS-DR and is designed to load balance on a LAN. UltraMonkey uses Heartbeat from the Linux-HA project for failover and ldirectord to monitor the realservers.

Alternatively, instructions for setting up Linux-HA from rpms have been written by Peter Mueller; this is functionally equivalent to the Ultra Monkey code.

29.5.1. Two box HA LVS

Doug Sisk sisk (at) coolpagehosting (dot) com 19 Apr 2001

Is it possible to create a two server LVS with fault tolerance? It looks straight forward with 4 servers ( 2 Real and 2 Directors), but can it be done with just two boxes, i.e. directors, each director being a realserver for the other director and a realserver running localnode for itself?

Horms

Take a look at www.ultramonkey.org, that should give you all the bits you need to make it happen. You will need to configure heartbeat on each box, and then LVS (ldirectord) on each box to have two realservers: the other box, and localhost.
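For reference, here is a sketch of the ipvsadm table that the active box ends up with in such a two-box setup (ldirectord builds the equivalent from its config file); the VIP and RIP are illustrative.

# on whichever box currently holds the VIP (10.0.0.1):
ipvsadm -A -t 10.0.0.1:80 -s rr
# localnode: this box serves some of the requests itself
ipvsadm -a -t 10.0.0.1:80 -r 127.0.0.1:80 -w 1
# the other box is the second realserver (LVS-DR shown; it must handle the
# arp problem for the VIP in the usual way)
ipvsadm -a -t 10.0.0.1:80 -r 192.168.1.2:80 -g -w 1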

29.5.2. heartbeat and connection state synch demon

Michael Cunningham m (dot) cunningham (at) xpedite (dot) com

I have heartbeat running between two LVS directors. It is working great. It can fail back and forth without issues. Now I would like to set up connection state synchronization between the two directors, but I have two problems/questions. Can I run the multicast connection sync over my 100 mbit private lan link which is being used by heartbeat?

How can I set up heartbeat to always run..

director:/etc/lvs# ipvsadm --start-daemon=master --mcast-interface=eth1

on the current master.. and

director:/etc/lvs# ipvsadm --start-daemon=backup --mcast-interface=eth1

on the current slave at all times?

The master can run a script when it starts up/obtains resources but I don't see anyway for the slave to run a script when it starts up or releases resources.

Lars Marowsky-Bree lmb (at) suse (dot) de 02 Feb 2002

The slave runs the resource scripts with the "stop" action when the resources are released, so you could add it in there; anything you want to run before the startup of heartbeat is separate from that and obviously beyond the control of heartbeat.

You are seeing the result of heartbeat's rather limited resource manager, I am afraid.
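Following Lars' hint, one way to do this is a small resource script listed in haresources: heartbeat calls it with "start" when the node acquires the resources and with "stop" when it releases them. This is only a sketch; the script name and interface are assumptions, and the --stop-daemon syntax differs a little between ipvsadm versions (newer ones want a state argument).

#!/bin/sh
# /etc/ha.d/resource.d/syncdemon - list it in haresources after the IPaddr resource
# start = this node has just become the active director -> run the master demon
# stop  = this node is releasing the resources          -> fall back to backup
# (at boot, before heartbeat has ever given this node the resources, you may also
# want an init script that starts the backup demon)
MCAST_IF=eth1

case "$1" in
  start)
        ipvsadm --stop-daemon 2>/dev/null
        ipvsadm --start-daemon=master --mcast-interface=$MCAST_IF
        ;;
  stop)
        ipvsadm --stop-daemon 2>/dev/null
        ipvsadm --start-daemon=backup --mcast-interface=$MCAST_IF
        ;;
  status)
        ipvsadm -L --daemon
        ;;
esac
exit 0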

29.5.3. serial connection problems with Linux-HA

"Radomski, Mike" Mike (dot) Radomski (at) itec (dot) mail (dot) suny (dot) edu 13 Mar 2002

I am experiencing a strange problem with my LVS+Heartbeat cluster. I have two systems, both running ipvsadm and heartbeat (serial and x-over Ethernet). Every 10 hours I get a cpu spike (load of 1.1) on the primary system and then a few minutes later I get the same spike on the secondary system. The system sustains a load of ~1 for about 20 minutes and then returns to ~0. Neither top nor ps shows the active process causing the spike. The spike lasts for about 20 minutes and then everything is fine. The ipvsadm piece still redirects and load balances with no viewable performance problem. Is there anything I can do to track this problem down?

Lars

A load of 1.1 doesn't mean a CPU spike; it might simply mean that there is a zombie process for some reason. ps should show this (a process in D or Z state); ps fax will show you a process tree so you can figure out where it came from.

It has been over 10 hours since the last sustained spike. I remember when setting up the heartbeat, the serial connection was very slow and intermittent. If I cat'ed information to the serial port, it would take about 30 seconds to reach the other end of the null modem cable. As per a suggestion on Google Groups, I tried to set the serial port with the following:

/bin/setserial /dev/ttyS1 irq 0

This worked both with simple serial communication and heartbeat. But I found in the dmesg and the logs the following:

ttyS: 2 input overrun(s)
ttyS: 2 input overrun(s)

After shutting off the serial heartbeat, the overall load dropped about 0.02. I have not seen the sustained spike since.

Lars

It is something in the kernel, because that is the only thing which is not accounted for by anything else. There are other options; if you boot the kernel with "profile=2", you can use readprofile to compare the patterns for 5 minutes during the events and outside. Remember to use "readprofile -r" to reset the profiling data when doing so, so the counters are clean, and do NOT run top during that time, because top traverses the /proc file system every second or so, which greatly obscures the profiling results.

Paul Baker pbaker (at) where2getit (dot) com 14 Mar 2002

I had major reliability issues when I tried using the serial connection with heartbeat. I attributed it to poor chipset design from Intel. My load balancers are 1U Celeron systems that use that crappy i810 chipset. Pretty much whenever there was any load on the server (such as during rsync replication between the master and slave loadbalancers), the serial connection would completely time out, which would cause complete havoc on my lvs. The slave would then think the master was down, and start to bring itself up as the director. Let me tell you from experience, it really sucks when you have two directors fighting over arp for the ip addresses of the lvs. So I just decided to bite the bullet and switch from serial heartbeats to udp. I haven't had a problem since.

29.6. Keepalived and Vrrpd

Alexandre Cassen alexandre (dot) cassen (at) wandadoo (dot) fr, the co-author of keepalived (http://keepalived.sourceforge.net) and the author of LVSGSP, has produced keepalived, which sets up an LVS and monitors the health of the services on the realservers, and a vrrpd demon for LVS which enables director failover. You build one executable, keepalived, which has (optionally one or both of) the vrrpd and keepalived functions. If you just want failover between two nodes, you only need the vrrpd part of the build.

(notes here produced from discussions with Alexandre). Keepalived will

  • setup an LVS from scratch (services, forwarding method, scheduler, realservers)
  • monitor the services on the realservers and failout dead services on the realservers
  • failover machines (for LVS, this will be the directors)

There are examples of using keepalived/vrrpd in a HOWTO for LVS-NAT and another HOWTO for LVS-NAT with the director being patched with Running a firewall on the director, to act as a firewall as well. The options available for keepalived.conf are documented in doc/keepalived.conf.SYNOPSIS. Sample keepalived.conf files are in ./doc/samples/keepalived.conf.* in the source directory. An elementary set of manpages is available.

The functionality for the vrrpd failover is similar to that for heartbeat. vrrpd adds IP(s) to ethernet card(s) with ip, when the machine is in the master state and removes them, when it is in the backup state. On bringing up the IP(s), vrrpd sends a gratuitous arp for the new location of the IP, flushing the arp tables of other machines on the network. This procedure leaves the arp tables unchanged for the other (unmoving) IP(s) on the same interface.

Note

There is some confusion about patents connected to VRRP. Here is some info.

FreeBSD has CARP (http://pf4freebsd.love2party.net/carp.html), the Common Address Redundancy Protocol, written to head-off possible problems with cisco claims that its patents on Hot Standby Router Protocol (HSRP) cover the same technical areas as VRRP (http://software.newsforge.com/software/04/04/13/1842214.shtml). Alexandre has contacted cisco about this. CARP has been ported to Linux (http://www.ucarp.org/).

Alexandre Cassen acassen (at) freebox (dot) fr 9 Mar 2006

CARP is close to VRRP - they have the same Finite State Machine (FSM). The patent on VRRP is not applicable to Keepalived since I made some assumptions that make the implementation not as rfc compliant as other implementations. The VRRP patent for the linux implementation is not a problem. The CARP code, except for the use of a hash instead of an IP address and some other cosmetic stuff, is VRRP like. VRRP is an IETF standard. IMHO, what is important for such a protocol is not re-inventing the FSM (by writing CARP), but stacking components around it to make it useful (like sync_group, ....). VRRP adoption is already made and if CARP doesn't bring new innovation concepts, this will slow down adoption.

S.Mehdi Sheikhalishahi 2005-05-21

Is there any comparison between Load Balancing and HA Solutions? What's the best for a firewall?

Clint Byrum cbyrum (at) spamaps (dot) org 2005/22/05

as good as LVS is for real server load balancing, for firewalls I like OpenBSD with CARP and pfsync. CARP+pfsync provides easy, scalable load balancing and HA for firewalls. pf, the OpenBSD firewall, is very well written and nicely designed. Give it a look, www.openbsd.com.

Alexandre 31 Dec 2003

Gratuitous ARP is well supported by routing equipment. Only one packet is lost during takeover.

In earlier versions of vrrpd, the vrrpd fabricated a software ethernet device on the outside of the director (for the VIP) and another for the inside of the director (for the DIP), each with a MAC address from the private range of MAC addresses (i.e. one that will not be found on any manufactured NIC). When a director failed, vrrpd would re-create the ethernet devices, with the original IPs and MACs, on the backup director. Other machines would not have any changes in their arp tables (the IP would move to another port on a switch/hub though) and would continue to route packets to the same MAC address. Unfortunately this didn't work out.

Alexandre 31 Dec 2003

We discussed this with Julian, and Jamal. The previous code didn't handle the VMAC cleanly. It consisted of changing the interface MAC address inside the kernel to fake the needed one... This is not clean and not scalable since this restricted us to only one VMAC per interface (multiple VMACs were not supported). Later Julian produced his parp netlink patch that offers an arp reply from a VMAC. This did not work, as all traffic stayed with the interface MAC. Later, on the netdev ML, we discussed this with Julian and Jamal, and the best solution was to provide a patch to the ingress/egress code to support these VMAC operations. This code hasn't been written.

keepalived listens on raw:0.0.0.0:112, so you can include the following in /etc/services.

vrrpd           112/raw         vrrpd

Here's part of the output of netstat -a after starting keepalived on a machine with two instances of vrrpd (one for each interface) (there are no other machines running vrrpd on the network).

director# netstat -a | grep vrrp
raw        0      0 *:vrrpd                 *:*                     7
raw        0      0 *:vrrpd                 *:*                     7
raw        0      0 *:vrrpd                 *:*                     7
raw        0      0 *:vrrpd                 *:*                     7
unix  2      [ ACC ]     STREAM     LISTENING     5443463 /tmp/.vrrp
unix  3      [ ]         STREAM     CONNECTED     5443487 /tmp/.vrrp

After starting keepalived on another director, note that one of the vrrpds has received some packets.

director# netstat -a | grep vrrp
raw        0      0 *:vrrpd                 *:*                     7
raw        0      0 *:vrrpd                 *:*                     7
raw      264      0 *:vrrpd                 *:*                     7
raw        0      0 *:vrrpd                 *:*                     7
unix  2      [ ACC ]     STREAM     LISTENING     5456088 /tmp/.vrrp
unix  3      [ ]         STREAM     CONNECTED     5456118 /tmp/.vrrp

Although netstat only shows that vrrpd is bound to 0.0.0.0:vrrpd, if you are wondering how to write your filter rules, vrrpd is only bound to the NIC specified in keepalived.conf. VRRP advertisements are sent/received on this protocol socket, using multiplexing.

The src_addr of the multicast packet is the primary IP of the interface. Multicast permits you to alter the src_addr (with mcast_src_ip) if you want to hide the primary IP. If you do this, the socket will still be bound to 0.0.0.0 according to netstat

(Here is further info on multicast without IPs on NICs.)
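For the filter rules themselves, a hedged sketch: VRRP is IP protocol 112 sent to the multicast group 224.0.0.18, and keepalived's IPSEC-AH authentication adds protocol 51. eth1 as the sync interface is an assumption.

# accept VRRP adverts from the peer director on the internal/sync interface only
iptables -A INPUT  -i eth1 -p 112 -d 224.0.0.18 -j ACCEPT
iptables -A OUTPUT -o eth1 -p 112 -d 224.0.0.18 -j ACCEPT
# with IPSEC-AH authentication in keepalived, protocol 51 is used as well
iptables -A INPUT  -i eth1 -p 51  -d 224.0.0.18 -j ACCEPT
iptables -A OUTPUT -o eth1 -p 51  -d 224.0.0.18 -j ACCEPT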

Alexandre Dec 2003

VRRP is interface specific (like HSRP and other hot standby protocols) and uses a socket pair for sending/receiving adverts. The sockets are bound to the specified interface. When you configure a VRRP instance on interface eth0, VRRP will create a raw vrrp-proto socket and bind the socket to interface eth0 (using the bindtodevice kernel call). Then it joins the VRRP multicast group. So this socket will receive VRRP adverts only on eth0. The same thing is done for the sending socket: the vrrp proto sending socket is bound to the interface the VRRP instance belongs to. Additionally, if you have more than one VRRP instance on the same NIC (for an active/active setup) then they will share the same socket. The VRRP code will then demux the incoming VRRP adverts, performing a hash lookup according to the incoming VRRP advert's VRID header field. It performs an O(1) lookup (hash index based on the VRRP VRID field). If you run IPSEC-AH VRRP and normal VRRP on the same interface then the code will create 2 sockets, one for each protocol (51 and 112). The rest is the same: demux on a shared socket according to the incoming VRRP VRID field.

VRRP is based on adverts sent over multicast (the advert interval is determined by advert_int in keepalived.conf, normally configured for 1 sec). This is an election protocol: the master is the one with the highest priority. When the master crashes, an election is held and the node with the next highest priority becomes master.

keepalived has the same split brain problem as heartbeat. heartbeat tries to beat this by having multiple communication channels. vrrpd only has one channel.

There isn't a keepalived status, so you can't programmatically determine the state of any machine. You can look for the moveable IP with ip addr show. You can also inspect the logs (look for "BACKUP" || "MASTER").

Note
Unfortunately as with any failover setup, failover is not guaranteed in the case of a sick machine. If one machine is in an error state, e.g. vrrpd dies on the master machine, the logs will show the last entry as MASTER (but it will be an old entry), while another machine which takes over the master role will have a (current) entry as MASTER in the logs. Presumably you could use notify_master, notify_backup and notify_fault to touch files which you could inspect later to determine the state of the machines. This will have problems too in error states. Inspection of IP(s) will also be meaningless in an error condition. The current mechanism for handling machines in a dubious state is to programmatically power cycle them (a process called STONITH). Hopefully the good machine reboots the sick machine.

There is no way to force a master-backup transition (e.g. for testing). However, you can relink keepalived.conf to a file with a lower priority and re-HUP keepalived.

You can force a master-fault transition by downing the interface.
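Here is a hedged sketch of these checks and nudges from the shell; the VIP, interface, config paths and log file are assumptions, and the exact wording of the log messages varies between keepalived versions.

# which state is this director in? Look for the moveable VIP ...
ip addr show dev eth0 | grep -q 10.0.0.1 && echo "probably MASTER" || echo "probably BACKUP"
# ... and/or the last vrrp transition logged
grep -E "(MASTER|BACKUP|FAULT)" /var/log/messages | tail -1

# force a master->backup transition for testing: point keepalived at a config
# with a lower priority and re-HUP it
ln -sf /etc/keepalived/keepalived.conf.lowprio /etc/keepalived/keepalived.conf
kill -HUP $(pidof keepalived)

# force a master->fault transition by downing the monitored interface
ip link set eth0 down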

vrrpd only works with PCI ethernet cards (all of which have an MII transceiver) but not with ISA ethernet cards (which don't). I have a machine with 3 ethernet cards, 1 PCI and 2 (identical) ISA. vrrpd works with the PCI card and one of the ISA cards (making transitions on failover), but not with the last ISA card (eth2) which vrrpd detects as being in a FAULT condition (but vrrpd doesn't execute the "to_fault" script).

Alexandre Cassen, 18 Jan 2004

In fact, ISA cards don't support MII since they don't have an MII transceiver. So media link detection will not work on your 3c509.

Joe

I don't understand why the option "state MASTER|BACKUP" exists, since whatever value I use there is overridden by the election which occurs about 3 seconds after vrrp comes up. It doesn't help with the split brain problem (not much does).

In fact this is just a kind of speed bootstrap strategy... But you are right, an election follows anyway. And then you can have a node configured for BACKUP with a priority higher than a node configured for MASTER.

OK, I will set all states to MASTER and let them have an election.

Behaviour on killing vrrpd

If I kill keepalived, I would like the machine to run the scripts in "to_backup". At the moment I'm running the "to_backup" scripts in the rc.keepalived stop init file before it kills vrrpd. After vrrpd is shut down, a vrrpd on another machine will become master and I would like the machine where vrrpd has been shut down to be in the BACKUP configuration (i.e. not holding the movable IPs or pointing to the wrong default gw). I can't think of a reason why vrrpd should leave the machine in one state or another when it exits. Has the behaviour I see been chosen after some thought or is it just how it works right now?

Currently I have not implemented 'administration state forcing' that overrides the running vrrp FSM. I hope I will find time for this.

logs

I would like the logs to show not only the state of vrrpd on that machine, but following an election or transition, I would like to know which other machines were involved and what state they are in. At the moment the logs don't tell me whether a machine became master because it won the election or it didn't find any other machines. Is it possible to have more logging info like this in a future version of keepalived? I'd like to be able to look at a log file and see what state a machine thinks it is in and what state it thinks all the other machines are in.

In fact this is the VRRP spec. I mean, during an election, if a node receives a higher prio advert then it will transit to the backup state; since this is a multicast design, the master node will not hear the remote 'old master' transition, since it has won the election. The VRRP spec doesn't support a kind of LSA database like OSPF provides (each node knowing the state of the others). I spoke with the IETF working group about this last year but this feature didn't receive much echo :). But this could be nice.... I like this :) a kind of admin command line requesting neighbors, ... this could be useful.

Padraig Brady padraig (at) antefacto (dot) com 22 Nov 2001

Haven't Cisco got patents on vrrpd? What's the legal situation if someone wanted to deploy this?

Michael McConnell michaelm (at) eyeball (dot) com

no - ftp://ftp.isi.edu/in-notes/rfc2338.txt

Andre

In short yes : http://www.ietf.org/ietf/IPR//VRRP-CISCO, IBM too : http://www.ietf.org/ietf/IPR/NAT-VRRP-IBM

In fact there are 2 patents (http://www.foo.be/vrrp/ link dead Feb 2005):

  • CISCO - http://www.delphion.com/details?pn=US06108300__
  • Nortel Network - http://www.delphion.com/details?pn=EP01006702A3

When you read these patents you can't find any OpenSource restriction... All that I can see is the commercial product implementation... I plan to post a message to the IETF mailing list to present the LVS work on VRRP and to enlarge the debate on OpenSource implementation and the eventual licence...

9 Jan 2002

answer from Robert Barr, CISCO Systems

Cisco will not assert any patent claims against anyone for an implementation of IETF standard for VRRP unless a patent claim is asserted against Cisco, in which event Cisco reserves the right to assert patent claims defensively.

I cannot answer for IBM, but I suspect their answer will be different.

29.7. monitoring/failover messages should stay internal to LVS

The LVS server state synch demon, vrrpd and heartbeat need to send messages between the backup and active director. You can send these over the RIP network, or via a dedicated network, but you shouldn't send these packets through the NIC that faces the internet (the one that has the VIP). Reasons are

  • It allows outside people to hack your boxes.
  • The LVS (director, realservers) must appear as a single (highly available) server. The clients must not be able to tell that the server is composed of several machines working together. The clients must not be able to see heartbeat packets.
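A hedged sketch of enforcing this with filter rules on the director, assuming eth0 faces the internet and eth1 is the dedicated sync/heartbeat link; the ports and multicast group shown are the usual defaults for heartbeat (udp/694) and the ip_vs sync demon (udp/8848 to 224.0.0.81), and VRRP is protocol 112.

# these should never appear on the outside NIC; drop them if they do
iptables -A INPUT  -i eth0 -p udp --dport 694 -j DROP     # heartbeat
iptables -A OUTPUT -o eth0 -p udp --dport 694 -j DROP
iptables -A INPUT  -i eth0 -p 112 -j DROP                 # vrrp adverts
iptables -A OUTPUT -o eth0 -p 112 -j DROP
iptables -A INPUT  -i eth0 -p udp -d 224.0.0.81 --dport 8848 -j DROP   # ip_vs sync demon
iptables -A OUTPUT -o eth0 -p udp -d 224.0.0.81 --dport 8848 -j DROP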

29.8. Parsing problems with vrrpd config file

(Apr 2006, from several people). The parser is a little touchy.

This format is in the manpage and doesn't work.

vrrp_sync_group ACDDB_mysql_eins {
        group { vip_mysql_eins }
}

vrrp_sync_group ACDDB_mysql_zwei {
        group { vip_mysql_zwei }
}

This works.

vrrp_sync_group ACDDB_mysql_eins {
         group {
                 vip_mysql_eins
         }
}


vrrp_sync_group ACDDB_mysql_zwei {
         group {
                 vip_mysql_zwei
         }
}

This doesn't

MISC_CHECK { misc_path "/path/to/script" }

But this one does work.

MISC_CHECK {
   misc_path "/path/to/script"
}

29.9. Two instances of vrrpd

It is possible to have two independent instances of vrrpd handling two VIPs, which can migrate independently between directors.

Alexandre

you can have 2 VRRP VIPs active on different routers... then you have a VRRP configuration with 2 instances. On Director1, one instance with VIP1 in a MASTER state and one instance for VIP2 in a BACKUP state => symmetry for Director2. Both instances are on the same interface with a different router_id.

Alex alshu (at) tut (dot) by

see http://keepalived.org/pdf/LVS-HA-using-VRRPv2.pdf http://keepalived.org/pdf/UserGuide.pdf

Graeme Fowler graeme@graemef (dot) net 27 Apr 2006

The router_id needs to be the same on each director for each vrrp_instance. That value is sent out in the advertisement and is necessary for the pair of directors to synchronise. You only need different priorities on your primary and failover director. You can start up both as MASTER or BACKUP and let them decide according to priority, what does what. Just make sure the router_id values are the same for each instance.

29.10. HA MySQL

Dominik Klein klein (dot) dominik (at) web (dot) de 25 Apr 2006

Goal: My goal is an HA MySQL database. As the MySQL cluster storage engine lacks several important features (e.g. foreign keys), I cannot use a MySQL cluster. So now I use MySQL replication in a master-to-master setup. As my clients are able to re-connect after a connection loss, but cannot connect to a different IP on connection loss, a VIP setup is the goal. So my clients only know the VIP(s), not the real IPs of the MySQL Servers.

Setup: I have two machines. Each machine runs keepalived and MySQL. Each machine has 2 NICs. eth0 going to the switch, eth1 connecting SRV1 and SRV2. My setup looks like this:

Intranet
|
|
##SWITCH##
|	|
|	|
|	|
SRV1---SRV2

Clients connect through the switch, replication is done over the direct gigabit connection between SRV1 and SRV2.

SRV1 IPs:
eth0 10.6.10.20
eth1 10.250.250.20
SRV2 IPs:
eth0 10.6.10.21
eth1 10.250.250.21

Virtual Services: I need two VIPs, as I want write-queries to go to SRV1, and read-queries to go to SRV2 - just as in a normal replication setup, for loadbalancing purposes. Note that it is not keepalived or LVS that does the loadbalancing here, as each virtual service only has one realserver and one sorry-server! "Loadbalancing" is just the writing-to-the-database software connecting to one server and the reading-from-the-database software connecting to another server.

10.6.10.24:3306
SRV1 (MASTER state for this VIP)
Realserver: 127.0.0.1:3306
Sorryserver: 10.250.250.21:3306
SRV2 (BACKUP state for this VIP)
Realserver 10.250.250.20:3306
Sorryserver: 127.0.0.1:3306

10.6.10.240:3306
SRV1 (BACKUP state for this VIP)
Realserver 10.250.250.21:3306
Sorryserver: 127.0.0.1:3306
SRV2: (MASTER state for this VIP)
Realserver: 127.0.0.1:3306
Sorryserver: 10.250.250.20:3306

So this is basically the "localhost"-feature, plus one sorryserver per virtual service.

Failover: If one of the eth0 network connections fails, the VIP moves to the other director, but connections still get directed to the same MySQL server, so the MySQL loadbalancing still works. If MySQL fails on one machine, connections are redirected to the other server's eth1-IP (10.250.250.2[01]). In order to be able to route that traffic back over the director it came from, there are ip rules on each server:

------------------------------
- SVR1 ip rules and routing: -
------------------------------

cat /etc/iproute2/rt_tables
2 mysqlrouting
...

ip rule show
...
32765:  from 10.250.250.20 lookup mysqlrouting
...

ip route show table mysqlrouting
default via 10.250.250.21 dev eth1

Setup-steps for this:
echo "2 mysqlrouting" > /tmp/rt_tables
cat /etc/iproute2/rt_tables >> /tmp/rt_tables
cp /tmp/rt_tables /etc/iproute2/rt_tables
rcnetwork restart
ip rule add from 10.250.250.20 table mysqlrouting
ip route add default via 10.250.250.21 dev eth1 table mysqlrouting

------------------------------
- SVR2 ip rules and routing: -
------------------------------
cat /etc/iproute2/rt_tables
2 mysqlrouting
...

ip rule show
...
32765:  from 10.250.250.21 lookup mysqlrouting
...

ip route show table mysqlrouting
default via 10.250.250.20 dev eth1

Setup-steps for this:
echo "2 mysqlrouting" > /tmp/rt_tables
cat /etc/iproute2/rt_tables >> /tmp/rt_tables
cp /tmp/rt_tables /etc/iproute2/rt_tables
rcnetwork restart
ip rule add from 10.250.250.21 table mysqlrouting
ip route add default via 10.250.250.20 dev eth1 table mysqlrouting

Configuration files

------------------------------------
- keepalived configuration on SRV1 -
------------------------------------

! Configuration File for keepalived

global_defs {
    notification_email { [email protected] }
    notification_email_from [email protected]
    smtp_server 10.2.20.6
    smtp_connect_timeout 30
    lvs_id TEST-MYSQL-1
}

vrrp_sync_group test_mysql_one {
         group {
                 vip_mysql_one
         }
}

vrrp_sync_group test_mysql_two {
         group {
                 vip_mysql_two
         }
}

vrrp_instance vip_mysql_one {
     state MASTER
     interface eth0
     virtual_router_id 51
     priority 100
     advert_int 1
     authentication {
         auth_type PASS
         auth_pass 12345
     }
     virtual_ipaddress {
         10.6.10.24/24 brd 10.6.10.255 dev eth0
     }
}

vrrp_instance vip_mysql_two {
     state BACKUP
     interface eth0
     virtual_router_id 52
     priority 10
     advert_int 1
     authentication {
         auth_type PASS
         auth_pass 12345
     }
     virtual_ipaddress {
         10.6.10.240/24 brd 10.6.10.255 dev eth0
     }
}

virtual_server 10.6.10.24 3306 {
     delay_loop 6
# lb_algo is actually not important, as we have only one real_server
     lb_algo wlc
     lb_kind NAT
     nat_mask 255.255.255.0
     protocol TCP
     real_server 127.0.0.1 3306 {
         TCP_CHECK {
                 connect_port 3306
                 connect_timeout 30
         } #TCP_CHECK
     }
     sorry_server 10.250.250.21 3306
}

virtual_server 10.6.10.240 3306 {
     delay_loop 6
# lb_algo is actually not important, as we have only one real_server
     lb_algo wlc
     lb_kind NAT
     nat_mask 255.255.255.0
     protocol TCP
     real_server 10.250.250.21 3306 {
         TCP_CHECK {
                 connect_port 3306
                 connect_timeout 30
         } #TCP_CHECK
     }
     sorry_server 127.0.0.1 3306
}

------------------------------------
- keepalived configuration on SRV2 -
------------------------------------

! Configuration File for keepalived

global_defs {
    notification_email { [email protected] }
    notification_email_from [email protected]
    smtp_server 10.2.20.6
    smtp_connect_timeout 30
    lvs_id TEST-MYSQL-2
}

vrrp_sync_group ACDDB_mysql_one {
         group {
                 vip_mysql_one
         }
}

vrrp_sync_group ACDDB_mysql_two {
         group {
                 vip_mysql_two
         }
}

vrrp_instance vip_mysql_one {
     state BACKUP
     interface eth0
     virtual_router_id 51
     priority 100
     advert_int 1
     authentication {
         auth_type PASS
         auth_pass 12345
     }
     virtual_ipaddress {
         10.6.10.24/24 brd 10.6.10.255 dev eth0
     }
}

vrrp_instance vip_mysql_two {
     state MASTER
     interface eth0
     virtual_router_id 52
     priority 100
     advert_int 1
     authentication {
         auth_type PASS
         auth_pass 12345
     }
     virtual_ipaddress {
         10.6.10.240/24 brd 10.6.10.255 dev eth0
     }
}

virtual_server 10.6.10.24 3306 {
     delay_loop 6
# lb_algo is actually not important, as we have only one real_server
     lb_algo wlc
     lb_kind NAT
     nat_mask 255.255.255.0
     protocol TCP
     real_server 10.250.250.20 3306 {
         TCP_CHECK {
                 connect_port 3306
                 connect_timeout 30
         } #TCP_CHECK
     }
     sorry_server 127.0.0.1 3306
}

virtual_server 10.6.10.240 3306 {
     delay_loop 6
# lb_algo is actually not important, as we have only one real_server
     lb_algo wlc
     lb_kind NAT
     nat_mask 255.255.255.0
     protocol TCP
     real_server 127.0.0.1 3306 {
         TCP_CHECK {
                 connect_port 3306
                 connect_timeout 30
         } #TCP_CHECK
     }
     sorry_server 10.250.250.20 3306
}

As MySQL requires some specific configuration, I will briefly post the relevant parts but not go into detail here, because it is actually off-topic for this list. Read the MySQL documentation for further detail if you do not understand the configuration parts below: http://dev.mysql.com/doc/refman/5.0/en/replication.html

-------------------------------
- MySQL configuration on SRV1 -
-------------------------------

log-bin=mysql-bin
log-slave-updates

server-id       = 5000

auto_increment_increment=2
auto_increment_offset=1

master-host     =   10.250.250.21
master-user     =   replication
master-password =   replication
master-port     =   3306

-------------------------------
- MySQL configuration on SRV2 -
-------------------------------

log-bin=mysql-bin
log-slave-updates

server-id       = 5001

auto_increment_increment=2
auto_increment_offset=2

master-host     =   10.250.250.20
master-user     =   replication
master-password =   replication
master-port     =   3306

On failover, there is no connection-sync, so every client has to re-connect. Connection-sync is imho not possible in this setup, as real-servers are different on SRV1 and SRV2.

This example is on VIP1: if MySQL fails on SRV1, SRV2 will be used. When SRV1 comes back up, keepalived will immediately switch back to SRV1, which would send clients to a MySQL server that may not have up-to-date data. As I could not find a way to define a delay before the real_server is added back in, I wrote a MySQL start script (posted below). It blocks port 3306 on the loopback interface so that the healthcheck for the local MySQL fails, starts MySQL, waits for replication to catch up with the new data from SRV2, and then unblocks port 3306 so the healthcheck can pass again. After that successful healthcheck, the real_server is inserted by keepalived and clients should see up-to-date data.

Another thing to be aware of in such a setup: when SRV1 has crashed and comes back up, clients that still hold connections to SRV2 will not receive anything from SRV2, so those socket connections will still be "thought of" as OK. In order to tell clients they are not OK, I additionally set wait_timeout=30 in my.cnf.
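For reference, wait_timeout is an ordinary mysqld variable; a minimal my.cnf fragment for it would look like:

[mysqld]
wait_timeout = 30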

#!/bin/bash
# /etc/init.d/rc.mysql
# This needs some variables to be set according to your system
# (example values below - adjust paths and options for your setup).
# It does not yet feature any checks, i.e. if SRV2 MySQL is
# available etc.
# It just basically shall give an idea of what to do

# Example values - adjust for your system
IPTABLES=/usr/sbin/iptables
MYSQL=/usr/local/mysql/bin/mysql
CONNECT_LOCAL=""          # options for connecting to the local mysqld (user, socket, ...)
ECHO=echo
GREP=grep
CUT=cut
SLEEP=sleep

# Make healthcheck fail
$IPTABLES -A INPUT -i lo -p tcp --dport 3306 -j REJECT

# start mysql
/usr/local/mysql/mysql.server start

# start values (unequal, so the loop below is entered at least once)
READ_POS=1
EXEC_POS=2
I=0

$ECHO
$ECHO -n "Waiting for Replication "
while ( test "$READ_POS" != "$EXEC_POS" )
do
	# Pro text string processing :p
	# Be sure to check proper values for READ_POS
	# and EXEC_POS
	# This worked for my MySQL 5.0.20, though
         POS=`$ECHO "show slave status\G"|$MYSQL $CONNECT_LOCAL|$GREP "Master_Log_Pos"|$GREP -o -E "[0-9]*$"`
         READ_POS=`$ECHO $POS | $CUT -d\  -f1`
         EXEC_POS=`$ECHO $POS | $CUT -d\  -f2`

         $ECHO -n .

	# Output every ~10 sec
         if ( test "`$ECHO $(($I%10))`" -eq "0" )
         then
                 $ECHO
                 $ECHO "READ_POS $READ_POS"
                 $ECHO "EXEC_POS $EXEC_POS"
         fi
         I=$(($I+1))
         $SLEEP 1
done

# Make healthcheck succeed
$IPTABLES -D INPUT -i lo -p tcp --dport 3306 -j REJECT

29.11. Failover of large numbers (say 1024) of VIPs

The problem:

If you have a large number of VIPs, they can take a while to fail over. I got an e-mail from someone with 200 VIPs whose setup takes 2-3 minutes to bring up the 200 VIPs with ifconfig on the director assuming the master role (and to take them down on the one assuming the backup role). I couldn't imagine Horms putting up with this, so I did some tests to confirm the problem and e-mailed him.

Ratz

No-one should be using ifconfig anymore. You're lucky if the ip address gets set up in the first place when you're in a hurry. So ip addr add... is the key or in linux-ha parlance, IPaddr2.

Christian Bronk chbr (at) webde (dot) de 27 Feb 2006

While they still use ifconfig they have to serialize the startups, because with ifconfig each address must go on a different alias interface. If they then send two arp broadcasts with send_arp (with a delay of 1s), the complete server takeover will last 3 minutes. To make this faster they have to rewrite their code to use ip from the iproute2 package and try to bring up the IPs in parallel.

Joe

ISPs are quite restrictive about allocating blocks of IPs, so if you have a large number of IPs, they're likely all in the same block. As a test (on a 200MHz machine) I ran this loop of 4*252 addresses.

#!/bin/bash
ethernet=eth1
echo "up"

# Bring the addresses up. Uncomment one of the first two (commented) variants
# and comment out the plain "ip addr add" line to test the other timings
# shown in Table 1 below.
for j in `seq 0 3`
do
   for i in `seq 2 254`
   do
     addr=176.0.${j}.${i}
     #ip addr add dev $ethernet ${addr}/32 brd + label ${ethernet}:n${i}ip${j} \
     #    && /usr/lib/heartbeat/send_arp $ethernet $addr 00:d0:b7:82:b3:c0 176.0.3.255 00:a0:cc:66:22:22 &
     #ip addr add dev $ethernet ${addr}/32 brd + label ${ethernet}:n${i}ip${j} &
     ip addr add dev $ethernet ${addr}/32 brd + label ${ethernet}:n${i}ip${j}
   done
done

echo "down"

# Take the addresses down again, in the background
for j in `seq 0 3`
do
   for i in `seq 2 254`
   do
     addr=176.0.${j}.${i}
     ip addr del dev $ethernet ${addr}/32 &
   done
done

Here are the times for 1008 addresses on a 200MHz machine (a bit slower than current production directors). Putting the processes into the background doesn't help a lot; presumably forking is as expensive as send_arp.

Table 1. Time,sec to bring 1008 IPs up and down on a 200MHz machine

job type                  time, sec
ip addr                   12
ip addr &                 10
ip addr; send_arp         30
ip addr && send_arp &     28

There are two solutions here:

  • William Olson's dynamic routing
  • Horms' method, where he only fails over 1 VIP with heartbeat and lets static routing handle the other VIPs by routing through the failover VIP.

William Olson ntadmin (at) reachone (dot) com 24 Feb 2006

In our previous load balancer configs (scripts, then later ldirectord + HA) we experienced the same time lags during failover situations (e.g. stopping heartbeat on the master). Our systems were 700+ MHz Dell servers with at least 512MB RAM. They operated as a Master/Backup pair (NAT), each with 2 NICs (one for the external and one for the internal network). The haresources file was used to start and stop ospfd and run IPaddr2 for each of the at least 200 VIPs. You could literally count the seconds between one VIP going up/send_arp and the next. We have since switched to keepalived, which has alleviated this problem.

During a failover, while tailing the messages file, you could watch each successive ip addr and send_arp (IPaddr2). When a failover happened, all IPs would be brought down on the former master almost instantaneously and slooowly come back up on the backup (now master) director. It seemed to me that the issue was the time it took to actually execute the scripts in the haresources file, as using ip addr and send_arp directly gave very quick times on these same systems.

We're running ospfd on the directors and router(s). It was an original requirement sent down by our network admin to have dynamic routing on all internal routers. These days, it just seems better to go with what has been working rather than to redesign the whole system. We could probably be just as well off without the ospfd part of the picture; however, it's working now and true to specification, so it's pretty easy for us to troubleshoot.

ospfd seemed like overkill to me when I was originally designing the system; however, the dictates of the net admin overrode my input. Now we're operating with an acceptable failover time, so I'm inclined to stay with ospfd. It's now working with ospfd running on the directors (always running, regardless of director state) and routers, with keepalived managing the LVS and failover on the directors.

Initial tests of the new keepalived systems are resulting in 15sec or less failover times independent of the number of IP addresses.

Note

Joe: dynamic routing can take 30-90sec to find new routes: see Dynamic Routing to handle loss of routing in directors.

Horms

The thing that takes time is heartbeat sending out the gratuitous arps. If you combine this idea with a fwmark virtual service (bunching all VIPs into one fwmark), then everything should be quite fast to fail over. But even if you don't use a fwmark virtual service, things should be quite snappy up to 1000 addresses or so. (I made that number up :)

The route command uses the same format as for setting a default route, except that you route a smaller set of addresses to an alternate place.

Or in other words:

ip route add $VNET/N via $VIP

or

route add $VNET netmask w.x.y.z gw $VIP

Of course you can have a bunch of these statements if the addresses don't fit nicely into one address block, though it's better to try to avoid having one per address.

Joe

Do you mean "bunch all the VIPs into a single fwmark"? If so, that was my first thought for a solution, but you can't route from outside to a fwmark; you still need all the IPs to have arp'ed and the router to know where to send the packets. Or did you mean something else?

Horms

"something else". :-)

fwmark is half the solution. It allows LVS to efficiently handle a large number of virtual services. But it's only half the solution, because you still need to get the packets to the linux director. The other half of the solution is routing. (Note: the halves can be used by themselves if need be.) (For more info on the routing used, see routing to a director without a VIP.) With routing you don't need to ARP each individual VIP. You just need to make sure that each box on the local network knows that the VIPs go via the address that is being managed by heartbeat (or other means like keepalived).
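As a rough sketch of the fwmark half (the addresses, mark value and realservers below are made-up illustrations, not taken from Horms' setup): mark everything destined to the VIP block with one fwmark and create a single fwmark virtual service for it.

# mark all web traffic for the whole VIP block with fwmark 1
iptables -t mangle -A PREROUTING -d 10.192.0.0/22 -p tcp --dport 80 \
        -j MARK --set-mark 1

# one LVS virtual service for all the VIPs, keyed on the fwmark
ipvsadm -A -f 1 -s rr
ipvsadm -a -f 1 -r 192.168.200.2 -m   # realserver, LVS-NAT
ipvsadm -a -f 1 -r 192.168.200.3 -m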

Here is an example.

Let's say the network that the linux-director lives on is 10.130.0.0/24, and let's say you want 1024 virtual IP addresses, say the block 10.192.0.0/22 (10.192.0.0 - 10.192.3.255). All you need to do is give heartbeat an address inside 10.130.0.0/24 to manage, say 10.130.0.192, and tell the gateway to route 10.192.0.0/22 via 10.130.0.192. Any host in the network will send packets for 10.192.0.0/22 to the gateway, which will in turn redirect them to 10.130.0.192, which is the linux-director, and all is well. Any host outside the network will also end up sending packets via the gateway, and it will duly forward them to 10.130.0.192; again the linux-director gets them and all is well. When a failover occurs, as long as 10.130.0.192 is handled using gratuitous arp (or whatever), the gateway will know about it, and packets for all of 10.192.0.0/22 will end up on the new linux-director.

If the local network happened to include all the VIPs, say because you had 10.0.0.0/8 on your LAN, then each host would need to know to send 10.192.0.0/22 via 10.130.0.192, which is a bit of a hassle, but still not a particularly big deal.
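Putting Horms' example into commands (a sketch assuming a Linux box as the gateway; a hardware router would express the same route in its own syntax):

# on the gateway: send the whole VIP block via the heartbeat-managed address
ip route add 10.192.0.0/22 via 10.130.0.192

# on the active linux-director: the single address that heartbeat manages
# (and gratuitously ARPs on failover)
ip addr add 10.130.0.192/24 dev eth0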

29.12. Some vrrpd setup instructions

Alexandre CASSEN alexandre (dot) cassen (at) wanadoo (dot) fr 29 Jul 2002

Here is a detailed setup for LVS-HA using a VRRP setup.

1. Topology description

In a "standard" design, when you are playing with a LVS/NAT setup, then you need 2 IP classes. Consider the following sketch :

                 +---------------------+
                 |      Internet       |
                 +---------------------+
                            |
                            |
                       eth0 | 192.168.100.254
                 +---------------------+
                 |       LVS Box       |
                 +---------------------+
                       eth1 | 192.168.200.254
                            |
              --------------+-------------
              |                          |
              | 192.168.200.2            | 192.168.200.3
         +------------+           +------------+
         | Webserver1 |           | Webserver2 |
         +------------+           +------------+

So you have two networks defining the LVS box's two segments: 192.168.100.x for the WAN segment and 192.168.200.x for the LAN segment.

For the LVS loadbalancing, we want to define a VIP 192.168.100.253 that will loadbalance traffic to both 192.168.200.2 and 192.168.200.3.

For the LVS box HA we use a VRRP setup with a floating IP to handle director takeover. When playing with LVS-NAT and VRRP, you need two VRRP instances, one for the WAN segment and one for the LAN segment. To keep the routing path consistent, we define a VRRP synchronization group over these two VRRP instances, to be sure that both instances always have the same state.

2. VRRP Configuration description

vrrp_sync_group G1 {   # must be before vrrp_instance declaration
  group {
    VI_1
    VI_2
  }
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 100
    authentication {
      auth_type PASS
      auth_pass nenad
    }
    virtual_ipaddress {
        192.168.100.253   # default CIDR mask is /32
    }
}

vrrp_instance VI_2 {
    interface eth1
    state MASTER
    virtual_router_id 52
    priority 100
    authentication {
      auth_type PASS
      auth_pass nenad
    }
    virtual_ipaddress {
        192.168.200.253
    }
}

This configuration will set IP 192.168.100.253 on eth0 and 192.168.200.253 on eth1

3. LVS Configuration description

In order to get HA, we use the VRRP VIP as the LVS VIP, so the LVS configuration will be:

virtual_server 192.168.100.253 80 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP
    sorry_server 192.168.200.254 80
    real_server 192.168.200.2 80 {
        weight 1
        HTTP_GET {
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.200.3 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3   # By default the connection port is the service port
        }
    }
}

=> VRRP IP 192.168.100.253 will loadbalance traffic to both realservers.

4. Realservers Configuration description

And finally, the only thing missing in our configuration is the realservers' default gateway... This is why we define a VRRP instance for the LAN segment. So:

The realservers' default gateway MUST be the VRRP VIP of the LAN segment = 192.168.200.253
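On each realserver this is just the ordinary default route, e.g. (a sketch; substitute your own interface, and use "ip route replace" if a default route already exists):

# on Webserver1 and Webserver2: default route via the VRRP LAN VIP
ip route add default via 192.168.200.253 dev eth0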

5. Keepalived summary configuration

! Configuration File for keepalived

global_defs {
   lvs_id lvs01
}

vrrp_sync_group G1 {   # must be before vrrp_instance declaration
  group {
    VI_1
    VI_2
  }
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER
    virtual_router_id 51
    priority 100
    authentication {
      auth_type PASS
      auth_pass nenad
    }
    virtual_ipaddress {
        192.168.100.253   # default CIDR mask is /32
    }
}

vrrp_instance VI_2 {
    interface eth1
    state MASTER
    virtual_router_id 52
    priority 100
    authentication {
      auth_type PASS
      auth_pass nenad
    }
    virtual_ipaddress {
        192.168.200.253
    }
}

virtual_server 192.168.100.253 80 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP

    sorry_server 192.168.200.254 80

    real_server 192.168.200.2 80 {
        weight 1
        HTTP_GET {
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }

    real_server 192.168.200.3 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3   # By default the connection port is the service port
        }
    }
}

6. Keepalived summary configuration on the BACKUP node

! Configuration File for keepalived
global_defs {
   lvs_id lvs02
}
vrrp_sync_group G1 {   # must be before vrrp_instance declaration
  group {
    VI_1
    VI_2
  }
}
vrrp_instance VI_1 {   # We just change state & priority
    interface eth0
    state BACKUP
    virtual_router_id 51
    priority 50
    authentication {
      auth_type PASS
      auth_pass nenad
    }
    virtual_ipaddress {
        192.168.100.253
    }
}
vrrp_instance VI_2 {
    interface eth1
    state BACKUP
    virtual_router_id 52
    priority 50
    authentication {
      auth_type PASS
      auth_pass nenad
    }
    virtual_ipaddress {
        192.168.200.253
    }
}
virtual_server 192.168.100.253 80 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP
    sorry_server 192.168.200.254 80
    real_server 192.168.200.2 80 {
        weight 1
        HTTP_GET {
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.200.3 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3   # By default connection port is service port
        }
    }
}

7. LVS-HA scenario

Now run all this on both directors and simulate a crash, for example by unplugging the wire on eth0 of LVS1.

Detecting this failure, VRRP will take over the eth0 instance on LVS2 and, through the sync group, the eth1 instance as well. So all traffic will run through LVS2.

This is a typical active/passive scenario.

If you want to extend this configuration to an active/active configuration, you need to add MASTER VRRP instances on LVS2. An active/active configuration consists of segmenting the realserver pool: you create two realserver pools (in the same IP range), each with a different default gateway, which will be a new VRRP LAN VIP. => This part will be described in more depth in the documents I will write soon :)

29.13. Filter rules for vrrpd broadcasts

If you want to filter (allow) the vrrpd broadcasts, here's the recipe

Sebastien BONNET sebastien (dot) bonnet (at) experian (dot) fr

It's PROTOCOL 112 (vrrp), not PORT 112. You also need protocol igmp (don't ask why). You have to allow both incoming and outgoing adverts:

-A INPUT -j ACCEPT -i eth0 -p vrrp -s X.Y.Z.0/24 -d 224.0.0.0/8
-A INPUT -j ACCEPT -i eth0 -p igmp -s X.Y.Z.0/24 -d 224.0.0.0/8

-A OUTPUT -j ACCEPT -o eth0 -p vrrp -s X.Y.Z.0/24 -d 224.0.0.0/8
-A OUTPUT -j ACCEPT -o eth0 -p igmp -s X.Y.Z.0/24 -d 224.0.0.0/8

To be more precise, a tcpdump shows the multicast address is 224.0.0.18, if you want to be more restrictive. Don't forget to allow the traffic needed by keepalived to test your realservers. In my case, it looks like this:

-A INPUT -j ACCEPT -i eth0 -p tcp --dport http  -m state --state NEW

-A OUTPUT -j ACCEPT -o eth0 -p tcp -m state --state NEW --dport http -d 10.11.0.0/16
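A more restrictive version of the VRRP rules above, using the 224.0.0.18 address from the tcpdump (interface and source network are placeholders, as in Sebastien's rules):

-A INPUT  -j ACCEPT -i eth0 -p vrrp -s X.Y.Z.0/24 -d 224.0.0.18
-A OUTPUT -j ACCEPT -o eth0 -p vrrp -s X.Y.Z.0/24 -d 224.0.0.18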

Noc Phibee

For the vrrp protocol, how should I configure shorewall? When my group changes state I want to restart Shorewall, so I have used the notify_* scripts. When my MASTER is dead, the BACKUP changes state (good), but when the MASTER is alive again and gets the VIP, it runs the same script (restart of shorewall). Does anyone have an idea why it doesn't immediately change state?

Graeme Fowler graeme (at) graemef (dot) net 23 Aug 2006

You must allow packets from/to network 224.0.0.0/8. If you want to control this a bit more accurately, define mcast_src_ip in your keepalived.conf for each defined vrrp_instance, and set your filters accordingly. Firstly, it looks like the Master is receiving the announcements from the Backup. This is good. The Backup is also receiving packets from the Master, which is also good - this is why the Backup flip-flops from BACKUP to MASTER to BACKUP state continuously.

However - something else is happening here, and I expect it's your Shorewall config. Ignoring the Master machine for a moment, let me put forward a possible reason:

The Backup machine starts up, brings up keepalived, and goes into BACKUP state. Shorewall is dropping packets at this point, so the Backup machine goes to MASTER state, does things to Shorewall with the notify script, and starts to accept packets. It then receives an advertisement from the Master director, so it switches to BACKUP state, changes the Shorewall config back, misses an advertisement, switches to MASTER, changes the firewall, misses an advertisement, and so on.

Assuming this is correct, there are several things you need to do:

  • Make sure the Shorewall config isn't dropping the packets you want (see the suggestions above).
  • Put your notify* script actions into your vrrp_sync_group block instead of the vrrp_instance, as sketched below. That way they will only fire once, when the group changes state, rather than once for every instance state change *and* once for the group.
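A minimal sketch of the second point (the script paths are placeholders; check that your keepalived version accepts notify_* inside a vrrp_sync_group):

vrrp_sync_group G1 {
    group {
        VI_1
        VI_2
    }
    # fired once per group state change, instead of once per instance
    notify_master "/etc/keepalived/to_master.sh"
    notify_backup "/etc/keepalived/to_backup.sh"
    notify_fault  "/etc/keepalived/to_fault.sh"
}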

Graeme Fowler graeme (at) graemef (dot) net 09 Oct 2006

You need

iptables -I INPUT -d 224.0.0.0/8 -j ACCEPT

You need to explicitly accept multicast for this to work. You can make it more precise by setting mcast_src_ip in your keepalived config, and then having a corresponding rule to let that source address in.
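For example (a sketch; the addresses are placeholders, and mcast_src_ip is the keepalived option referred to above):

vrrp_instance VI_1 {
    interface eth0
    mcast_src_ip 10.130.0.1    # placeholder: this director's own address
    # ... rest of the instance as usual
}

# and accept adverts sourced from the peer director (placeholder address)
iptables -I INPUT -i eth0 -p vrrp -s 10.130.0.2 -d 224.0.0.18 -j ACCEPT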

29.14. Vinnie's comparison between ldirectord/heartbeat and keepalived/vrrpd

Vinnie listacct1 (at) lvwnet (dot) com 26 Apr 2003

I set up both ldirectord and keepalived up to try them out and see which I liked better. The goal was to have a redundant pair of LVS(-NAT) directors which would also serve as the primary firewall/gateway for our external connection and DMZ hosts. My firewall scripts use heavy stateful inspection, iproute2 ip utils to add/remove ip's and routes, and proxy arp. I did not want to lose any of the features of the firewall setup if at all possible.

These are my observations, but as others have said, both are being heavily developed by their authors, and anything I say here could be obsolete in short time. I think they are both great packages.

  • LDIRECTORD/HEARTBEAT

    Using ldirectord, you need heartbeat to handle the failover/redundancy capabilities. The modular approach is a good idea, and they have developed resource monitoring which goes well beyond just checking if a realserver responds to a connection/request on a certain port/service.

    I set up ldirectord first, on a single director, since this would get the high availability of my realservers working. I used my firewall script to add the VIP's to the director (with iproute2 ip commands), and added a section to the script to create the virtual services with ipvsadm commands. (I knew this wouldn't be necessary when/if I got heartbeat set up).

    Ldirectord works quite well, and it apparently has the ability to do a basic UDP check (since stopping named on one of the realservers causes ldirectord to remove that RS from the 53/udp virtual service until named is started back up).

    I was disappointed to see that heartbeat (particularly the ipfail part) is still written to use ifconfig and old-style ip "aliases" (i.e. eth0:0, eth0:1, etc.) to put multiple IPs on an interface. 2.2.x and ipchains are a bit "passe" - 2.4.x and iptables have been stable/production for quite some time now. iptables does not like interface names with ":" in them, so you are imposing pretty stiff limits on the kinds of firewall rule sets you can write if you use old-style ip aliases.

    Reading the mailing list archives, it looks like some users have started submitting patches that will cause heartbeat to use the iproute utils to set up the interfaces instead, but this had not been incorporated into the latest-available beta (at that time, 1st half of Apr. '03), and I was not sure if the modifications were comprehensive in scope, or just addressing certain aspects. (My programming skills are pretty weak).

    Note
    Joe, Dec 2003, Linux-HA has been rewritten to use the Policy Routing tools.

    This pretty much nixed any further looking into heartbeat for me, so I started looking at keepalived.

  • KEEPALIVED

    Keepalived is an all-in-one package, which is written in C.

    It uses 2.4-native netlink functions to set up interfaces, IP addresses, and routes on 2.4 boxes, so it is no big deal to have multiple IP's on one interface. You can use iptables commands which match a single interface to cover all the IP's on the interface, or you can add a -d one.of.my.vips to make that rule match a single VIP, subnet, etc.

    Keepalived uses VRRPv2 to handle director failover, and it's really nice. When failover happens, the new master sends gratuitous arps out on the network, so virtual services experience essentially no interruption (especially since keepalived also supports IPVS connection synchronization between directors).

    There is currently an issue with HEAVY syslog activity when a pair of directors are running (it logs the election process on both directors) but Alexandre is working on that.

    If you're running proxy-arp on the director, you can use keepalived's ability to run scripts when a machine becomes master to send unsolicited arps (with arping) for the other hosts in your DMZ and for your ISP's gateway. That way the other hosts in your DMZ with routable IPs on their interfaces (which only need the director as a router/firewall) are also updated with the new master's MAC address. Keepalived doesn't send gratuitous arps for IPs it didn't take over, so this is needed for your DMZ hosts to see the ISP's gateway, and also for the ISP's gateway to be apprised of the new (remember, proxy arp!) MAC address of those other DMZ hosts. (I am currently working on a HOWTO for this, which will apparently be added to the keepalived online documentation, and also to our website.)

    Keepalived currently does not support UDP connection check, but it is on Alexandre's to-do list.

    Another feature of keepalived (which I haven't looked at yet) is the virtual_server_group capability, which allows you to group virtual services together and have their health check pass/fail determined by a single connection check - good, for example, if you have a stack of IP-based apache virtual hosts on a realserver. You probably don't need to check each virtual host's IP, and you don't want to flood the realserver with health checks.

I think they're both really great packages. If heartbeat were updated to use iproute2 utils, instead of ifconfig and interface aliases to have multiple IP's per interface, it would be much more viable for people running strong iptables firewall rulesets, such as those who wish to use the director as a firewall/gateway.

Me personally, I'm going to keep running keepalived. It also has a lower CPU overhead.

29.15. Saru: All directors active at the same time

Horms has written code allowing all directors to be active at the same time and an improved syncd. The code is at http://www.ultramonkey.org/papers/active_active/.

Horms 18 Feb 2004

Saru means monkey in Japanese. It's the work I did in Japan on Ultra Monkey... well, some of it anyway. The kanji is in the original paper I wrote for Saru (http://www.ultramonkey.org/papers/active_active/active_active.shtml). Here are the Google links for looking up the meaning of "saru" in Japanese:

http://www.google.co.jp/search?hl=ja&ie=UTF-8&oe=UTF-8&q=%E3%82%B5%E3%83%AB&btnG=Google+%E6%A4%9C%E7%B4%A2&lr=
http://www.google.co.jp/search?q=%E7%8C%BF&ie=UTF-8&oe=UTF-8&hl=ja&btnG=Google+%E6%A4%9C%E7%B4%A2&lr=

Horms horms (at) verge (dot) net (dot) au 16 Feb 2004

Saru has nothing to do with connection synchronisation, which is what syncd does. Saru provides a mechanism to allow you to have Active-Active Linux Directors. Syncd (and other synchronisation daemons) synchronise connections, allowing them to continue even if the Linux Director that is handling them fails (assuming that there is another Linux Director available for them to fail-over to). If you are using Saru then you probably want to use connection synchronisation, but the reverse is not necessarily true.

This paper (http://www.ultramonkey.org/papers/lvs_jan_2004/) should cover how Connection Synchronisation and Active-Active work together. It briefly covers the relevant parts of Alexandre's syncd patches and how they interact with Saru (as I understand the extensions, anyway).

Here's a short explanation of how Saru works.

Francisco Gimeno kikov (at) kikov (dot) org 11 Nov 2006

The directors:

  • All director nodes know about each other
  • Each director has an ID (for example, the MAC or the IP)
  • Each director node can build a sorted list based on that ID
  • Heartbeat runs everywhere, so the list is dynamic
  • The "view" of each node should be the same for each node (i.e. all nodes should have the same list)
  • They should have a virtual MAC. These requirements could be satisfied with a broadcast sync protocol (it could be similar to WCCP, for example)

For each arriving packet:

  • Make a hash of the parameters you want to keep the __affinity__ to (like src IP, dst IP, ports, ...).
  • Calculate (HASH % number_of_nodes) (% := modulo).
  • If that value is this node's position in the list, the packet is accepted; if not, it is discarded. Every packet goes to every director...

So one of the most important things here is that no director ever puts the virtual MAC on the wire, because every director has to receive the packets. ARP responses for the VIP should advertise the virtual MAC, but should be sent from a bogus source MAC. With that, whatever routes packets to the VIP will send them to that virtual MAC, and since the switch (L2) doesn't know which physical port that MAC lives on, it floods the packet to all active ports without that MAC associated - which are the directors'. If you use a hub there is no such problem at all (but who owns a hub nowadays?).
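A toy sketch of the accept/drop decision described above (this only illustrates the hash % number_of_nodes rule, not Saru's actual code; the names and values are made up):

#!/bin/bash
NODE_ID=1        # this director's position in the sorted ID list (0-based)
NUM_NODES=3      # number of live directors, maintained by the heartbeat

accept_packet() {
    local src_ip=$1 src_port=$2
    # hash over the fields you want affinity on (here: source IP and port)
    local hash=`echo -n "${src_ip}:${src_port}" | cksum | cut -d' ' -f1`
    if [ $((hash % NUM_NODES)) -eq $NODE_ID ]; then
        echo accept          # this director handles the packet
    else
        echo drop            # some other director will accept it
    fi
}

accept_packet 192.0.2.7 40321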

Horms 13 Nov 2006

It is only active-active for the linux-directors, and it's not really supposed to be active-active for a given connection, just for a given virtual service. So different connections for the same virtual service may be handled by different linux-directors.

The real trick is that it isn't a trick at all. LVS doesn't terminate connections; it just forwards packets like a router. So it needs to know very little about the state of TCP (or other) connections. In fact, all it really needs to know is already handled by the ipvs sync code, and that is mainly just a matter of associating connections with realservers. Or in other words, the tuple enduser:port,virtual:port,real:port.

Ratz

I've read it now and I must say that you've pulled a nice trick :). I can envision that this technique works very well in the range of 1-2 Gbit/s for up to 4 or so directors. For higher throughput, netfilter and the time delta between saru updating the quorum and the effective rule being in place (synchronised on all nodes) might exceed the packet arrival interval. We/I could do a calculation if you're interested, based on packet size and arrival rate on an n-Gbit/s switched network. You're setting rules for ESTABLISHED in your code to accept packets by lookup of the netfilter connection tracking, and while the 2.4 kernel does not care much about window size and other TCP related settings, 2.6 will simply drop an in-flight TCP connection that is suddenly sent to a new host. There are two ways to overcome this problem on a 2.6 kernel: one is fiddling with ip_conntrack_tcp_be_liberal, ip_conntrack_tcp_loose and sometimes ip_conntrack_tcp_max_retrans, the other is checking out Pablo's work on netfilter connection tracking synchronisation.
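For reference, the conntrack knobs Ratz mentions are ordinary sysctls (a sketch; the names below are the ones exposed by the 2.6 ip_conntrack module, newer kernels use the nf_conntrack_* equivalents, and the values are only examples):

sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_be_liberal=1
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_loose=1
sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_max_retrans=3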

29.16. Server Load Balancing Registration Protocol

William V. Wollman wwollman (at) mitre (dot) org

We have worked with LVS to implement something we call the Server Load Balancing Registration Protocol. It allows a server to plug into a network with the LVS and register its services with the LVS. The LVS is then automatically configured to balance the registered services. We did not modify LVS directly, but created a Java program that processes the realserver registration messages and then automatically configures the LVS. The realserver requires the installation of a Java program to build the registration messages and register its services. One benefit is plug-and-play SLB with minimal administration. Another benefit is controlled configuration management.

A paper is available for download that describes the work we completed in a bit more detail. If interested please download and read the article at http://www.mitre.org/work/tech_papers/tech_papers_03/wollman_balancing/index.html.

29.17. using iproute2 to keep demons running during failover, while link is down

On failover, when the backup machine assumes the active role, it must bring up IP(s) and possibly start demons listening on those IP(s). With the Policy Routing tools (e.g. iproute2), the backup machine can have the IP configured and the demons listening, but with the link state down. On failover you just change the link state to "up".
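A minimal sketch of the idea (the addresses, interface and the demon's init script are placeholders; Ratz explains the underlying ip commands below):

intf=eth0
vip=192.168.100.253

ip link set dev $intf down          # link down; any configured addresses are kept
ip addr add $vip/24 dev $intf       # the VIP can be configured while the link is down
/etc/init.d/httpd start             # demon binds to the VIP although the link is down

# on failover, all the backup has to do is:
ip link set dev $intf up
# followed by a gratuitous arp (e.g. with send_arp) so the neighbours learn the new MAC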

Roberto Nibali ratz (at) drugphishi (dot) ch 24 Dec 2003

This will keep all assigned IP addresses for the interface.

ip link set dev $intf down

This will remove (flush) all IP addresses from the interface.

ip addr flush dev $intf

However

ifconfig $intf down

means:

set the link state of $intf down _and_ flush the IP address(es) of $intf.

Which is completely broken! With ifconfig you have no means to distinguish between flushing IP addresses and setting the link state of a physical interface. There's a huge difference routing-wise: setting the physical link layer down does _not_ disable routing table entries, while flushing an IP address _also_ removes its routing table entry, which can be annoying from a setup point of view and definitely irritating from a security viewpoint.

The reason why it is important to have two states of interface setup can for example be found in the security business. You set the link state to down, set up all packet filter rules and then configure all IP addresses and rules and routes. Then you start local daemons (and they will start even if they need to bind and listen to non-local IP addresses because the IP addresses and the routing is complete) _and_ after that you open your gates by setting the link state to up.