2. What is an LVS? Can I use an LVS?

A Linux Virtual Server (LVS) is a cluster of servers which appears to be one server to an outside client. This apparent single server is called here a "virtual server". The individual servers (realservers) are under the control of a director (or load balancer), which runs a Linux kernel patched to include the ipvs code. The ipvs code running on the director is the essential feature of LVS. Other user level code is used to manage the LVS (set rules for services handled, handle failover). The director is basically a layer 4 router with a modified set of routing rules (i.e. connections do not originate or terminate on the director, it doesn't send ACKs etc, it's just a router).

When a new connection is requested from a client to a service provided by the LVS (e.g. httpd), the director will choose a realserver for the client. From then on, all packets from the client will go through the director to that particular realserver. The association between the client and the realserver lasts only for the life of the tcp connection (or udp exchange). For the next tcp connection, the director will choose a new realserver (which may or may not be the same as the first realserver). Thus a web browser connecting to an LVS serving a webpage consisting of several hits (images, html page) may get each hit from a separate realserver.

Since the director will send the client to an arbitrary realserver, the services must either be read only (e.g. web services) or, if read/write (e.g. an on-line shopping cart), some mechanism external to LVS must be provided for propagating the writes to the other realservers on a timescale appropriate for the service (i.e. the purchase of an item must decrement the stock on all other nodes before the next client attempts to purchase the same item). At best LVS is read-mostly.

If you just want one of several nodes to be up at any one time, and the other node(s) to become active on failure of the primary node, then you don't need LVS: you need a high availability setup, e.g. Linux-HA (heartbeat), vrrp or carp.

If you want independent servers at different locations, then you want a geographically distributed server like Supersparrow.

Here are some rrd images showing load balancing

Management of the LVS is through the user space utility ipvsadm, which is used to add/remove realservers and services, choose the scheduler, and handle failover. LVS itself does not detect failure conditions; these are detected by external agents, which then update the state of the LVS through ipvsadm.
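As a sketch of what this management looks like (the VIP 192.168.1.110, the realserver addresses and the port are made-up examples; the monitoring agent that decides a realserver is dead is not shown):

ipvsadm -A -t 192.168.1.110:80 -s rr                    # create the virtual service (round robin)
ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:80 -m -w 1   # add two realservers (here LVS-NAT, -m)
ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.3:80 -m -w 1

# an external agent that decides 10.1.1.3 has failed removes it ...
ipvsadm -d -t 192.168.1.110:80 -r 10.1.1.3:80
# ... or quiesces it (weight 0: existing connections drain, no new ones are assigned)
ipvsadm -e -t 192.168.1.110:80 -r 10.1.1.3:80 -m -w 0

ipvsadm -L -n                                           # list the current tables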

2.1. What is a VIP?

The director presents an IP called the Virtual IP (VIP) to clients. (When using firewall mark (fwmark), VIPs are aggregated into groups of IPs, but the same principles apply as for a single IP.) When a client connects to the VIP, the director forwards the client's packets to one particular realserver for the duration of the client's connection to the LVS. This connection is chosen and managed by the director. The realservers serve services (e.g. ftp, http, dns, telnet, nntp, smtp) such as are found in /etc/services or inetd.conf.

Peter Martin p (dot) martin (at) ies (dot) uk (dot) com and John Cronin jsc3 (at) havoc (dot) gtf (dot) org 05 Jul 2001

The VIP is the address which you want to load balance i.e. the address of your website. The VIP is usually an alias (e.g. eth0:1) so that the VIP can be swapped between two directors if a fault is detected on one.
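For example (a sketch only; 192.168.1.110 is an assumed VIP and eth0 an assumed interface), the active director brings the VIP up as an alias, and the failover software runs the same command on the backup when it takes over:

ifconfig eth0:1 192.168.1.110 netmask 255.255.255.255 up   # VIP as an alias of eth0
# or, with iproute2
ip addr add 192.168.1.110/32 dev eth0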

The VIP is the IP address of the "service", not the IP address of any of the particular systems used in providing the service (i.e. the director and the realservers).

The VIP can be moved from one director to a backup director if a fault is detected (typically this is done by using mon and heartbeat, or something similar). The director can have multiple VIPs. Each VIP can have one or more services associated with it e.g. you could have HTTP/HTTPS balanced using one VIP, and FTP service (or whatever) balanced using another VIP, and calls to these VIPs can be answered by the same or different realservers.

Groups of VIPs and/or ports can be setup with firewall mark (fwmark).

The realservers have to be configured to work with the VIPs on the director (this includes handling The Arp Problem).
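As a sketch of one common way of handling this on LVS-DR realservers with kernels that have the arp_ignore/arp_announce sysctls (2.4.26+/2.6; older 2.4 kernels used the hidden patch instead - the VIP below is an assumed example):

# on each realserver: hold the VIP on lo, but never answer ARP for it
ip addr add 192.168.1.110/32 dev lo
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce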

There can be persistent connection issues if you are using cookies or https, or anything else that expects the realserver fulfilling the requests to have some connection state information. This is also addressed on the LVS persistence page.
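If every connection from a client must land on the same realserver (e.g. https with session state), the virtual service can be made persistent with ipvsadm; a sketch, with an assumed VIP and a 10 minute timeout:

# -p: send a returning client IP to the same realserver until the timeout (seconds) expires
ipvsadm -A -t 192.168.1.110:443 -s rr -p 600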

2.2. Where do you use an LVS?

  • For higher throughput. The cost of increasing throughput by adding realservers to an LVS increases linearly, whereas the cost of increasing the throughput of a single larger machine increases faster than linearly.
  • For redundancy. Individual machines can be switched out of the LVS, upgraded and brought back on line without interruption of service to the clients. Machines can be moved to a new site and brought on line one at a time, while machines are removed from the old site, without interruption of service to the clients.
  • For adaptability. If the throughput is expected to change gradually (as a business builds up), or quickly (for an event), the number of servers can be increased (and then decreased) transparently to the clients.

2.3. Client/Server relationship is preserved in an LVS

  • The client sees only one IP address and believes it is connecting to a single machine. The IPs of all the realservers are mapped to one IP (the VIP). The client is connected to only one machine at a time, but subsequent connections may be assigned to a different machine.
  • The realservers, at different IP addresses, believe they are contacted directly by the client.

2.4. LVS director is an L4 switch

In the computer bestiary, the director is a layer 4 (L4) switch. The director makes decisions at the IP layer and just sees a stream of packets going between the client and the realservers. In particular, an L4 switch makes decisions based on the IP information in the headers of the packets.

Here's a description of an L4 switch from the Super Sparrow Global Load Balancer documentation:

Layer 4 Switching: Determining the path of packets based on information available at layer 4 of the OSI 7 layer protocol stack. In the context of the Internet, this implies that the IP address and port are available, as is the underlying protocol, TCP/IP or UDP/IP. This is used to effect load balancing by keeping an affinity for a client to a particular server for the duration of a connection.

This is all fine except

Nevo Hed nevo (at) aviancommunications (dot) com 13 Jun 2001

The IP layer is L3.

Alright, I lied. TCPIP is a 4 layer protocol and these layers do not map well onto the 7 layers of the OSI model. (As far as I can tell the 7 layer OSI model is only used to torture students in classes.) It seems that everyone has agreed to pretend that tcpip uses the OSI model and that tcpip devices like the LVS director should therefore be named according to the OSI model. Because of this, the name "L4 switch" really isn't correct, but we all use it anyhow.

The director does not inspect the content of the packets and cannot make decisions based on the content of the packets (e.g. if the packet contains a cookie, the director doesn't know about it and doesn't care). The director doesn't know anything about the application generating the packets or what the application is doing. Because the director does not inspect the content of the packets (layer 7, L7) it is not capable of session management or providing service based on packet content. L7 capability would be a useful feature for LVS and perhaps this will be developed in the future (preliminary ktcpvs code is out - May 2001 - L7 Switch).

The director is basically a router, with routing tables set up for the LVS function. These tables allow the director to forward packets to realservers for services that are being LVS'ed. If http (port 80) is a service that is being LVS'ed then the director will forward those packets. The director does not have a socket listener on VIP:80 (i.e. netstat won't see a listener).
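You can see this for yourself on a director; a sketch (the VIP and port are assumed examples):

netstat -ltn                 # no listener appears for the VIP:80 service
ipvsadm -L -n                # yet the virtual service and its realservers are listed here
ipvsadm -L -n -c             # and the individual forwarded connections are visible here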

John Cronin jsc3 (at) havoc (dot) gtf (dot) org (19 Oct 2000) calls these types of servers (i.e. lots of little boxes appearing to be one machine) "RAILS" (Redundant Arrays of Inexpensive Linux|Little|Lightweight|L* Servers). Lorn Kay lorn_kay (at) hotmail (dot) com calls them RAICs (C=computer), pronounced "rake".

2.5. LVS forwards packets to realservers

The director uses three different methods of forwarding (the corresponding ipvsadm flags are sketched after this list):

  • LVS-NAT based on network address translation (NAT)
  • LVS-DR (direct routing) where the MAC addresses on the packet are changed and the packet forwarded to the realserver
  • LVS-Tun (tunnelling) where the packet is IPIP encapsulated and forwarded to the realserver.
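In ipvsadm the forwarding method is chosen per realserver; a sketch with assumed addresses:

ipvsadm -A -t 192.168.1.110:80 -s rr
ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:80 -m   # -m: LVS-NAT (masquerading)
ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.3:80 -g   # -g: LVS-DR (direct routing)
ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.4:80 -i   # -i: LVS-Tun (IPIP tunnelling)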

Some modification of the realserver's ifconfig and routing tables will be needed for LVS-DR and LVS-Tun forwarding. For LVS-NAT the realservers only need a functioning tcpip stack (i.e. the realserver can be a networked printer).

LVS works with all services tested so far (single and 2 port services), except that LVS-DR and LVS-Tun cannot work with services that initiate connections from the realservers (so far: identd and rsh).

The realservers can be identical, presenting the same service (e.g. http, ftp) and working off file systems which are kept in sync for content. This type of LVS increases the number of clients able to be served. Or the realservers can be different, presenting a range of services from machines with different services or operating systems, enabling the virtual server to present a total set of services not available on any one server. The realservers can be local/remote, running Linux (any kernel) or other OSs. Some methods for setting up an LVS have fast packet handling (e.g. LVS-DR, which is good for http and ftp) while others are easier to set up (e.g. transparent proxy) but have slower packet throughput. In the latter case, if the service is CPU or I/O bound, the slower packet throughput may not be a problem.

For any one service (eg httpd at port 80) all the realservers must present identical content since the client could be connected to any one of them and over many connections/reconnections, will cycle through the realservers. Thus if the LVS is providing access to a farm of web, database, file or mail servers, all realservers must have identical files/content. You cannot split up a database amongst the realservers and access pieces of it with LVS.
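LVS itself does not keep the content in sync; one common approach (an assumption here, not something LVS provides) is to push content from a staging machine to every realserver, e.g.:

# push the document root from a staging host to each realserver
for rs in 10.1.1.2 10.1.1.3 10.1.1.4; do
    rsync -a --delete /var/www/ root@$rs:/var/www/
done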

The simplest LVS to set up involves clients doing read-only fetches (e.g. a webfarm). If the client is allowed to write to the LVS (e.g. a database or mail farm), then some method is required so that data written on one realserver is transferred to the other realservers before the client disconnects and reconnects again. This need not be all that fast (you can tell users that their mail won't be updated for 10 mins); the simplest (and most expensive) approach is for the mail farm to have a common file system for all servers. For a database, the realservers can run database clients which connect to a single backend database, or else the realservers can run independent database daemons which replicate their data.

2.6. LVS runs on Linux and FreeBSD directors

LVS was developed on Linux and historically uses a Linux director. The Intel and DEC Alpha versions of LVS are known to work. The LVS code doesn't have any Intel-specific instructions and is expected to work on any machine that runs Linux.

In Apr 2005, LVS was ported to FreeBSD by Li Wang.

Li Wang dragonfly (at) linux-vs (dot) org 2005/04/16

The URL is: FreeBSD port of LVS (http://dragon.linux-vs.org/~dragonfly/htm/lvs_freebsd.htm). Here's a performance test on FreeBSD (version 0.4.0) (http://dragon.linux-vs.org/~dragonfly/software/doc/ipvs_freebsd/performance.html).

2.7. Code for LVS is different for each kernel series

There are differences in the coding for LVS for the 2.0.x, 2.2.x, 2.4.x and 2.6.x kernels. Development of LVS on 2.0.36 kernels has stopped (May 99). Code for 2.6.x kernels is relatively new.

The 2.0.x and 2.2.x code is based on the masquerading code. Even if you don't explicitly use ipchains (e.g. with LVS-DR or LVS-Tun), you will see masquerading entries with `ipchains -M -L` (or `netstat -M`).

Code for 2.4.x kernels was rewritten to be compatible with the netfilter code (i.e. its entries will show up in the netfilter tables). It is now production level code. Because of incompatibilities affecting LVS-NAT, LVS on 2.4.x was in development mode for LVS-NAT until about Jan 2001.

2.8. kernels from 2.4.x series are SMP for kernel code

2.4.x kernels are SMP for kernel code as well as user space code, while 2.2.x kernels are only SMP for user space code. LVS is all kernel code. A dual CPU director running a 2.4.x kernel should be able to push packets at twice the rate of the same machine running a 2.2 kernel (if other resources on the director don't become limiting). (Also see the section on SMP doesn't help.)

2.9. OS for realservers

You can have almost any OS on the realservers (all are expected to work, but we haven't tried them all yet). The realservers only need a tcpip stack - a networked printer can be a realserver.

2.10. LVS works on ethernet

LVS works on ethernet.

There are some limitations on using ATM.

Firewire: (from the Beowulf mailing list - Donald Becket 5 Dec 2002): The firewire transport layer (IEEE 1394) does run IP over FireWire. However, firewire is designed for fixed-size repeated frames (video or continuous disk block reads) and has overhead for other kinds of communication. Throughput is 400Mbps, but worst case latency is high (msec range).

Oracle has released GPL libraries for clustering Linux boxes over FireWire (http://www.ultraviolet.org/mail-archives/beowulf.2002/2977.html, link dead Dec 2003).

2.11. LVS works on IPv6

Seiji Tsuchiike tsuchiike (at) yggr-drasill (dot) com 02 Jun 2002

We have just implemented IPv6 for LVS. We think the basic mechanism is the same. (http://www.yggr-drasill.com/LVS6/documents.html; link dead Dec 2003, but Sep 2004 Horms says it's alive; Joe, Dec 2006: it's alive).

2.12. LVS is continually being developed

LVS is continually being developed and usually only the more recent kernel and kernel patches are supported. Usually development is incremental, but with the 2.2.14 kernels the entries in the /proc file system changed and all subsequent 2.2.x versions were incompatible with previous versions.

2.13. LVS is 64 bit

Kenny Chamber

Has anybody here successfully set up an lvs-director on a sparc64 machine? I need to know which distro is OK for this.

Ratz 16 Dec 2004

Yes. Just recently. Debian is fine, I reckon Gentoo would do as well.

INFO: It could be that the ipvsadm binary that comes with your distro to instrument the kernel tables for LVS is broken with regard to 64-bitness. You then need to download the latest sources and recompile, adding '-m64' to the CFLAGS. That's all; other than that it seems to work nicely.

Btw: I took Debian testing, probably not too wise but on the other hand I needed more up to date tools. I wouldn't know of too many other Distros that have up to date Sparc64 support. Suse used to have, but they dropped support a while ago unfortunately.
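A sketch of the rebuild Ratz describes (the version number is only an example; exactly where CFLAGS is set depends on the ipvsadm source you download):

tar xzf ipvsadm-1.24.tar.gz     # version number is just an example
cd ipvsadm-1.24
# append -m64 to the CFLAGS= line in the Makefile, then
make
make install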

Justin Ossevoort justin (at) snt (dot) utwente (dot) nl 16 Dec 2004

Well our plain debian-sarge here did it just as painlessly as our x86 based machines. So as long as your distro has ipvs (and of course a sparc tree ;)) support you're in the green.

liuah

I want to know whether LVS can work with 64-bit boxes. If I use LVS-DR, how can I apply the hidden patch to 64-bit linux, using kernel 2.4.18?

ratz 29 Nov 2003

Yes. The only problem I see is if either the counters or the hashtable handling has some bug with 32/64-bit signedness and wrong shift operators. Just let us know if you experience flakiness on your director ;). The hidden patch for your kernel is: http://www.ssi.bg/~ja/hidden-2.4.19pre5-1.diff I hope you are aware of the fact that 2.4.18 is really buggy in many ways. I know that some 64-bit archs have been lagging behind in the 2.4.x tree, but if I were you I would upgrade to a newer kernel.

Peter Mueller pmueller (at) sidestep (dot) com 29 Nov 2003

The one for straight 2.4.18 is http://www.ssi.bg/~ja/hidden-2.4.5-1.diff. Since he said 2.4.18 I would suspect he's running Debian. If you want a Debian kernel with LVS+hidden use the Ultramonkey kernel (http://www.ultramonkey.org/).

liuah liuah (at) langchaobj (dot) com (dot) cn 02 Dec 2003

The hidden patch compiles and runs on our 64-bit servers successfully.

2.14. Other documentation

For more documentation, look at the LVS web site (e.g. a talk I gave on how LVS works on 2.0.36 kernel directors).

Julian has written Netparse for which we don't have a lot of documentation yet.

For those who want more understanding of netfilter/iptables etc, here are some starting places. These topics are also covered in many other places.

2.15. LVS is not simple to install, get going or keep running

This is not a utility where you run ./configure && make && make check && make install, put a few values in a *.conf file and you're done. LVS rearranges the way IP works so that a router and server (here called director and realserver) reply to a client's IP packets as if they were one machine. You will spend many days, weeks, months figuring out how it works. LVS is a lifestyle, not a utility.

That said, you should be able to get a simple LVS-NAT setup working in a few hours without really understanding a whole lot about what's going on (see the LVS-mini-HOWTO).

2.16. LVS Control (Failure, Thundering Herd, Sorry Servers)

LVS is kernel code (ip_vs) and a user space controller (ipvsadm). When adding functionality to LVS (handling failover, bringing new machines on-line), where does it go? In the kernel code or in the user space code? Such decisions are relevant if you can choose from two equally functional lots of code - we usually get what the coder wanted to implement.

Current thinking is to make the kernel code just handle the switching and to have all control in user space.

Should the thundering herd problem be controlled by LVS or by an external user space program, e.g. feedbackd? Currently there is both a kernel patch and the user space approach of running a script that changes the scheduler from rr to lc shortly after adding a new realserver.
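A sketch of the user space approach (the VIP, the new realserver and the delay are assumed examples; ipvsadm -E edits an existing virtual service in place):

#!/bin/sh
# avoid the thundering herd when adding a realserver to an lc-scheduled service:
# run the service as rr while the new server fills up, then go back to lc
VIP=192.168.1.110:80
ipvsadm -E -t $VIP -s rr                        # switch the service to round robin
ipvsadm -a -t $VIP -r 10.1.1.4:80 -g -w 1       # bring the new realserver in
sleep 60                                        # let it pick up its share of connections
ipvsadm -E -t $VIP -s lc                        # back to least connection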

LVS supplies high throughput using multiple identically configured machines. You would like to be able to swap out machines for planned maintenance and to automatically handle node failure (high availability).

The LVS itself does not provide high availability. The current thinking is that the software layer that provides high availability should be logically separate from the layer that it monitors. Writing software that attempts to determine whether a machine is working is somewhat of a black art. There are several packages used to help provide high availability for LVS and these are discussed in the High Availability LVS section.

While it is relatively easy to monitor the functionality of the realservers, failover of directors is more difficult. An even greater problem is handling failure of nodes which are holding state information.

Note
There is a sorry server option in Section 29.6

Gustavo

I'm trying to create a sorry server for clients that can't connect to my realservers (limited by u-threshold); ServerA - 100 conn, ServerB - 110 conn. When this limit is reached I want my clients to go to a lighttpd-served page saying "come back later". I'm trying with weights and thresholds... but it's not working the way I thought.

Ratz 22 Nov 2006

I suspect the clients scheduled for the sorry server never return back to the cluster, right (only if you use persistency of course)?

That's right. I'm working on a project for an airline company. Sometimes they post promotional tickets for a small period of time (only passengers who buy on the website can get them) and the server load goes high.

I've written a patch for the 2.4 kernel series extending IPVS to support the concept of an atomically switching sorry server environment. Unfortunately I didn't have the time to port the work to 2.6 kernels yet (the threshold stuff is already in but a bit broken and the sorry server stuff needs some adjustments in the 2.6 kernel). If you run 2.4 on your LB, you could try out the patches posted to this list almost exactly one year ago:

http://marc.theaimsgroup.com/?l=linux-virtual-server&m=113225125532426&w=2
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=113225142406014&w=2

The fix to the kernel patch above:

http://marc.theaimsgroup.com/?l=linux-virtual-server&m=113802828120122&w=2

And the 3/4 cut-off fix:

http://www.in-addr.de/pipermail/lvs-users/2005-December/015806.html

I personally believe that the sorry-server feature is a big missing piece of framework in IPVS, one that is implemented in all commercial HW load balancers.

Horms

That is true, but it's also a piece that is trivially implemented in user-space, where higher-level monitoring is usually taking place anyway. Is there a strong argument for having it in the kernel?

ratz 14 Feb 2007

Yes, it won't work reliably in user space because of missing atomicity. From the point the user space daemon decides that it's time to switch over to the sorry-server pool to the actual switch in the kernel (by modifying the corresponding service flag), there's a time window of a couple of microseconds to milliseconds in which the kernel TCP stack will happily proceed with its normal tasks, including servicing more requests through the previously elected service rather than forwarding them to the sorry server. This can lead to broken (half-shown) page views on the customer's side, inside their browser.

In the field where I had to implement load balancing, this was simply not accepted, especially because it irritated our customer's clients and also because everybody knew that HW load balancers do it right (tm).

YMMV and I still didn't sit down and forward port my code to 2.6 but I first need some interest by enough people before I start :).

I wrote the 2.4 server pool implementation for a ticket reseller company that probably had the same problems as your airline company. Normal selling activities not needing high end web servers and then from time to time (in your case promotional tickets, in my case Christina Aguilera, U2 or Robbie Williams or World Soccer Championship tickets) peak selling where tickets need to be sold in the first 15 minutes having tens of thousands of requests per second, plus the illicit traffic generated by scripters trying to sanction the event. These peaks, however, do not warrant the acquisition of high-end servers and on-demand servers cannot be organized/prepared so quickly.

I need to manually limit each server's capacity and the remaining connections need to go to this sorry server.

That's exactly the purpose of my patch, plus you get to see how many connections (persistent as in session, and active/passive connections) are forwarded to either the normal webservers (so long as they are within u_thresh and l_thresh) or the overflow (sorry server) pool. As soon as one of the RS in the serving pool drops below l_thresh, future connection requests are immediately sent to the serving pool again.
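For reference, stock ipvsadm (not Ratz's 2.4 patch) also exposes per-realserver connection thresholds, assuming an ipvsadm/kernel recent enough to have the threshold options (-x/--u-threshold and -y/--l-threshold); a sketch with assumed addresses, using Gustavo's 100/110 connection caps:

# cap ServerA at 100 connections (-x); below 90 (-y) it can receive new connections again
ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:80 -g -w 1 -x 100 -y 90
# cap ServerB at 110 connections
ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.3:80 -g -w 1 -x 110 -y 100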

We have tried F5 Big-IP for a while and it worked perfectly, but it is very expensive for us :(

Yep, about USD 20k-30k to have them in a HA-pair.

So for the 2.4 kernel, I have a patch that has been tested extensively and has been running in production for one year now, having survived some hype events. I don't know if I'll find time to sit down for a 2.6 version. Anyway, as has been suggested, you can also try the sorry server of keepalived; however I'm quite sure that this is not atomic (since keepalived is user space) and works more like:

while true {
  for all RS {
    if RS.conns > u_thresh then quiesce RS
    if RS.isQuiesced and RS.conns < l_thresh then {
      if sorry server active then remove sorry server
      set RS.weight to old RS.weight
    }
  }
  if sum_weight of all RS == 0 then invoke sorry server with weight > 0
}

If this is the case, it will not work for our use cases with high peak requests, since sessions are not switched to one service pool or the other atomically. This will result in people being sent to the overflow pool even though they would have had a legitimate session, while others get broken pages back, because in the midst of a page view the LB's user space process gets a scheduler call to update its FSM, and so further requests (for HTTP 1.0, for example) will be broken. The browser hangs on your customer's side and your management gets the angry phone calls from the business users to whom you had promised B2B access.

This is roughly how I came around to implementing the server overflow (spillover server, sorry server) functionality for IPVS.