21. LVS: Persistent Connection (Persistence, Affinity in cisco-speak)

Note

Apr 2006: No-one has tried this yet, but it seems that the -SH scheduler could replace persistence, without the failover problems of persistence. The -SH scheduler schedules according to the client IP, meaning that all of a client's connection requests will be sent to the same RIP. The -SH scheduler has been around for a while, but it seems that no-one knew what it did; one of the problems was that no-one knew how to use the weight parameter.
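For anyone who wants to try it, here is a minimal sketch of a source-hash setup (VIP, RIP1 and RIP2 are placeholders; LVS-DR forwarding and https are just for illustration):

director:/etc/lvs# ipvsadm -A -t VIP:443 -s sh
director:/etc/lvs# ipvsadm -a -t VIP:443 -r RIP1:443 -g -w 1
director:/etc/lvs# ipvsadm -a -t VIP:443 -r RIP2:443 -g -w 1

With -s sh the realserver is chosen by hashing the CIP, so a returning client lands on the same realserver without any persistence template, as long as the set of realservers (and their weights) doesn't change. See the comments later in this section about the meaning of the weight parameter for -SH.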

Note
Sep 2002: Rewritten. All references to the LVS persistence used in kernels <2.2.12 have been dropped.

(For another writeup on persistence, see the LVS persistence page.)

For LVS, the term "persistence" has 2 meanings: the persistent (keepalive) connection of http/1.1, where the client holds a single tcpip connection open and reuses it for subsequent requests, and LVS persistence, where the director sends all of a client's tcpip connections to the same realserver.

The two types of persistence are quite different. Unfortunately, both features are persistent and can reasonably claim the name "persistent". This causes some confusion in nomenclature. LVS persistence, the subject of this section, could alternately be described as connection affinity or port affinity.

LVS persistence directs all (tcpip) connection requests from the client to one particular realserver. Each new (tcpip) connection request from the client resets a timeout (set by the -p option of ipvsadm). LVS persistence has been part of LVS for quite a while (the first implementation, by Pete Kese, was called pcc) and was added to handle ssl connections, squids and multiport connections like ftp (squids now have their own scheduler).

(LVS) persistence is also used when the realserver must maintain state, e.g. when the client fills a shopping cart, writes to an application such as a database, or must hold a cookie.

Persistence has consequences (described in the next section) that you should understand before using it in production. The ideal approach, from a theoretical point of view, is to rewrite the application so that data is propagated to all realservers immediately (or at least before the client initiates a new SSL session), allowing the LVS to run in non-persistent mode. Rewriting your application is difficult, but if you're in production with a secure (SSL) site, you're already spending money. Although we use every opportunity to exhort people to rewrite their applications, we find that most people don't and continue to use persistence.

Alternatives to persistence include the -SH scheduler (see the note above) and rewriting the application so that session state is shared by all the realservers (e.g. kept in a backend database), as discussed under Single Session below.

21.1. LVS persistence

LVS persistence makes a client connect to the same realserver for different tcpip connections. The LVS persistent connection operates at the layer 4 protocol level.

LVS persistence is rarely needed and has some pitfalls (as explained below). It's useful when state must be maintained on the realserver, e.g. for https key exchanges, where the session keys are held on the realserver and the client must always reconnect with that realserver to maintain the session.
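As a concrete sketch (VIP, RS1 and RS2 are placeholders; LVS-DR forwarding assumed), a single-port persistent https service looks like this:

director:/etc/lvs# ipvsadm -A -t VIP:443 -s rr -p 300
director:/etc/lvs# ipvsadm -a -t VIP:443 -r RS1:443 -g -w 1
director:/etc/lvs# ipvsadm -a -t VIP:443 -r RS2:443 -g -w 1

Here -p 300 gives each client 5mins of persistence: any new tcpip connection from that client within the timeout goes to the realserver chosen for the client's first connection.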

LVS persistence has two consequences

  • A client making a new tcpip connection, within the timeout period (usually 5-10mins), will be sent to the same realserver as on the previous connection. The new tcp connection will reset the timer. A connect request made past the timeout period will be treated as a new connection and will be assigned a realserver by the scheduler. The default timeout varies with LVS release, but is in the 300-600sec range.

    When implementing LVS persistence, there are problems in recognising a client as the same client returning for another connection. While the application can recognise a returning client by state information, e.g. cookies (which we don't encourage, see below for better suggestions), at layer 4, where LVS operates, only the IPs and port numbers are available. If it's left to the application to recognise the client (e.g. by a cookie), it may be too late: the client may already be on the wrong realserver and the ssl connection will be refused. For LVS persistence, the client is recognised by its IP (CIP) or, in recent versions of ip_vs, by CIP:dst_port (i.e. by the CIP and the port being forwarded by the LVS). If only the CIP is used to schedule persistence, the entries in the output of ipvsadm will be of the form VIP:0 (i.e. with port=0); otherwise the output of ipvsadm will be of the form VIP:port.

    Recognising the client is simple enough for machines on static IPs, but people on dial-up links

    • come up on a different IP for each dial-up session. If the phone line drops during a session the client will reappear with a different IP (but probably coming from the same class C network)
    • if they are coming through a proxy (like AOL), they will come from different IPs (again probably in the same class C network) for different tcpip connections within a single session (i.e. requests for hits for a web page may come from several IPs). (For more info see persistence granularity.)

    The solution to this is to set a netmask (e.g. /24) for persistence and to accept any IPs within this netmask as the same client (a sketch of the ipvsadm commands appears after this list). The downside is that if a significant fraction of your clients come from AOL, they will appear to be a single client and will all be beating on one realserver, while the other realservers sit nearly idle.

    Note
    For regular http, you don't care how many different IPs the client uses to request its hits for a single webpage, and you don't need persistence.

  • When all ports (VIP:0) are scheduled to be persistent, then requests by a client for services on different ports (e.g. to VIP:telnet, to VIP:http) will go to the same realserver. This is useful when the client needs access to multiple ports to complete a session. Useful multi-port connections are

    • 20,21 for active ftp
    • 21 and a high port for passive ftp
    • port 80,443 for an e-commerce site

    A side effect is that once persistence is set for all ports, requests by the client to any port, not just the ones you think the client is interested in, will be forwarded to the realserver. (The client will get a "connection refused" if the realserver is not listening on the other forwarded ports.) For security (to stop port scans etc), you'll have to filter requests to the other ports.

    The ports won't necessarily be paired in the way you want, e.g. in the (admittedly unlikely) event that you have an ftp and an e-commerce setup on the same LVS, both ftp and e-commerce requests from a client will go to the same realserver. What you'd like is for the e-commerce (80,443) requests to be scheduled independently of the ftp (20,21) requests, so that a client's ftp requests can go to one realserver while their requests to the e-commerce site go to a different realserver. It's simpler administratively to have the different services (ftp, http/https) on different LVSs.

    The all ports (VIP:0) approach is quite crude, and was a first attempt at bundling together connect requests for multiple services from a client. This side effect (of persistence activating all ports) does not arise if multiport services are forwarded by a persistent fwmark. To bundle services see fwmark (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.fwmark.html) - in particular persistence granularity with fwmark (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.fwmark.html#fwmark_persistence_granularity).
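Here is a sketch of the two refinements just mentioned: persistence granularity with a netmask (for clients coming through proxy farms) and bundling 80/443 with a persistent fwmark instead of VIP:0 (addresses, the /24 mask and the fwmark value 1 are only for illustration; the iptables commands assume a 2.4/2.6 kernel):

#treat all clients in the same /24 as one client for persistence
director:/etc/lvs# ipvsadm -A -t VIP:443 -s rr -p 300 -M 255.255.255.0

#bundle http and https: mark both with fwmark 1, then make the fwmark service persistent
director:/etc/lvs# iptables -t mangle -A PREROUTING -d VIP -p tcp --dport 80 -j MARK --set-mark 1
director:/etc/lvs# iptables -t mangle -A PREROUTING -d VIP -p tcp --dport 443 -j MARK --set-mark 1
director:/etc/lvs# ipvsadm -A -f 1 -s rr -p 300
director:/etc/lvs# ipvsadm -a -f 1 -r RS1 -g -w 1
director:/etc/lvs# ipvsadm -a -f 1 -r RS2 -g -w 1

The fwmark service forwards only the marked ports (80,443), so the "all ports" side effect of VIP:0 doesn't arise. See the fwmark HOWTO page for the full treatment.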

Note: the persistence timeout is the maximum time that can elapse between tcpip connections for the client to still be recognised as a returning client. Within a tcpip connection you still have the same idle timeouts as for other services.

Wensong Zhang wensong (at) gnuchina (dot) org 11 Jan 2001

The working principle of persistence in LVS is as follows:
  • a persistent template is used to keep the persistence between the client and the server.
  • when the first connection arrives from a client, the LVS box selects a server according to the scheduling algorithm, then creates a persistent template and the connection entry. The connection entry is controlled by the template.
  • later connections from the client are forwarded to the same server, as long as the template hasn't expired. Their connection entries are also controlled by the template.
  • as long as the template has controlled connections, it won't expire.
  • once the template has no controlled connections, it expires in its own time.

malcolm lists (at) netpbx (dot) org

What's the maximum setting for the persistence timeout? The docs say it's unlimited but I don't believe that :-).

Horms 25 Aug 2006

ipvsadm may have some other limit due to signedness issues and the like. But in the kernel it is stored as an unsigned int, which represents seconds. So any value between 0 and (2^32)-1 seconds is valid, which is potentially a rather long time.
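So, for example, a day of affinity is just a large -p value (a sketch; VIP:443 is a placeholder):

director:/etc/lvs# ipvsadm -A -t VIP:443 -s rr -p 86400

(86400sec = 24hrs.)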

21.2. Single Session

Related to the concept of persistent connection (whether implemented with LVS persistence or any other method) is the concept of single session. The client must appear to have only one session, as if the server is one machine. You must be able to recognise the client when they make multiple connections and data written on one realserver must be visible on another realserver. Also see distributed filesystems.

K Kopper 7 Jun 2006

If you are running Java applications (aka Java threads) inside a Java container (virtual machine), you should be able to tell the container itself how you want it to store session information (like a shopping cart). The method of storage can then automatically make the session information from one cluster node (real server) available to all cluster nodes, via a file sharing technique, multicasting to all the nodes, or storing the data on a database server (on a backend HA pair). If you are five pages deep into a shopping cart, for example, and the real server crashes, it won't be a problem if you land on a new real server with your next click of "submit": it can pull up your session information.

Check out 7-2 (page 64) of this document for the Oracle approach: http://download-west.oracle.com/otn_hosted_doc/ias/preview/web.1013/b14432.pdf

Or for the JBOSS way using multicasting via JGroups: http://www.jgroups.org/javagroupsnew/docs/index.html

Building an Oracle OC4J container that is highly available on the HA backend to store session information for a cluster works and seems like a good sound approach to me. The multicast way raises many doubts in my mind (especially if you need to lock the session information for any reason).

K Kopper karl_kopper (at) yahoo (dot) com 6 Jun 2006

To share files on the real servers and ensure that all real servers see the same changes at the same time a good NAS box or even a Linux NFS server built on top of a SAN (using Heartbeat to failover the NFS server service and IP address the real servers use to access it) works great. If you run "legacy" applications that perform POSIX-compliant locking you can use the instructions at http://linux-ha.org/HaNFS to build your own HA NFS solution with two NFS server boxes and a SAN (only one NFS server can mount the SAN disks at a time, but at failover time the backup server simply mounts the SAN disks and fails over the locking statd information). Of course purchasing a good HA NAS device has other benefits like non-volatile memory cache commits for faster write speed.

If you are building an application from scratch, then your best bet is probably to store data using a database and not the file system. The database can be made highly available behind the real servers on a Heartbeat pair (again with SAN disks wired up to both machines in the HA pair, but only one server mounting the SAN disks where the database resides at a time). Heartbeat comes with a Filesystem script that helps with this failover job. If your applications store state/session information in SQL and can query back into the database at each request (a cookie, login id, etc.) then you will have a cluster that can tolerate the failure of a real server without losing session information--hopefully just a reload click on the web browser for all but the worst cases (like "in flight" transactions).

With either of these solutions your applications do not have to be made cluster-aware. If you are developing something from scratch you could try something like Zope Enterprise Objects (ZEO) for Python, or in Java (JBOSS) there is JGroups to multicast information to all Java containers/threads, but then you'll have to re-solve the locking problem (something NFS and SQL have a long track record of doing safely). But you were just asking about file systems and I got off topic . . .

Christian Bronk chbr (at) webde (dot) de 02 Jun 2006

As long as you want AOL customers on your site, you will need a single session server for your cluster (any sort of database will do). Every request from AOL can come from a different proxy IP, and even setting a persistence netmask will not fix that.

malcolm lists (at) netpbx (dot) org 02 Jun 2006

The SH scheduler gives exactly the same kind of response as persistence, and it's layer 4, based on a source hash... There are hundreds of session implementations for web servers; it's one of the first things web programmers should learn (i.e. INSERT INTO sessiontable.....). LVS doesn't do L7 because L7 should be done by your app (i.e. that's what L7 is for).

Martijn Grendelman, 2 Jun 2006

I couldn't get the -SH scheduler to work (at the time not understanding the weight parameter), so I set up an Msession server for "session clustering" and used the RR scheduler. This setup works perfectly and is still in use today. However, since Msession is hopelessly outdated, its successor (Mcache) doesn't seem to be getting off the ground, and I haven't found any workable (open source) alternatives, I would really like to have another look at LVS persistence of some sort.

mike mike503 (at) gmail (dot) com 6 Jun 2006

IMHO storing data in blobs is a horrible idea.

If you are coding an application, I'd suggest checking out MogileFS. If this is for general purpose web hosting, where you need a normal POSIX filesystem to access, then that won't do. But for applications, it seems like a great idea (and from what small amount I read about the Google FS, it actually has a couple of the same traits)

As far as session management, a central session manager such as msession would work, or just roll your own off a database - it's simple in PHP (that is what I do) - then use DB failover/replication/etc. software to handle the DB clustering/failover.

21.3. Scheduling looks different under persistence

In a normal (non-persistent) LVS, if you connect to VIP:telnet with rr scheduling, you will connect to each realserver in turn. This is because the director schedules each tcpip connection as a separate item. When you logout of your telnet session and telnet to the VIP again, the director sees a new tcpip connection and schedules it round robin style, i.e. to the next realserver in the ipvsadm table.

However, if you then make the LVS persistent, the director schedules each CIP as a separate item. Repeated telnet tcpip connections (logins and logouts) to the VIP (within the persistence timeout period) will be regarded as the same scheduling item, since they are coming from the same client, and will all be sent to the same realserver. Even though rr scheduling is in effect, you will be connected to the same realserver. To test that the scheduler is round-robin'ing under persistence, you will need to login from several different clients (i.e. with different IPs), or after the persistence timeout has expired.
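A quick way to watch this from the director while you test from two or more client IPs (a sketch; for 2.2 kernels use ipchains -L -M -n instead of ipvsadm -Lcn):

director:/etc/lvs# watch -n1 'ipvsadm -Ln; echo; ipvsadm -Lcn'

ipvsadm -Ln shows the per-realserver ActiveConn/InActConn counters; ipvsadm -Lcn shows the individual connections and the NONE persistence templates, so you can see which realserver each client IP is locked to.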

If two services are scheduled as persistent (here telnet and http), they are scheduled independently. Here I have only 1 client (so it isn't a good test) and I connect twice by telnet and then twice by http. Scheduling is within the blocks set up by the `ipvsadm -A` command (each block starts with a "TCP ..." line). Here there are two blocks, scheduled separately.

ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:http rr persistent 360
  -> RS2.mack.net:http            Route   1      0          2
  -> RS1.mack.net:http            Route   1      0          0
TCP  lvs.mack.net:telnet rr persistent 360
  -> RS2.mack.net:telnet          Route   1      0          2
  -> RS1.mack.net:telnet          Route   1      0          0

Doing the same test a bit later, I found all connections going to the other realserver.

Will the timeout set on a persistent connection affect a socket that's been open for several days streaming data?

Horms 2005/02/22

No. The persistence timeout has no effect whatsoever on the timeout of open connections. They have their own timeouts, which are generally in line with those of TCP.

Will another connection from the same client go to a different realserver while there's an open socket with streaming data?

Not if you use persistence. If you use persistence, and either there is a connection open, or the persistence timeout has not elapsed since the last connection was closed, then a subsequent connection from the same end-user will go to the same real-server.

For those who care, this is all controlled by the expiry of connection entries and persistence templates by ip_vs_conn_expire().

21.4. Persistent and regular (non-persistent) services together on the same realserver.

If you set up both a non-persistent service (for testing, say telnet) and persistence on the same VIP, then all services will be persistent except telnet, which will be scheduled independently of the persistent services. In this case connections to VIP:telnet will be scheduled by rr (or whatever) and you will connect to all realservers in rotation, while connections to VIP:http will go to the same realserver.

Example: If you setup a 2 realserver LVS-DR LVS with persistence,

director:/etc/lvs# ipvsadm -A -t VIP:0 -p 360 -s rr
director:/etc/lvs# ipvsadm -a -t VIP:0 -r rs1 -g -w 1
director:/etc/lvs# ipvsadm -a -t VIP:0 -r rs2 -g -w 1

giving the ipvsadm output

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS2.mack.net:0              Route   1      0          0
  -> RS1.mack.net:0              Route   1      0          0

then (as expected) a client can connect to any service on the realservers (always getting the same realserver).

If you now add an entry for telnet to both realservers, (you can run these next instructions before or after the 3 lines immediately above)

director:/etc/lvs# ipvsadm -A -t VIP:telnet -s rr
director:/etc/lvs# ipvsadm -a -t VIP:telnet -r rs1 -g -w 1
director:/etc/lvs# ipvsadm -a -t VIP:telnet -r rs2 -g -w 1

giving the ipvsadm output

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS2.mack.net:0              Route   1      0          0
  -> RS1.mack.net:0              Route   1      0          0
TCP  lvs2.mack.net:telnet rr
  -> RS2.mack.net:telnet         Route   1      0          0
  -> RS1.mack.net:telnet         Route   1      0          0

the client will telnet to both realservers in turn, as would be expected for an LVS serving only telnet, but all other services (i.e. !telnet) go to the same realserver (the one chosen for the client's first connection). All services but telnet are persistent.

The director will make all ports persistent except those that are explicitly set up as non-persistent. These two sets of ipvsadm commands do not overwrite each other. Persistent and non-persistent connections can be made at the same time.

Julian

This is part of the LVS design. The templates used for persistence are not inspected when scheduling packets for non-persistent connections.

Examples:

  • ftp (LVS-NAT): connections to both ftp ports for passive ftp are handled by the module ip_masq_ftp. You don't need to add persistence for ftp with LVS-NAT.
  • ftp (LVS-DR or LVS-Tun): you need persistence on the realservers. Run the first set of commands above.
  • ftp and http (LVS-NAT): persistence not needed (ip_masq_ftp handles the ftp ports for active and passive ftp).
  • ftp and http (LVS-DR or LVS-Tun): persistence is needed to handle the two port protocol ftp. If you just have one entry in the ipvsadm table (persistence to VIP:0) then a client connecting to the http service of the LVS will always get the same realserver (this may not be a great problem). If you want to make the http service non-persistent while leaving all other services persistent, then add a non-persistent entry for http.
  • http and https (all forwarding methods): Normally an https connection is made after the client has made selections over an http connection, during which data is stored on the realserver for the client. In this case the LVS should be made persistent for all services.

Note: making the LVS persistent on all ports (VIP:0) means that _all_ ports are forwarded by the LVS to the realservers. Such an open, persistently connected realserver is a security hazard. You should have filter rules on the director to block all services on the VIP except those you want forwarded to the realservers (a sketch follows).
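A sketch of the sort of director filter rules meant here, assuming iptables (2.4/2.6 kernels) and that only http/https should reach the realservers through the persistent VIP:0 service (VIP is a placeholder):

#accept the services you actually offer on the VIP
director:/etc/lvs# iptables -A INPUT -d VIP -p tcp --dport 80 -j ACCEPT
director:/etc/lvs# iptables -A INPUT -d VIP -p tcp --dport 443 -j ACCEPT
#drop everything else addressed to the VIP
director:/etc/lvs# iptables -A INPUT -d VIP -j DROP

Packets for the VIP pass through the INPUT chain before ip_vs picks them up, so dropping them here stops them ever reaching the realservers (on 2.2 kernels you'd write the equivalent ipchains rules).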

21.5. Tracing connections: where will the client connect next?

You can trace your system in the following way. For example:

[root@kangaroo /root]# ipvsadm -ln
IP Virtual Server version 1.0.3 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  172.26.20.118:80 wlc persistent 360
  -> 172.26.20.91:80             Route   1      0          0
  -> 172.26.20.90:80             Route   1      0          0
TCP  172.26.20.118:23 wlc persistent 360
  -> 172.26.20.90:23             Route   1      0          0
  -> 172.26.20.91:23             Route   1      0          0

[root@kangaroo /root]# ipchains -L -M -n
IP masquerading entries
prot expire   source               destination          ports
TCP  02:46.79 172.26.20.90         172.26.20.222        23 (23) -> 0

Although there is no connection, the template isn't expired. So, new connections from the client 172.26.20.222 will be forwarded to the server 172.26.20.90.

For 2.4 kernels

director:/etc/lvs# ipvsadm -Lc
or
director:/etc/lvs# ipvsadm -Lcn

This shows the state of the connection (ESTABLISHED, FIN_WAIT) and the time left till persistence timeout.

21.6. Bringing down persistent services.

Note
This is the behaviour before late 2004.

21.6.1. Clearing the table

If a client is connected (persistently) to a realserver and the ipvsadm table is cleared (ipvsadm -C) then the connection will hang. If you then reinstall the original ipvsadm rules for that service, the connection will work again (and you'll see the correct entries in ActiveConn and InActConn). Wensong (a bit below) explains why the code doesn't clear the entry, but only removes the pointer to the entry.

Ratz

In newer versions of ip_vs (Sep 2002 or later) a sysctl lets you control what ip_vs does with existing connections when the ipvsadm table is cleared. Details are in the sysctl document http://www.linux-vs.org/docs/sysctl.html. e.g.

  • net.ipv4.vs.expire_nodest_conn=0

    maintain the entry in the table (but silently drop any packets sent), allowing service to continue if the ipvsadm table entries are restored.

  • net.ipv4.vs.expire_nodest_conn=1

    expire the entry in the table immediately and inform the client that the connection is closed. This is the behaviour some people expect when running `ipvsadm -C`.
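The sysctl is just a proc entry; checking and setting it looks like this (a sketch, for a 2.4/2.6 kernel with an ip_vs recent enough to have it):

director:/etc/lvs# cat /proc/sys/net/ipv4/vs/expire_nodest_conn
0
director:/etc/lvs# echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn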

However if you have some client at the other end buying $1M of your software with his credit card, you want to be nice to them. The nice way of deleting a service is to set the weight of the realserver to zero (so that no new connections will be allowed to that realserver) and then wait for the current connections to disconnect/expire before deleting the realserver entry (use a script to monitor the number of connections - a sketch follows). Since the client can stay connected for hours (for some services) you can't predict when you'll be able to bring your server down.
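The "nice way" can be scripted. A minimal sketch, assuming LVS-DR, that the RIP appears in only one virtual service, and naive parsing of the ActiveConn column (addresses are placeholders):

#!/bin/sh
#drain_realserver.sh - quiesce a realserver, wait for its connections to finish, then remove it
VIP=192.168.1.110:80
RIP=192.168.1.11:80
ipvsadm -e -t $VIP -r $RIP -g -w 0              #quiesce: no new clients scheduled here
while [ "$(ipvsadm -Ln | awk -v r="$RIP" '$2==r {print $5}')" != "0" ]; do
        sleep 10                                #wait for ActiveConn to drop to 0
done
ipvsadm -d -t $VIP -r $RIP                      #now safe to remove the realserver

With persistence, remember that clients holding an unexpired template will still get new connections to the quiesced realserver (unless expire_quiescent_template is set, see below), so the drain can take at least one persistence timeout after the last such client goes away.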

21.6.2. time to clear quiescent persistent connections

In a normal (non-persistent) tcp connection, after setting a service to weight=0, the ipvsadm connection (hash) table will clear FIN_WAIT time (with Linux, about 2 mins) after the last client disconnects. With persistent connection, the connection table doesn't clear till the persistence timeout (set with ipvsadm) has elapsed after the last client disconnects. This timeout defaults to about 5mins but can be much longer. Thus you cannot bring down a realserver offering a persistent service till the persistence timeout has expired - clients who have connected recently can still reconnect.
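To see whether any clients are still held by a persistence template (and so could still reconnect to the quiesced realserver), look for the NONE entries in the connection table - a sketch, with CIP/VIP/RIP as placeholders in the schematic output line (2.4 and later; on 2.2 use ipchains -L -M -n):

director:/etc/lvs# ipvsadm -Lcn | grep NONE
TCP 04:23 NONE        CIP:0              VIP:80             RIP:80

When no NONE entry pointing at that realserver is left, the persistence timeout has expired for every client and the realserver can be taken down.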

Tim Cronin wrote:

if you're using pasv you need persistence....

/sbin/ipvsadm -A -t 172.24.1.240:ftp -p
#forward ftp to realserver 192.168.1.20 using LVS-NAT (-m), with weight=1
/sbin/ipvsadm -a -t 172.24.1.240:ftp -r 192.168.1.20:ftp -m -w 1

If I change the weight of the .20 RIP to 0 and rerun the script, my connections continue to go to that server, even when I zero the weight and clear the table.

Julian 1 Nov 2002

Because the virtual service is marked persistent. In that case RSs with weight 0 can continue to accept _new_ conns.

21.6.3. Resetting timeout

The persistence timeout is not reset to the original value on each new tcpip connection; instead (as explained in the thread below) an expiring template with active connections is extended by IP_VS_S_TIME_WAIT (2 mins).

unknown (possibly Julian)

Yes, as implemented, the persistence timeout guarantees affinity starting from the first connection. It lasts _after_ the last connection from this "session" is terminated. There is still no option to say "persistence time starts for each connection"; it could be useful.

Terry Green, 7 Feb 2003

Agree completely - however, I expected the template record to be reset to the session persistence time, not to the value of IP_VS_S_TIME_WAIT

Julian Anastasov 2003-02-08 2:21:35

The persistence timeout is used only once: when the first connection from the client is established. The current meaning is that the persistence time covers the period after the client appears for the first time. It is extended if there are still active connections. Then there are 3 (or more) options:

  • extend it again with the persistent time
  • extend it with 2mins
  • use the persistence time after the last connection from client terminates

The second option is implemented, as that was what other users expected :)

A long time ago my opinion was that it would be good for the persistence time to be used when the last connection terminates (option 3 above). This could be a config option, if someone wants to implement it.

unknown (Julian?)

Maybe you see it 20 seconds after the 2-minute cycle is restarted. It is "reset" only when its timer expires, not when the controlled connections expire.

Terry

Nope - perhaps I wasn't clear... I was watching ipvsadm -Lc every second. I did the tests originally and saw the template record being reset to 2 minutes if it expired with an active connection (even though the persistence setting for the connection was NOT 2 minutes). Then I did another connect from the client, and the template record was reset again to 2 minutes (not the persistence setting again), suggesting the template record data structure had somehow had its persistence time reset from the original setting to 2 minutes.

Julian

Well, then it is not set to 1:40 but to 2:00 as expected.

Terry

Then, to prove to myself that my reading of the source was accurate, I hacked the source to make IP_VS_TIME_WAIT 2*50*HZ instead of 2*60*HZ, and with the newly compiled kernel, the template record started being reset to 100 seconds when it expired with an active connection.

Julian

True, your reading is accurate :) I now see why it was 1:40

Terry

My expectation would have been that the template record's timer would get reset to the session persistence value rather than to IP_VS_TIME_WAIT.

Julian

You can do it in your source tree, or implement it for other users as a config option. I don't know what other people think.

Ratz and another poster on 12 Aug 2004 like resetting the timeout to the persistence value.

21.6.4. Persistence is independent of the scheduler

The scheduler determines which realserver gets the next connection. With persistence, the realserver chosen for the client's first connection gets all the client's subsequent connections.

Horms 13 Sep 2004

Persistence operates independently of the scheduler. It does not matter if you use the RR, WLC, DH or any other type of scheduler, it always works the same way. That is, it looks up a persistence template and, if it finds one, uses it; otherwise it asks the scheduler what to do.

In other words, if there was a connection from a given end-user, and the persistence timeout has not expired, subsequent connections from the same end-user (masked with the persistence netmask) will go to the same realserver. As this lookup occurs _before_ a call to the scheduler, it is not affected by quirks in any scheduler.

Brett

I have an LVS director that uses wrr with 3600sec of persistence for two realservers. I noticed that connections going through a firewall from my internal network tend to get locked onto one of my realservers, and usually don't go to the other realserver unless all of the connections to the first realserver have expired.

Ratz 10 Aug 2004

Correct.

From what I understood, LVS is supposed to use the source IP for persistence, but I wasn't sure if it also uses the source port.

No, it doesn't. The persistent template is created as follows:

<protocol, caddr, 0, vaddr, vport, daddr, dport>

As you can see, the cport is set to 0 globally.

Horms

The source IP address is used, but the source port is not. This is because successive connections from the same host will almost certainly have a different ephemeral source port. There is no parameter in LVS to change this behaviour, though off the top of my head it would seem like a simple hack to alter this if you needed to for some reason.

Would using a different scheduler or a kernel upgrade (with a new lvs version) work around this?

Horms

Not likely.

Ratz

You would need to tweak ../net/ipv4/ipvs/ip_vs_core.c:ip_vs_sched_persist().

21.7. Forcing a break in a persistent connection: Horms code (Nov 2004) for quiescing persistent connections

This patch was written to allow loadbalancing of https, with failover. However it can also be used to force a break in a persistent connection. With persistence and the weight of a realserver set to 0, new clients will go to other realservers, but clients with an unexpired persistence template will keep being sent to the quiesced realserver until the template times out or the client stops reconnecting (whichever takes longer). Experience on the mailing list shows that this can be a long time; misconfigured clients stay connected forever. This patch forces the client's connection to break. The client probably will not be happy about this, but then you may not want to wait 24hrs to do maintenance either.

Nicola Pero nicola (at) brainstorm (dot) co (dot) uk 25 Nov 2004

Has anyone been able to set up ldirectord to load balance two HTTPS servers with failover?

The two real HTTPS servers are stateless (except for the SSL info in the web servers); there are few concurrent users (up to 10), but instant switchover in case of failure is essential.

Anyway, the problem we have is that when one of the two HTTPS servers goes down, the load balancer detects it but all clients connected to the server which is down keep being sent to it. Changing 'persistent', 'quiescent', timeouts etc didn't seem to have any effect on this!

Our case is also complicated by the fact that in certain cases we might decide that a realserver should not be used even if HTTPS is still running fine on the server. That might happen if the application sitting behind the HTTPS has a problem. We've got a URL on the realserver which can be checked to know if the realserver is OK to be used or not. Checking those seems to be working fine! The problem is with the realserver being marked as down, and all requests still being sent to it!

Keep in mind this is not a typical web farm, there are few concurrent users (most often 0 or 1), but it's critical that the web application is always available.

Malcolm Turnbull Nov 25, 2004

you definitely want quiescent=no (in ldirectord).

Horms 26 Nov 2004

Or use this patch. http://www.in-addr.de/pipermail/lvs-users/2004-February/011018.html

The patch just makes persistent sessions behave sensibly(tm) when a realserver is made quiescent. This isn't specific to HTTPS at all, but I think it is the problem that the user is seeing. The other solution is not to make the realservers quiescent, and just remove them instead.

2.4 patch (http://www.in-addr.de/pipermail/lvs-users/2004-February/011018.html), 2.6 patch (http://article.gmane.org/gmane.linux.network/18906).

The existing behaviour:

When a realserver is marked as quiescent (by setting the weight to zero) no additional connections will be allocated to that realserver by the scheduler (the LVS connection allocator, not the cpu scheduler, the packet scheduler, or your secretary).

This works quite well, unless the scheduler is bypassed for some reason. As it happens this occurs only if a virtual service is marked as persistent and there is a persistence template in existence - that is, there was recently a prior connection from the same end-user.

In this case the presence of the persistence template is sufficient for additional connections to be scheduled, despite the fact that the server is marked as quiescent. Though the connections have to be from an end-user (IP address/netmask) that was forwarded to the realserver in question within the persistence timeout.

My patch allows this behaviour to be changed, by expiring the templates when a real-server is marked as quiescent. Thus the scheduler gets called, and the behaviour is the same as for a non-persistent service, which is generally what people expect/want.

Joe

and just rip out the connections?

By removing a realserver you break all the connections and remove all the persistence templates. So no further connections are forwarded whatsoever. Actually, no further packets are forwarded. Unfortunately, this breaks connections that are in progress.

So what happens in the following case:

You've filled your shopping cart under http, then you go to https to give your credit card info, which usually takes at least 3 webpages (fill in your credit card and shipping info, click send, get confirmation page, click accept, get final page for printing). Let's say while you're reviewing the confirmation page, the realserver goes down and the LVS removes it by running ipvsadm. The tcpip state of the client is ESTABLISHED and the client has the SSL session ID. The LVS has to cache the credit card info somewhere to make it available to the new realserver. When the user hits the accept button, the browser presumably is going to get a tcpip reset from the new realserver. Does the browser just handle it and attempt to make a new tcpip connection? From what you say above, the browser will find that its SSL session ID is invalid and it will do the long handshake. Once that happens the client will hopefully be SSL connected to a realserver that knows about the credit card transaction already underway.

In the situation you describe above the main factors in determining if it would work or not are

  • how do the realservers store their data?

    if some sort of shared storage is used, say for example NFS, and the transaction is not in some half broken state, then it should be ok, though there might be a race in there

  • will the client's browser reconnect (either automatically or by the user hitting reload)?

    the answer is generally yes.

What I am trying to say is it really boils down to an interaction between the end-user's browser and the real-server's web-application. The LVS magic in between neither hinders nor helps the situation, other than allowing the end-user to connect to a different realserver if/when a reload occurs.

And the SessionID shouldn't really come into it. Because if it is still valid, it will be used, and if it is invalid it will be discarded and a full handshake will be performed. Sure, it might take an extra few moments, and possibly the real-server might be a bit overloaded if a lot of reconnects of this nature occur simultaneously, but the success (or failure) of the SSL handshake should not be affected.

I had thought that the keys are in memory and you can't move the keys/session data from one machine to another.

That is not the case, let me elaborate. (I wrote an SSL implementation once so I know this one :)

SSL makes use of public key encryption (e.g. RSA) and private key encryption (e.g. DES, AES) as well as a host of other techniques to make communications more secure. In a nutshell public key encryption - which is slow but does not require any prior agreement of keys - is used to negotiate a key that is used for private key encryption - which is fast, but requires a key to be negotiated. This key negotiation phase is part of the SSL handshake.

It turns out, particularly for small transfers as are typical on the web, that the public key encryption negotiation phase of the handshake is quite expensive. To alleviate this the server may (almost always will) give the client a Session ID during the course of the handshake. If the client reconnects it _may_ offer this Session ID and _if_ the server recognises it then an abridged version of the handshake is performed which relies on cryptographic information that both the client and server have cached.

Observe:

  • If the client does not offer a Session ID, the long handshake is performed.
  • If the server does not recognise the Session ID - perhaps because it has expired, perhaps because it is a different machine, perhaps because the Session ID is bogus - the long handshake is performed.
  • Also, if the client tries to guess a Session ID and guesses one that the server knows about, unless the cached key information it holds matches, the handshake will fail and the session will terminate. Guessing the cryptographic information is usually difficult at best, though it depends what cipher suite (combination of cryptographic algorithms) was used for the original session. Thus, DoS issues aside, guessing Session IDs is typically of little value.

It is the second point above that allows failover of SSL servers to work. The server is actually allowed to cache the Session ID for as long as it wants, including discarding it immediately. This is catered for by falling back to the long handshake if the Session ID is not matched.

Stephane Klein

I have an LVS using persistence. All is working well until I stop a real server. The director continues to send requests to the real server which was stopped. ipvsadm -Lcn confirms that requests are still sent to the stopped real server.

Horms 2005/03/22

The problem here is that persistence still takes effect even after the real server is removed (I assume you have quiescent=1). You can change this behaviour by running:

echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template

The effect of this is that the persistence templates are expired when a destination (realserver) is made quiescent, and thus no additional connections will be directed to the real server in question.
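On kernels that have this feature (2.6.x, or 2.4 with Horms' patch), the same thing can be done through sysctl and made permanent across reboots - a sketch:

director:/etc/lvs# sysctl -w net.ipv4.vs.expire_quiescent_template=1
#or, to have it set at boot, add this line to /etc/sysctl.conf:
net.ipv4.vs.expire_quiescent_template = 1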

21.8. what if a realserver holding a persistent (sticky) connection crashes

An explanation of the problem:

  • normal (non-persistent) connection to a service (e.g. httpd).

    If the server crashes while your tcpip connection is open, that connection will hang (it will eventually time out). The client will notice some icon showing that the browser is continuing to look for the page. The director will notice that the realserver has died and will remove the realserver from the ipvs table by first setting its weight to 0. This will stop any new connections, but allow current connections to continue (and eventually exit). Since the current connections are hung, the director will assume they have exited after the time of the tcp timeouts. Once the connection table for that realserver is empty, the entries for the realserver are removed from the ipvs table. Eventually the browser will timeout or the user will reload, thus establishing a new tcpip connection, whereupon the LVS will connect the user to a working realserver (the dead one not being sent any new connections). The connection with the original realserver was lost (or hung). Here the unit of "persistence" is the single tcpip connection - any new tcpip connection can be sent to a different realserver.

    Clients are used to connections on the internet hanging and will not realise that a realserver died on them. The behaviour of ip_vs here is satisfactory as far as the client is concerned.

    Note
    If the service was telnet, the client would have a hung session and would have to close out their window and reconnect. This is not satisfactory, but there's no way to transfer a tcpip connection to a new machine.

  • persistent connection to a service (e.g. https) with -p 600 (10 mins timeout).

    Everything is the same as for the non-persistent connection, except the criteria for terminating the user session.

    If you have set a persistence timeout on the director of 10mins, then the director is saying "no matter what happens, I will connect this client to that realserver for all tcpip connection requests for the next 10mins (even if the realserver is dead)". The director is guaranteeing that the realserver will be up for the next 10mins and the persistence extends beyond any single tcpip connection to cover new tcpip connections in the timeout period. If the director sets weight=0 for a realserver (e.g. if it has crashed), then new tcpip connections from the client will still be sent to the same (dead) realserver.

    The behaviour of ipvs, which satisfactorily removes realservers when the granularity is a tcpip connection, doesn't work when the LVS session can cover many tcpip connections.

Horms horms (at) verge (dot) net (dot) au 12 Apr 2004

Expire Quiescent Template. Here's the writeup.

This patch adds a proc entry to tell LVS to expire persistence templates for a quiescent server. As per the documentation patch below:

expire_quiescent_template - BOOLEAN

0 - disabled (default)
not 0 - enabled

When set to a non-zero value, the load balancer will expire
persistent templates when the destination server is quiescent. This
may be useful when a user makes a destination server quiescent by
setting its weight to 0 and it is desired that subsequent otherwise
persistent connections are sent to a different destination server.
By default new persistent connections are allowed to quiescent
destination servers.

If this feature is enabled, the load balancer will expire the
persistence template if it is to be used to schedule a
new connection and the destination server is quiescent.

The material below is older, from when the persistence code was at an earlier stage of development.

Ted Pavlic tpavlic_list (at) netwalk (dot) com

Is this a bug or a feature of the PCC scheduling...

A person connects to the virtual server, gets direct routed to a machine. Before the time set to expire persistent connections, that real machine dies. mon sees that the machine died, and deletes the realserver entries until it comes back up.

But now that same person tries to connect to the virtual server again, and PCC *STILL* schedules them for the non-existent real server that is currently down. Is that a feature? I mean -- I can see how it would be good for small outages... so that a machine could come back up really quick and keep serving its old requests... YET... For long outages those particular people will have no luck.

Wensong

You can set the timeout of the template masq entry to a small number now and the connection will expire soon.

Or, I could add some code to let each realserver entry keep a list of its template masq entries, and remove those template masq entries when the realserver entry is deleted.

To me, this seems most sensible. Lowering the timeouts has other effects, affecting general session persistence...

I agree with this. This was what I was hoping for when I sent the original message. I figure, if the server the person was connecting to went down, any persistence wouldn't be that useful when the server came back up. There might be temporary files in existence on that server that don't exist on another server, but otherwise... FTP or SSL or anything like that -- it might as well be brought up anew on another server.

Plus, any protocol that requires a persistent connection is probably one that the user will access frequently during one session. It makes more sense to bring that protocol up on another server than to wait for the old server to come back up -- it will be more transparent to the user. (Even though they may have to completely re-connect once.)

So, yes, deleting the entry when a realserver goes down sounds like the best choice. I think you'll find most other load balancers do something similar to this.

mike mike (at) bizittech (dot) com 28 Sep 2003

I am using LVS-DR to balance 4 MS servers. Due to the nature of the web application and the user behavior I had to set the connection timeout to 30 min.

Note
Joe: he does not specify whether this is the persistence timeout (i.e. whether he is set up with persistence) or the tcpip idle timeout. Presumably it is the persistence timeout.

In case of failure of one of the realservers, users need to be forced to connect to a different server. That means the lvs tables need to be cleared of connections from clients to the failed box, so that any reconnect attempt will open a new connection to one of the functioning servers. I am using ldirectord to startup and monitor.

I am using ldirectord to poll the realserver for the result of an asp page. In case of failure it sets the weight of the ipvs rule to 0. No new connections will be sent to the dead realserver, but on every retry the client still tries to connect to the dead realserver until the timeout of that connection. This is the expected behaviour according to the lvs documentation.

Joao Clemente jpcl (at) rnl (dot) ist (dot) utl (dot) pt

How do you delete the entry of the realserver?

Mike

Basically I'm using a similar rule to the one used to insert the virtual servers into lvs. It's something like this (I can't be 100% exact as I don't have access to my lvs box from home)

/sbin/ipvsadm -d -t $VIP:PORT -r $REALSERVER

Matthew Crocker matthew (at) crocker (dot) com 28 Sep 2003

Don't set the weight to 0; remove the realserver from the LVS table when it fails. When you remove the realserver from the table you also remove the information from the persistence table. Setting the weight to 0 is normally used for orderly shutdown of a realserver for maintenance.

Note
Joe: the entries for the current connections to the realserver stay in the ip_vs hash table until they timeout, even though they are no longer displayed in the default output of ipvsadm. These current connections can't be used with a dead realserver.

Peter Nash peter.nash (at) changeworks (dot) co (dot) uk 29 Sep 2003

I'm using LVS-NAT with persistence controlled by ldirectord. I've found that the "quiescent=" line in ldirectord.cf controls the behaviour you are looking for. If "quiescent=no" then when a realserver fails its LVS entries are removed from the table and clients immediately fail over to an alternate server. If "quiescent=yes" then when a realserver fails its entries remain in the LVS tables but the weight is set to 0, and clients will continue to try to connect to that server until the persistence expires. The default setting (on my installation) was "yes" and I had to change this to get the behaviour I wanted.

Rommel, Florian Florian (dot) Rommel (at) quartal (dot) com 28 Sep 2003

in your ldirectord.cf, add this line at the top (above your virtual section)

quiescent = no

it deletes the server entry from the table automatically if the server fails. Once the server is back up it'll add it back automatically. If that line is not set, the default is yes, which just sets the server to weight 0 and that leaves the connections persistent. I had to look for a while to find that little line.
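For context, a minimal ldirectord.cf sketch with quiescent=no and a persistent http virtual service (the addresses, check page and timeouts are made up; check the man page of your ldirectord version for the exact directives):

checktimeout=10
checkinterval=5
autoreload=yes
quiescent=no

virtual=172.24.1.240:80
        real=192.168.1.20:80 masq 1
        real=192.168.1.21:80 masq 1
        service=http
        request="alive.html"
        receive="OK"
        scheduler=rr
        persistent=300
        protocol=tcp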

Mike

Thanks Florian Rommel and Peter Nash, that was it.

vilsalio (at) eupmt (dot) es

I don't know how I can remove the persistence when one of my realservers crashes, without waiting for the timeout to expire.

ratz 27 Nov 2003

Please refer to the sysctl. /proc/sys/net/ipv4/vs/expire_nodest_conn should do what you want.

Patrick Kormann pkormann (at) datacomm (dot) ch

I have the following problem: I have a direct routed 'cluster' of 4 proxies. My problem is that even if a proxy is taken out of the list of real servers, the persistent connection is still active, which means that proxy is still used.

Andres Reiner

Now I found some strange behaviour using 'mon' for the high-availability. If a server goes down it is correctly removed from the routing table. BUT if a client made a request prior to the server's failure, it will still be directed to the failed server afterwards. I guess this has got something to do with the persistent connection setting (which is used for the cold fusion applications/session variables).

In my understanding the LVS should, if a routing entry is deleted, no longer direct clients to the failed server even if the persistent connection setting is used.

Is there some option I missed or is it a bug ?

Wensong Zhang wrote:

No, you didn't miss anything and it is not a bug either. :)

In the current design of LVS, the connection won't be drastically removed; instead the packets are silently dropped once the destination of the connection is down, because the monitoring software may mark the server temporarily down when the server is too busy, or the monitoring software may make some errors. When the server comes back up, the connection continues. If the server is not up for a while, then the client will timeout. One thing is guaranteed: no new connections will be assigned to a server when it is down. When the client re-establishes the connection (e.g. presses reload/refresh in the browser), a new server will be assigned.

jacob (dot) rief (at) tis (dot) at wrote:

Unfortunately I have the same problem as Andres (see below). If I remove a realserver from a list of persistent virtual servers, this connection never times out, not even after the specified timeout has been reached.

Wensong

The persistent template won't timeout until all its connections timeout. After all the connections from the same client expire, new connections can be assigned to one of the remaining servers. You can use "ipchains -M -L -n" (or netstat -M) to check the connection table (for 2.4.x use ipvsadm -Lcn, which reads /proc/net/ip_vs_conn).

Only if I unset persistency will the connection be redirected onto the remaining realservers. Now if I turn on persistency again, a previously attached client does not reconnect anymore - it seems as if LVS remembers such clients. It does not even help if I delete the whole virtual service and restore it immediately, in the hope of clearing the persistency tables.

director:/etc/lvs# ipvsadm -D -t <VIP>; ipvsadm -A -t <VIP> -p; ipvsadm -a -t <VIP> -r <alive realserver>

And it also does not help to close the browser and restart it. I run LVS in masquerading mode on a 2.2.13 kernel patched with ipvs-0.9.5. Wouldn't it be a nice feature to flush the persistent client connection table, and/or list all such connections?

Wensong

There are several reasons that I didn't do it in the current code. One is that it is time-consuming to search a big table (maybe one million entries) to flush the connections destined for the dead server; the other is that the template won't expire until its connections expire, and the client will be assigned to the same server as long as there is an unexpired connection. Anyway, I will think about a better way to solve this problem.

21.9. Load Balancing time constant is longer with persistence

(This is from a thread on 'Preference' instead of 'persistence' started by Martijn Klingens on 2002-10-08.)

Load balancing occurs with a time constant of the order of the lifetime of a connection to the LVS. For a non-persistent connection like http, with FIN_WAIT=2mins, loads will balance on a time scale longer than 2mins; at shorter time scales, the loads will not be balanced. For persistence with a persistence timeout of 30mins, load balancing will require times greater than 30mins (like several hours).

This problem is related to the unbalance caused by proxy farms (e.g. AOL).

21.10. The tcp NONE flag

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/04/26

What does the TCP flag NONE mean? When I make a connection through LVS to a real server and look in the connection table I normally get

TCP 17:24 ESTABLISHED 173.19.13.214:1736 173.19.15.175:80 173.19.12.243:80 

And everything including persistence works as expected. But when I connect using a bit of javascript from IE (client side) I get:

TCP 17:24 NONE 173.19.13.214:0 173.19.15.175:80 173.19.12.243:80 

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com

This is the way LVS manages persistence: by creating a NONE connection in the connection table.

And the first connection gets a 404 error, further refreshes work fine, and then persistence doesn't seem to work? Do connections with a status of NONE not get put in the persistence table? The javascript is refreshing a page from the server every 1 minute. If you set the javascript to go every 10mins you get far more 404 errors.

This looks strange. Your persistence timeout seems to be about 20min (the time to timeout is just after the string "TCP"), so 1min or 10min should be the same. I would suspect a problem in the js itself. Look at your server error_log.

This post (http://www.in-addr.de/pipermail/lvs-users/2005-February/013235.html) suggested dropping all TCP NONE entries as they weren't required.

Your servers do not receive the NONE TCP connections; they are created locally and are just there for persistence management purposes.

21.11. Resetting the persistence timeout counter (persistence behaviour for short timeout values)

Terry Green tgreen (at) mitra (dot) com 2003-02-06

the LVS-HOWTO states:

With persistent connection, the connection table doesn't clear till the persistence timeout (set with ipvsadm) time after the last client disconnects.

This appears to be not quite true. (In the following tests I'm using Kernel 2.4.19 with patch 1.0.7)

Testing/Observations - using a port 80 definition with 5 minute persistence (keepalived being used to do the configs).

# ipvsadm
  IP Virtual Server version 1.0.7 (size=16384)
  Prot LocalAddress:Port Scheduler Flags
     -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  TCP  devlivelink:http rr persistent 300
     -> devlivelink2:http            Route   1      0          0
     -> devlivelink1:http            Route   1      0          0
  TCP  devlivelink:https rr persistent 300
     -> devlivelink2:https           Route   1      0          0
     -> devlivelink1:https           Route   1      0          0

I start a connection to the web server for purposes of downloading a large file (which will take more than 5 minutes). Every time I connect from the client, I see the connection template timeout reset to 5 minutes, as you would expect from the persistence timeout value (300sec).

# ipvsadm -Lc
  TCP 04:59  NONE        greenblade.mitra.com:0 devlivelink:http devlivelink1:http
  TCP 00:03  FIN_WAIT    greenblade.mitra.com:51330 devlivelink:http devlivelink1:http
  TCP 00:02  FIN_WAIT    greenblade.mitra.com:51329 devlivelink:http devlivelink1:http
Note

I've shortened the TCP timeouts of the IPVS connection entries for the purposes of testing, using ipvsadm --set 5 4 0

However, if the template record's timer is allowed to expire while there's still an active connection, the template will be kept, but its timer will be reset to the IP_VS_S_TIME_WAIT constant (which defaults to 2 minutes in ip_vs_conn.c) rather than to the persistence time set for this service. Further, the data structure for the connection template appears to have been corrupted, as any further connections from the client reset the template time to 2 minutes, instead of the original persistence time.

To verify this, I changed line 317 of ip_vs_conn.c from

        [IP_VS_S_TIME_WAIT]     =       2*60*HZ,
    to
        [IP_VS_S_TIME_WAIT]     =       2*50*HZ,

and recompiled the kernel

Rerunning the tests, I see the connection template record being reset to 1:40 instead of 2:00. Here are the IPVS connection entries (output of ipvsadm -Lc) as time progresses.

pro expire state      source                     virtual          destination
TCP 04:50 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 04:49 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:04 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 04:48 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:04 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 04:47 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

.
.

TCP 00:02 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:04 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 00:01 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

(here being reset to 1:40)

TCP 01:39 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 01:38 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

Julian

Yes, as implemented, the persistence timeout guarantees affinity starting from the first connection. It lasts _after_ the last connection from this "session" is terminated. There is still no option to say "persistence time starts for each connection"; it could be useful.

Terry

Agree completely. However, I expected the template record to be reset to the session persistence time, not to the value of IP_VS_S_TIME_WAIT.

Julian

The persistence timeout is used only once: when the first connection from this client is established. The current meaning is that the persistence time covers the period after the client first appears. It is extended if there are still active connections. For extending it, there are 3 (or more) options:

  1. extend it again with the persistent time
  2. extend it with 2mins
  3. use the persistence time after the last connection from client terminates

The second option is the one implemented, as we found it was the behaviour the users expected :) A long time ago my opinion was that it is better to use the persistence time when the last connection terminates (item 3 above). We could make this a config option if anyone wants it.

Maybe you see the value 20 seconds after the 2-minute cycle is restarted. It is "reset" only when its timer expires, not when the controlled connections expire.

Terry

Nope - perhaps I wasn't clear... I was watching ipvsadm -Lc every second. I did the tests originally and saw the template record being reset to 2 minutes if it expired with an active connection (even though the persistence setting for the connection was NOT 2 minutes). Then I did another connect from the client, and the template record was reset again to 2 minutes (not the persistence setting again), suggesting the template record data structure had somehow had its persistence time reset from the original setting to 2 minutes.

Then, to prove to myself that my reading of the source was accurate, I hacked the source to make IP_VS_S_TIME_WAIT 2*50*HZ instead of 2*60*HZ, and with the newly compiled kernel, the template record started being reset to 100 seconds when it expired with an active connection.

My expectation would have been that the template record's timer would get reset to the session persistence value rather than to IP_VS_S_TIME_WAIT.

True, your reading is accurate :) I now see why it was 1:40

Joe - other people have found this behaviour too

chulmin2 (at) hotmail (dot) com 2003-02-11

I have set the persistence timeout to 30s

ipvsadm -A -t 211.1.1.1:80 -p 30

after I connected, I confirmed the settings

# ipvsadm -Lc
TCP 00:30.00 NONE        211.1.1.2:0     211.1.1.1:http 192.168.1.3:http
TCP 02:00.00 TIME_WAIT   211.1.1.2:40929 211.1.1.1:http 192.168.1.3:http

But after 30s the timeout returns to 2 mins.

TCP 02:00.00 NONE        211.1.1.2:0     211.1.1.1:http 192.168.1.3:http
   ~~~~~~~~~~
TCP 01:30.00 TIME_WAIT   211.1.1.2:40952 211.1.1.1:http 192.168.1.3:http

Here's Terry's summary:

I observed the same behavior, and traced it down to the scenario where the template record times out with valid connection records still counting down. In this case, the template record is reset to 2 minutes (actually, to the value of the IP_VS_S_TIME_WAIT constant). When this happens, the data structure record representing the template connection also gets altered, because any further connections from the client reset the template record to 2 minutes (NOT the original session persistence time).

The replies I got from Julian suggested that this behavior was intended (and thus, I would suggest, the documentation is slightly inaccurate). I didn't pursue it too far, as this only showed up when I was using really short persistence times for testing purposes. I don't expect it will happen too often or have too much impact when using a more practical session timeout.

21.12. Why you don't want persistence for your e-commerce site: why you should rewrite your application

Malcolm Turnbull Malcolm.Turnbull (at) crocus (dot) co (dot) uk 18 Sep 2002

The main problem with using persistence for session variable tracking is that the only thing you are gaining by using LVS is increased performance. You are not getting any high availability, i.e. if your realserver falls over during a persistent SSL session, you lose your shopping basket (or whatever).

Anyone using ASP/IIS will be well used to the service restarting all the time due to the 64MB ASP memory limit in IIS5 (wonder if they'll raise this in .net)

My wife always leaves web sites open for things like holidays/hotels etc so that when I come home I can see it... Often as soon as I click anything I lose the session.. :-(

Bad design. To save money on re-coding.. code it properly in the first place.

Joe Stump joe (at) joestump (dot) net 22 Nov 2002 (replying to another thread)

What Joe is trying to get at here (and this would apply to you PHP session users out there as well) is that your realservers should have access to your session files. The simple solution is a shared drive (under windows) or an NFS mount (under *NIX). Other solutions include NetApps, SANs, etc. The problem Devendra was having is that the session files exist independently on each of the realservers. When a realserver dies or is taken out of the RAIC all of the session files on that realserver are gone. If they were in a central location where all servers had access they wouldn't die.
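
Note
Here's a minimal sketch of the shared session store idea for *NIX realservers; the fileserver name, export and paths are made up, and the php.ini setting is only one example of pointing an application at the shared store:

# on each realserver, mount a common NFS export for the session files
mount -t nfs fileserver:/export/sessions /var/lib/php/sessions
# then point the application at it, e.g. in php.ini:
#   session.save_path = /var/lib/php/sessions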

Joe

I've sat on https sites for more than 30 mins of inactivity. Also I've had the modem line drop on me in the middle of filling in forms on badly written websites (e.g. registering a domainname); when I come back, I have a new IP. I expect anyone who wants to do internet business to handle these problems seamlessly.

Roberto Nibali ratz (at) tac (dot) ch 10 Sep 2002

Exactly, and most of the time you've got non-technical stakeholders or managers in the back that will rip your head off if that happens.

Persistence only gets you so far here, since memory requirements limit you to the number of connections maintained.

Yes, memory and timeout constraints combined in a linear fashion.

Ratz's idea (in the HOWTO) is to redesign the application. He can do that. Not everyone can. He maintains state data on the servers with a database.

Everyone can, and the other people can work with Tomcat's internal state replication module to do that. But it's slow last time I tested it (1 year ago) and tends to have nasty locking issues.

Alternately, in php3 you could write the state information into the URL that the client moves to on the next click (this functions the same as cookies).

Note
Joe Dec 2003: I thought this was the solution for quite a while. However I now find that since all the data is encoded in a long string as part of the URL, the client can manipulate it, making the data at the client and server different. You do not want this to happen.

If you can't rewrite the application, then you'll risk losing some customers and I would say that LVS is not for you.

DoS problems are difficult for everybody. With persistence it's just worse.

DoS problems are not to be solved on the LVS box.

Matthias Krauss 10 Sep 2002

What is the maximum timeout value?

Joe

There is no maximum value. However the connection underneath will timeout eventually, and you will start to use a lot of memory with a large number of connections.

Julian

Note that setting the RS weight to 0 is treated as "temporarily stopped". The existing connections continue to work. It is assumed that the RS weight is set to 0 some time before deleting the RS. This way we give all connections/sessions time to terminate gracefully. Sometimes weight 0 can be used by the health checks as a step before deleting the RS. Such a two-step realserver shutdown can avoid temporary unavailability of the realserver: a graceful stop. At least, the health checks can choose whether to stop the RS before deleting it.

Roberto Nibali ratz (at) tac (dot) ch 13 Sep 2002

It's also useful to introduce a service level window for maintenance work. If you have a service level agreement with only a few minutes downtime a year and you need to exchange the HD of one RS, you can quiesce that particular RS about 1 hour before the maintenance work, and if you have a reasonably low or (most of the time) even no active connection rate, you unplug the cable, shut down the server and fix the problem. Then you put it back in, set the weight > 0 and off you go.

If the RS is deleted the traffic for existing conns is stopped (and if expire_nodest_conn sysctl var is set the conn entries are even deleted). Of course, if for some connections we don't see packets these conns can remain in the table until their timer is expired.
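
Note
A minimal sketch of the two-step (quiesce, then delete) realserver shutdown described above, for an LVS-DR http service; $VIP and $RIP1 are placeholders:

# weight 0 quiesces the realserver: no new clients/sessions are scheduled to it
ipvsadm -e -t $VIP:80 -r $RIP1:80 -g -w 0
# watch the existing connections (and persistence templates) drain
ipvsadm -L -c -n
# when they're gone, remove the realserver
ipvsadm -d -t $VIP:80 -r $RIP1:80
# optionally, have entries pointing to a deleted realserver expire immediately
echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn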

Bobby Johns bobbyj (at) freebie (dot) com 13 Sep 2002

When you add in the persistence problem I suspect you're doing something that's a bad idea. I suspect the reason you need persistence (or think you do) is because you're storing state or session information locally on each web server. Although it may work, it's a weak design for a web app. If you want a high performance solution, use a common server with something like MySQL on it to hold the session or state information. If you're nervous about the single point of failure on the database box, add a replicated server behind it. Keeping state info on each web server is just a weak solution in a highly-available high-performance environment. Hardware is pretty cheap in comparison.

I would suggest 2 LVS servers running HA between them, 2 or more web servers, and 2 session/state db servers running replicated. Bang for the buck, it's a good solution and gives you a pretty resilient, robust, and scalable system. The system you are trying to implement now will hammer 33% of your user sessions if you have a web server failure and ALL of them if you have an LVS server failure. With the proper monitoring and HA, no single machine failure will hammer your users in the system I suggest. For the price of 6 or 7 Linux server boxes, you have what people used to pay more than $100K for just a few years ago.

21.13. more about e-commerce sites: we used to think memory was the problem - it isn't

The original idea of persistence was to allow for connections like https sessions. This solved the problem of keeping the client's connection on the same realserver. However it doesn't work well.

The first problem is that it uses a lot of memory. The default timeout for LVS persistence is somewhere around 360secs, while the default timeout for a regular LVS connection via LVS-DR is TIME_WAIT (about 1 minute). This means that persistent connections will stay in the LVS connection table about 6 times longer. As a consequence the hash table (and memory requirements) will be 6 times larger for the same number of connections/sec. Make sure you have enough memory to hold the increased table size if you're using persistent connections. If the persistence is being used to hold state (e.g. a shopping cart), then you must allow a long enough timeout for the client to surf to another site for a better price, make a cup of coffee, think about it and then go find their credit card. This is going to be much longer than any reasonable timeout for LVS persistence, so the state information will have to be held on a disk somewhere on the realservers, and you'll have to allow for the client to appear on a different realserver later with their credit card information.

The next problem is that persistence doesn't allow for failover.

The memory problem really isn't as bad as was originally thought. Here's some exchanges on the mailing list which talk about the real problems.

Joe 18 Sep 2002

The conventional LVS wisdom is that it's not a good idea to build an LVS e-commerce website in which https is persistent for long periods. The initial idea was that a long timeout allows the customer to have a cup of coffee or surf to other websites while thinking about their on-line purchase.

Julian 18 Sep 2002

Yes, if your site uses persistence for HTTP/HTTPS then you had better use cookies (not LVS persistence). If you don't care about HTTPS persistence (any realserver can serve connections from one client "session") then you create a normal service. In that case your concern is the backend DB.

The problem with this approach is that the amount of memory used is expected to be large and the director will run out of memory. We've been telling people to rewrite their application so that state is maintained on the realservers, allowing the customer to take an indefinite time to complete their purchase. Currently 1G of memory costs about an hour of programmer's time (+ benefits, + office rental/heating/airconditioning/equipment + support staff). Since memory is cheap compared to the cost of rewriting your application, I was wondering if brute force might just be acceptable. I can't find any estimates of the numbers involved in the HOWTO although similar situations have been discussed on the mailing list e.g.

http://marc.theaimsgroup.com/?l=linux-virtual-server&m=99200010425473&w=2

there the calculation was done to see how long a director would hold up under a DoS. The answer was about 100secs for 128M memory and 100Mbps link to the attacker doing a SYN flood. I'm not running one of these web sites and I don't know the real numbers here. Is amazon.com or ebay connected by 100Mbps to the outside world?

What can you do with 1G of memory on the director? Each connection requires 128 bytes. 1G/128 is 8M customers online at any one time. Assuming everyone buys something this is 1500 purchases/sec. You'd need the population of a large town just to handle shipping stuff at this rate. I doubt if any website at peak load has 8M simultaneous customers.

However you only have 64k ports on each realserver to connect with customers, allowing only 64k customers/realserver.

Note that the port limit is only between two IPs. You can still reuse one port for many connections if the two connections don't have the same ends (IP and port).

How much memory do you need on the director to handle a fully connected realserver?

64k x 128 = 8M

Let's say there are 8 realservers. How much memory is needed on the director?

8 x 8M = 64M

This is not a lot of memory. So the problem isn't memory, but realserver ports AFAIK.

No, you don't waste realserver ports for connections from the client to the LVS. But using many sockets on the realserver hurts. Memory for sockets is a problem; sometimes the sockets can reserve huge buffers for data.

What is the minimum throughput of customers assuming they all take 4000 sec (66 mins) to make their purchase?

8 x 64k/4000 = 64 purchases/sec

You're still going to need to hire a few people to pack and ship all this stuff. If people only take 6 mins for their purchase, you'll be shipping 640 packages/sec.

Assuming you make $10/purchase at 64 purchases/sec, that's $2.5G/yr.

So with 64M of memory, 8 realservers, 4000sec persistence timeout, and a margin of $10/purchase I can make a profit of $2.5G/yr.

It seems memory is not the problem here, but realserver ports (or being able to ship all the items you sell).

Let's look at another use of persistence - for squids (despite the arrival of the -DH scheduler, some people prefer persistence for squids).

Here you aren't limited by shipping and handling of purchases. Instead you are just shipping packets to the various target httpd servers on the internet. You are still limited to 64k clients/realserver. Assume you make persistence = 256 secs (any client who is idle for that time is not interested in performance). This means that the throughput/realserver is 256 hits/sec. This isn't great. I don't know what throughput to expect out of a squid, but I suspect it's a lot more.

Ratz

Well, it depends what you want to offer. If it's an online shop like amazon.com you certainly want to store the generated cookie (or whatever it is) on a central DB cluster that every RS can connect to and query for the ID if it doesn't already have it.

The memory is a completely different layer. It's about software engineering and not about saving money. Yes, you can probably kill the problem temporarily by adding more memory, but a broken application framework remains a broken application framework.

Plus, normally when you build an e-commerce site, you have a customer that has outsourced this task to your company. So you do a C-requirement and a feasibility study to provide the customer with a proper cost estimation. Now you build the application, and it is built in a broken way, so you need to either fix it or (in our case) add more RAM. The big problems here are:

  • you might have a strict SLA that doesn't permit this
  • you change the C-requirements and thus you need a new test phase
  • the customer gets upset because she spent big bucks on you

It's lack of engineering and a typical situation of plain incompetence: When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do.

But all this also depends on the situation. I don't think we can give people a generalised view of how things have to be done. One might argue that people come to this project because of monetary constraints and they sure do not care about the application if the problem is solved by putting more RAM into the director.

I, for example, would rather spend a few bucks on good hardware and a lot of RAM for the RS, because they need to carry the execution weight of the application. The director is just a more or less intelligent router.

pb (who has 1GB of memory and who wants to increase his persistence time to 60mins)

We handle 1 million messages a day, and 20,000+ webmail users, thus 125,000 messages per hour sent/received in an 8 hour work day. Would changing the persistence on the LVS from 15 to 60 min take up a lot of memory and processing (CPU/load) overhead? We're running 1GB of memory and dual Pentium IIIs.

Malcolm Turnbull malcolm (at) loadbalanceri (dot) org 27 Apr 2004

I would think it would be fine, 1 GB should handle almost 8 million connections in the timeout period i.e. 60 mins (or 2mins with no persistence).

Horms horms (at) verge (dot) net (dot) au 27 Apr 2004

I think Malcolm is on the money here. Keep in mind that each connection entry / persistence timeout consumes something like 128 bytes (actually, it might be a bit bigger now, but it is still in that ball-park). You can do the maths (actually you should, my brain has already checked out for the day), but if you are getting 100 connections/s, for an hour, each from a unique host, then you are still only going to end up using about 45MB of memory for persistence entries. I doubt that will hurt you. I would also be surprised if you are getting connections from 360,000 unique hosts per hour :-)
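
Note
The back-of-the-envelope arithmetic, as a shell sketch of Horms' numbers (100 new clients/s, each holding a 128 byte entry for the hour-long persistence timeout):

echo $((100 * 3600 * 128))    # 46080000 bytes, i.e. roughly 45MB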

Joe

For 4 realservers that's 40k messages/hr. I don't know how many tcp connections are required for a message transfer, but let's say it's 1. You have 15 mins persistence, so 10k connections will be in existence at any one time.

For memory for the ipvs hash table: At 128bytes/connection, that's 1.28M of memory for the ipvsadm hash table. You have quite a margin with memory.

For disk and network I/O:

Let's say the average e-mail is 10kB. Each realserver is processing (10k messages/(15*60) secs) * 10kB = 0.1MBytes/sec. Your disks and network also have large margins of safety.

The error just a couple of random people are having with WEBMAIL is "invalid session ID", as though they lost their connection to the realserver (actually a "message director") they were on. But I don't know if it is the "message director's" fault, or LVS.

I have no idea, but I don't see any heavy load here. Are the clients timing out after 15 mins and attempting to continue their session? Wouldn't the app/client know that the session has been closed and go through the whole login procedure again? I don't know much about your app, I'm sorry.

Nothing here addresses the issue of the persistence timeout. This is determined by how long you allow the client to be disconnected before you propagate to all realservers the state changes on the realserver that occurred in the last connection.

21.14. persistence with windows realservers

With unix realservers, we've been encouraging developers to rewrite the application (see rewriting your e-commerce application), to save client state in a failover safe fashion (i.e. in a place accessible by all realservers). Previously you would ask the client to accept a cookie or save the client state on the realserver to which the client connects (but where the state will be lost if the realserver fails). Rewriting the application is possible with unix, which gives you access to the primitives and you can build the application any way you want (provided that you have enough time and you understand the primitives well enough).

With Windows, you aren't given access to the primitives, but instead are given access to an API. If Windows has already coded up the function you want and you are happy to use that, then it's easy. If you want something else, you're SOL.

devendra orion dev_orion (at) yahoo (dot) com 22 Nov 2002

I need to enable loadbalancing on our current director machine (LVS-NAT enabled). We currently have 3 realservers serving the same website which need to be loadbalanced. The website is hosted on w2k and uses IIS session management (no cookies). The only problem is we need to keep this session alive for basically 8 hrs as our clients access the application continuously. How can I configure the loadbalancer to keep connections persistent to the same server after a successful client login?

Joe (giving the party line)

The best solution is not to use persistence, but to re-write the application, so that the state information is stored in a place accessible to all realservers. In this way, if one realserver fails, the session with the client can continue.

Alex Kramarov alex (at) incredimail (dot) com 22 Nov 2002

I continuously hear on this list suggestions to rewrite applications to use other session management means than the one that comes with IIS. As a Windows/Unix developer/administrator (you can mix and match any one of the 2 groups ;), I would really like to say that usually this is not that easy in the IIS environment, especially if you try to tell this to windows-only developers that don't know anything else than the MS way to do things. The best they can hope for is to wait for the upcoming IIS 6 release, which includes session management that is meant to be used in webfarms (db based), or to try some non-Microsoft (still proprietary) solutions that try to do the same, like the frameWERKS framework, which functions as a drop-in replacement for the IIS session components.

I am not saying this to start an MS war on the list, but only to point out that when an MS-inclined person hears that he should "re-write the application", there's a 95% chance that this will be his last attempt to use lvs for his solutions. On the other hand, saying that there is a such-and-such solution that can help him will probably be considered...

(Alex has given an explanation of IIS session management below).

Tim Cronin tim (at) 13-colonies (dot) com 22 Nov 2002

We use IIS with sessions and LVS-NAT, and use wlc with persistence. The persistence time must match the IIS session timeout. We haven't had any problems, but we only have a 20 min session.
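
Note
A sketch of what Tim describes: an LVS-NAT web service whose persistence timeout (in seconds) matches a 20 min IIS session timeout. $VIP, $RIP1 and $RIP2 are placeholders:

ipvsadm -A -t $VIP:80 -s wlc -p 1200
ipvsadm -a -t $VIP:80 -r $RIP1:80 -m -w 1
ipvsadm -a -t $VIP:80 -r $RIP2:80 -m -w 1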

21.15. IIS session management: how it works

Alex Kramarov alex-lvs (at) incredimail (dot) com 22 Nov 2002

Microsoft's COM model is similar to the CORBA model. Generally, you have components, i.e. code that can be used from other applications. The concept is similar to using shared libraries, but still a little different. You can create an instance of the component and use it in a simple fashion with ASP (the IIS scripting language). Every time a new user calls for an asp file, a new session component is created, and can be accessed through asp scripting. This component can store data like a perl hash (session(valuename) = value). The data is stored in the memory space of the IIS process. Each session has a unique identifier that is remembered along with the data, and this identifier is maintained during the session by a cookie. On subsequent access by a client, the server looks up the data stored for this session, and makes it available as members of the session component.

A simple sample, access authorisation (this code goes at the top of a page you would like to secure):

' check if this user has identified already

If Session("UserID") <> "" Then
    ' check some conditions;
    ' if the check succeeds, mark the session as approved
    Session("UserID") = "approved"
Else
    ' otherwise do not allow the page to be shown
    response.write "authorization failed"
    response.end
End If

There are components available that will replace the default session component with one that will store the session data in a shared db, and only minimal modifications to the code are required, if any. Generally the session component is an implicit component the server provides. You could use your own component that does the same thing, and the only thing you would have to do is to initialize an instance and give it the unique identifier of the user, like this (purely fictional code)

mySession = CreateObject("my.own.session.component")
mySession.setUniqueId = server.request("some data, a cookie or other parameter").

When writing code for MS servers, one almost never deals with files, since the interface provided by MS for that purpose is very cumbersome, and on Windows, file locking problems are a very severe issue. With Unix, you can write a cgi that manipulates files, reading and writing to and from a dozen files while running. You would be crazy to do that on Windows. All data you want to store can be more or less conveniently stored using the session object if it is per-user data, or in the application object (more or less the same idea), which retains data through the life of the application (from the start to the stop of the http service). All data is in memory, hence it is fast. Long term data is always stored in databases.

I believe that the difference in perspective comes from the fact that in unix, you can have a bare bones system because of your security requirements, and then you want to write a small script that uses and stores some data, so you open some files and do that. On Windows, you CANNOT HAVE a bare bones system. From the initial install, you already have some file based db structure (comparable to db3), and all the database connectivity libraries, which you cannot remove, unless you are a windows guru and you start deleting system libraries one by one. (You would be crazy to do that, since there is absolutely no documentation of what each of the thousands of library files, which are installed by default, does.) All these libraries are a security risk, as is proven by all the buffer overflow vulnerabilities. But since windows developers regard DB connectivity as a standard component of their OS, they use it. (This is a marketing strategy of MS, to sell their bloated MS SQL server.)

Why doesn't the application keep its own state?

IIS assigns a unique identifier to be used in session management the first time a user accesses an asp file (even if you need it for only 1% of the pages on your site). This is completely transparent to the developer, and saves time in the development process. Writing apps where state is conserved manually (without sessions) is not as easy as it looks, and the mechanism provided by IIS is certainly convenient. Coding using "the Microsoft Way" for IIS took me 4 hours to learn, going through microsoft developer network articles. It is simple if you don't stray from the dictated path, but the second you do stray, it's hard to push something not designed by MS into their framework, and people are afraid of that.

IIS 6 includes the option to make the server session object store data in an odbc database, but it is still not released. 3rd party components that should do the task are commercially available, like the frameWerks session component, and it is pretty cheap, at $149.

I also believe that not a lot of sites need and use several realservers to serve a simple logical site, so this was never such an issue, till recently. Now Microsoft has woken up to the fact and is writing their own implementation for IIS 6, which will undoubtedly require the use of MS SQL server.

Mark Weaver mark (at) npsl (dot) co (dot) uk 23 Nov 2002

The default ASP (= MS web scripting) sessions simply use a cookie and store session state in server memory. The session is a dictionary object, and you just store a bunch of key-value pairs. Since the standard session object stores data in memory, it is not a lot of use for /robust/ load balancing.

.NET adds a component that stores session in a database. Such a component is pretty trivial to write, we have had one for a number of years. Storing on disk is a good option when there is no database, but since most of the sites that we have are pretty dynamic (i.e. most pages are generated from database calls), storing the session state in the DB is a good bet. I can probably release the source code for this if anyone is interested.

21.16. messing with the ipvsadm table while your LVS is running

This is an example of persistence with firewall mark (fwmark).

Bowie Bailey

If I start a service with:

ipvsadm -A -f 1 -s wlc -p 180

and then change the persistence flag (setting the persistence granularity netmask to /24 with the -M option) to

ipvsadm -E -f 1 -s wlc -p 180 -M 255.255.255.0

how does that affect the connections that have already been made?

Julian 30 Jul 2001

The connections are already established. But the persistence is broken, and after changing the netmask you can expect the next connections to be established to other realservers (not the same ones as before the change).

(also see persistence netmask).

If IP address 1.2.3.4 was connected to RIP1 before I changed the persistence and then 1.2.3.5 tries to connect afterwards, would he be sent to RIP1, or would it be considered a new connection and possibly be sent to either server since the mask was 255.255.255.255 when the first connection happened?

A new realserver will be selected.

unknown

Let's say I have 1000 http requests (A) through a firewall of a customer (so in fact all requests have the same source IP at the loadbalancer, because of NAT), then one request (B) from the Intranet, and then again 1000 requests (C) from that firewall. What does the LB do? I have three realservers r1, r2, r3 (ppc with rr)

a) A to r1, B to r2, C to r1 (because of SourceIP) [Distribution:2000:1:0.0000001]
b) A to r1, B to r2, C to r3 (because r3 is free) [Distribution:1000:1:1000]
c) A to r1, B to r2, C to r2 (due to the low load of r2) [Distribution:1000:1000:0.000001]
d) A to r1 && r2 && r3 (depending on source port),
   B to r1 || r2 || r3,
   C to r1 && r2 && r3 [Distribution: 667:667:666]

Ratz ratz (at) tac (dot) ch 12 Sep 1999

If C reaches the load balancer before all the 1000 requests of A expire, then the requests of C will be sent to r1, and the distribution is 2000:1:0.

If all the requests of A expire, the requests of C will be forwarded to a server that is selected by a scheduler.

BTW, persistent port is used to solve the connection affinity problem, but it may lead to dynamic load imbalance among servers.

Jean-Francois Nadeau

I will use LVS to load balance web servers (Direct Routing and WRR algo). I use persistence with a big timeout (10 minutes). Many of our clients are behind big proxies and I fear this will unbalance our cluster because of the persistence timeout.

Wensong

Persistent virtual services may lead to load imbalance among servers. Using some weight adaptation approaches may help avoid some servers being overloaded for a long time. When a server is overloaded, decrease its weight so that connections from new clients won't be sent to that server. When a server is underloaded, increase its weight.

Can we directly alter /proc/net/ip_masquerade?

No, it is not feasible, because directly modifying masq entries will break the established connection.

21.17. Persistence for multiport services

Persistence was originally used to handle multiport services (e.g. ftp/ftp-data, http/https). While persistence is still the best method for ftp with LVS-DR and LVS-Tun, http/https is better handled by persistence granularity with fwmark, as sketched below.
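
Here's a sketch of the fwmark approach for http/https: both ports get the same mark and the fwmark virtual service is made persistent, so a client moving from port 80 to port 443 lands on the same realserver. The mark value, timeout, $VIP and $RIP1 are placeholders.

# mark http and https to the VIP with the same fwmark
iptables -t mangle -A PREROUTING -p tcp -d $VIP/32 --dport 80 -j MARK --set-mark 5
iptables -t mangle -A PREROUTING -p tcp -d $VIP/32 --dport 443 -j MARK --set-mark 5
# one persistent virtual service for the whole group
ipvsadm -A -f 5 -s rr -p 360
ipvsadm -a -f 5 -r $RIP1:0 -g -w 1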

21.18. Proxy services, e.g. AOL

Clients from AOL or T-online access the internet via proxies. Because of the way proxies can work, a client can come from one IP for one connection (e.g. port 80) and from another IP for the next connection (e.g. port 443) and will appear to be two different clients. Since there is no relation between the CIP and the identity of the client, LVS cannot loadbalance by CIP. Usually these two connections will come from the same /24 netmask. Lars wrote the persistence granularity patch for LVS, which allows LVS to loadbalance all clients from a netmask as one group. If you set the netmask for persistence to /24 (with the -M option to ipvsadm), then all clients from the same class C network will be sent to the same realserver. This will mean that clients from AOL appear as a single (very active) client, and will likely take up all capacity on one realserver, leading to load imbalance on the realservers. This is as good as we can do with LVS.

Wensong

If you want to build a persistent proxy cluster, you just need to set up an LVS box in front of all the proxy servers, and use the persistent port option in the ipvsadm commands. BTW, you can have a look at wwwcache.ja.net/JanetServices/PilotServices.html "how to build a big JANET cache cluster using LVS" (link dead, May 2002).

If you want to build a persistent web service but some proxy farms are non-persistent at the client side, then you can use the persistence granularity so that clients can be grouped; for example if you use a 255.255.255.0 mask, the clients from the same /24 network will go to the same server.
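
Note
A minimal sketch of Wensong's second case, a persistent web service with /24 client granularity; $VIP, $RIP1 and $RIP2 are placeholders:

ipvsadm -A -t $VIP:443 -s wlc -p 360 -M 255.255.255.0
ipvsadm -a -t $VIP:443 -r $RIP1:443 -g -w 1
ipvsadm -a -t $VIP:443 -r $RIP2:443 -g -w 1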

Jeremy Johnson jjohnson (at) real (dot) com

How does LVS handle a single client that uses multiple proxies? For instance aol: when an aol user attempts to connect to a website, each request can come from a different proxy. So how (if at all) does LVS know that the request is from the same client and bind it to the same server?

Joe

If this is what aol does then each request will be independent and will not necessarily go to the same realserver. Previous discussions about aol have assumed that everyone from aol was coming out of the same IP (or same class C network). Currently this is handled by making the connection persistent, and all connections from aol will go to one realserver.

Michael Sparks zathras (at) epsilon3 (dot) mcc (dot) ac (dot) uk

If an ISP's users (eg AOL) go through a proxy array/farm then the requests are _likely_ to come from one of two possibilities:

  • A single subnet (if using an L4/L7 switch that rewrites ether frames, or using several NAT based L4/L7 switches)
  • A single IP (If using the common form of L4/L7 switch)

The former can be handled using a subnet mask in the persistence settings, the latter is handled by normal persistence.

*However* In the case of our proxy farm neither of these would work, since we have 2 subnet ranges for our systems (194.83.240/24 and 194.82.103/24), and an end user request may come out of either subnet, totally defeating the persistence idea... (in fact, dependent on our clients' configuration of their caches, the request could appear to come from the above two subnets, or the above 2 subnets and about 1000 other ones as well)

Unfortunately this problem is more common than might be obvious, due to the NLANR hierarchy, so whilst persistence on IP/subnet solves a large number of problems, it can't solve all of them.

Billy Quinn bquinn (at) ifleet (dot) com 05 Jun 2001

I've come to the conclusion that I need an expensive (higher layer) load balancer node, which load balances port 80 (using persistence because of sessions) to 3 realservers, each of which runs an apache web server and tomcat servlet engine. Each of the 3 servers is independent and no tomcat load balancing occurs.

This has worked great for about a year, while we only had to support certain IP address ranges. Now, however, we have to support clients using AOL and their proxy servers, which completely messes up the session handling in tomcat. In other words, one client comes from multiple different IP addresses based on which proxy server it comes through.

It seems the thing to do is to adjust the persistence granularity. However, if I adjust the netmask, all of our internal network traffic will go to one server, which kind of defeats the purpose.

What I'm concluding is, that I'll need to change the network architecture (since we are all on one subnet), or buy a load balancer which will look at the actual data in the packets (layer 7?).

Joe

There have been comments by people dealing with this problem (not many), but they seem to be still able to use LVS. We don't hear of anyone who is having lots of trouble with this, but it could be because no-one on this list is dealing with AOL as a large slice of their work.

If 1/3 of your customers are from AOL you could sacrifice one server to them, but it's not ideal. If all your customers are from AOL, I'd say we can't help you at the moment.

My concern with that would be anyone else doing proxying, now or in the future. I would not be opposed to routing all of the AOL customers to one server for now though. I guess we would have to deal with each case of proxying individually. I wonder how many other ISPs do proxying like that.

How many different proxy IPs do AOL customers arrive on the internet from? How many will appear from multiple IPs in the same session and how big is the subnet they come from? (/24?)

Good question, I'm not sure about that one. The customer that reported the problem seemed to be coming from about 2-4 different IP addresses (for the same session).

If AOL customers come from at least 3 of these subnets and you have 3 servers, then you can use LVS as a balancer.

Peter Mueller pmueller (at) sidestep (dot) com

Over here we also need layer-7 'intelligent' balancing with our apache/jakarta setup. We utilize two tiers of 'load-balancing'. One is the initial LVS-DR round-robin type setup, while the second layer is our own creation, layer-7. Currently we round-robin the first connection to one server, then that server calls a routine that will ask the second-tier layer-7 java monitor boxes which box to send the connection to. (If for some reason the second layer is down, standard round-robin occurs).

We're about 50% done with migration from cisco LD (yuck!) to LVS-DR. After the migration is fully complete the goal is to have the two layers interacting more efficiently and hopefully merged into one 'layer' eventually.. for example, if we tell our java-monitor second-tier controllers to shutdown a server, the first tier will then mark the node out of service automatically.

PS - we found the added layer-7 intelligent balancing to be about 30-50% (?) added effectiveness to cisco round robin LD.. I think the analogy of a hub versus a switch works fairly well here..

Chris Egolf cegolf (at) refinedsolutions (dot) net

We're having the exact same problem with WebSphere cookie-based sessions. I was testing this earlier today and I think I've solved this particular problem by using fwmarks.

Basically, I'm setting everything from our internal network with one FWMARK and everything else with another. Then, I set up the ipvsadm rules with the default client persistence for our internal network (/32) and a class C netmask granularity (/24) for everything from the outside to deal with the AOL proxy farms.

Here's the iptables script I'm using to set the marks:

iptables -F -t mangle
iptables -t mangle -A PREROUTING  -p tcp -s 10.3.4.0/24 -d $VIP/32 \
--dport 80 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING  -p tcp -s ! 10.3.4.0/24 -d $VIP/32 \
--dport 80 -j MARK --set-mark 2

Then, I have the following rules setup for ipvsadm:

director:/etc/lvs# ipvsadm -C
director:/etc/lvs# ipvsadm -A -f 1 -s wlc -p 2000
director:/etc/lvs# ipvsadm -a -f 1 -r $RIP1:0 -g -w 1
director:/etc/lvs# ipvsadm -a -f 1 -r $RIP2:0 -g -w 1

director:/etc/lvs# ipvsadm -A -f 2 -s wlc -p 2000 -M 255.255.255.0
director:/etc/lvs# ipvsadm -a -f 2 -r $RIP1:0 -g -w 1
director:/etc/lvs# ipvsadm -a -f 2 -r $RIP2:0 -g -w 1

FWMARK #1 doesn't have a persistent mask specified, so each client on the 10.3.4.0/24 network is seen as an individual client. FWMARK #2 packets are seen as a class C client network to deal with the AOL proxy farm problem. (for more on persistent netmask see the section in fwmark on fwmark persistence granularity).

Like I said, I just did this today, and based on my limited testing, I think it works. I'm thinking about maybe setting a whole bunch of rules to deal w/ each of the published AOL cache-proxy server networks (http://webmaster.info.aol.com/index.cfm?article=15&sitenum=2), but I think that would be too much of an administrative nightmare if they change it.

The ktcpvs project implements some level of layer-7 switching by matching URL patterns, but we need the same type of cookie based persistence for our WebSphere realservers. Hopefully, it won't be too long before that gets added.

Matthias Krauss MKrauss (at) hitchhiker (dot) com 2003-01-30

I turned on our lvs and it didn't take long for the phone to start ringing with AOL people. They are switching between proxies, with the result that the target webserver is different - we need it persistent.

Lars

The persistence netmask feature might help you, in exchange for lower granularity of the load balancing (but it shouldn't matter). However, all AOL users will then likely hit the same webserver. It just goes to show that IP addresses are unsuitable for identifying a single user ;-) The real fix would be to use layer7 switching based on the URL or even a cookie; alternatively, you could make your application less dependent on persistence, for example by storing your session data in a global cache/db, which would also make it easier for you to preserve sessions when a single webserver fails.

I now have the persistence netmask feature up and it seems to work fine. All the sender networks are forwarded to 1 RIP and the load share on all RIPs is nearly equal. The AOL users are still complaining and I've got the impression that aol has different netmasks on their proxies. I found a list at http://webmaster.info.aol.com/proxyinfo.html and used this info for my fwmarks. Here's my iptables list

#mark all packets from these networks to VIP:80 with fwmark=3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  64.12.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  153.163.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  195.93.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  198.81.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  198.81.16.0/21 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  198.81.26.0/26 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  202.67.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  205.188.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3

and for ipvsadm apply persistence granularity with fwmark.

#forward all packets with fwmark=3 with rr scheduler.
director:/etc/lvs# ipvsadm -A -f 3 -s rr -p 3600 -M 255.255.255.0
director:/etc/lvs# ipvsadm -a -f 3 -r $RIP1 -g -w 100

Of course there is no balancing anymore for the above nets, but fortunately we don't have many aol customers.

Alternately, we found another way, using p3p http headers, which aol offers/describes at http://webmaster.info.aol.com/headers.html

Note
P3P is W3C's Platform for Privacy Preferences.

Here's another set of postings from Dec 2003, this time using fwmark to aggregate all the traffic from AOL. As above, all connections from the proxy servers (i.e. all of AOL) will go to one realserver.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 30 Dec 2003

For AOL clients, I need to use persistent connections. AOL makes the IPs rotate very fast, using several /8 or /16 networks. Here's the list of AOL IPs for their clients

64.12.0.0 - 64.12.255.255
152.163.0.0 - 152.163.255.255
172.128.0.0 - 172.191.255.255
195.93.0.0 - 195.93.63.255
195.93.64.0 - 195.93.127.255
198.81.0.0 - 198.81.31.255
202.67.64.0 - 202.67.95.255
205.188.0.0 - 205.188.255.255

Can I handle this with fwmark?

Matthias Krauss MKrauss (at) hitchhiker (dot) com 30 Dec 2003

Here's the list of AOL proxies (http://webmaster.info.aol.com/proxyinfo.html).

#The proxy list from above

AOLPROXYS="64.12.96.0/19 152.163.188.0/21 152.163.189.0/21 152.163.194.0/21
152.163.195.0/21 152.163.197 152.163.201.0/21 \
152.163.204.0/21 152.163.205.0/21 152.163.206.0/21 152.163.207.0/21
152.163.213.0/21 152.163.240.0/21 \
152.163.248.0/22  152.163.252.0/23 195.93.32.0/22 195.93.48.0/22
195.93.64.0/19  198.81.0.0/22 198.81.8.0/23 \
198.81.16.0/21 198.81.26.0/23 202.67.64.0/21 205.188.178.0/21
205.188.192.0/21 205.188.193.0/21 205.188.195.0/21 \
205.188.196.0/21 205.188.197.0/21 205.188.198.0/21 205.188.199.0/21
205.188.200.0/21 205.188.201.0/21 \
205.188.208.0/21 205.188.209.0/21"

for aolproxys in $AOLPROXYS
do
  iptables -t mangle -A PREROUTING -p tcp -s $aolproxys -d VirtualIP/32 \
 --dport 80 -j MARK --set-mark 1
done

#-M is persistence netmask.
#It may not be needed since the netmask is already in the iptable command above
ipvsadm -A -f 1 -s wrr -p 3600 -M 255.255.255.0
ipvsadm -a -f 1 -r RealIP -g
#=> All listed AOL traffic is now going to VIP of machine with RealIP

I think you could concatenate some of the /21s. Regarding both whois and the AOL technical contact, I deduced:

AOLPROXYS="64.12.0.0/16 152.163.0.0/16 172.128.0.0/10 195.93.0.0/17 198.81.0.0/19 202.67.64.0/19 205.188.0.0/16"

Here is my entry for the fwmark rule in keepalived.conf. Note the string "fwmark 1" which replaces "VIP port" as used in the standard setup. (Presumably "fwmark 1" is just a string which is passed to ipvsadm.)

virtual_server fwmark 1 {
    delay_loop 20
    lb_algo rr
    lb_kind DR
    persistence_timeout 1800
    persistence_granularity 255.255.255.0
    protocol TCP
    virtualhost www.toto.com
    real_server 172.16.1.4 80 {
        weight 1
        HTTP_GET {
            url {
              path /index.jsp
            }
            connect_port 8083
            connect_timeout 10
            nb_get_retry 5
            delay_before_retry 20
        }
    }
}

#And here is it for the "standard users" :

virtual_server $VIP 80 {
    delay_loop 20
    lb_algo rr
    lb_kind DR
    persistence_timeout 1800
    persistence_granularity 255.255.255.0
    protocol TCP
    virtualhost www.toto.com
    real_server 172.16.1.4 80 {
        weight 1
        HTTP_GET {
            url {
              path /index.jsp
            }
            connect_port 8083
            connect_timeout 10
            nb_get_retry 5
            delay_before_retry 20
        }
    }
    real_server 172.16.1.5 80 {
        weight 1
        HTTP_GET {
            url {
              path /index.jsp
            }
            connect_port 8083
            connect_timeout 10
            nb_get_retry 5
            delay_before_retry 20
        }
    }
}

So, in one file, you have both your load balancing and your HA settings. Of course, you need to configure iptables by yourself. Also, it doesn't do anything on the realservers (but on the realservers, I just need to configure the VIPs and noarpctl things, which is easy).

# ipvsadm -Ln

FWM 1 rr persistent 1800 mask 255.255.255.0
  -> 172.16.1.4:80                Route   1     0          0

For an example of using fwmark with keepalived, see ./doc/samples/keepalived.conf.fwmark in the source directory.

Casey Zacek cz (at) neospire (dot) net 2005/04/08

Here's the current aol proxy list (http://webmaster.info.aol.com/proxyinfo.html), in a more raw format, but it changes occasionally (I just had to update my list when I went looking for them):

Note
Joe: most of these are not class C
64.12.96.0/19

149.174.160.0/20

152.163.240.0/21
152.163.248.0/22
152.163.252.0/23
152.163.96.0/22
152.163.100.0/23

195.93.32.0/22
195.93.48.0/22
195.93.64.0/19
195.93.96.0/19
195.93.16.0/20

198.81.0.0/22
198.81.16.0/20
198.81.8.0/23

202.67.64.128/25

205.188.192.0/20
205.188.208.0/23
205.188.112.0/20
205.188.146.144/30

207.100.112.0/21
207.200.116.0/23

21.19. key exchanges (SSL)

Persistence is required for SSL services, as keys are cached.

Francis Corouge wrote:

I made an LVS-DR lvs. All services work well, but with IE 4.1 on a secure connection, pages are received randomly. When you make several requests, sometimes the page is displayed, but sometimes a popup error message is displayed:

IE can't open your Internet Site <url>
An error occurred with the secure connection.

I did not test with other versions of IE, but netscape works fine. It works when I connect directly to the realserver (realserver disconnected from the LVS, and the VIP on the realserver allowed to arp).

Julian

Is the https service created persistent, i.e. using ipvsadm -p? I assume the problem is in the way SSL is working: cached keys, etc. Without persistence configured, the SSL connections break when they hit another realserver. It may be in the way the bugs are encoded. It also depends on how the SSL requests are performed (which we don't know).

Notes from Peter Kese, who implemented the first persistence (pcc) (this is probably from 1999).

The PCC scheduling algorithm might produce some imbalance of load on realservers. This happens because the number of connections established by clients might vary a lot. (There are some large companies for example, that use only one IP address for accessing the internet. Or think about what happens when a search engine comes to scan the web site in order to index the pages.) On the other hand, the PCC scheduler resolves some problems with certain protocols (e.g. FTP) so I think it is good to have it.

And a comment about load balancing using pcc/ssl (the problem: once someone comes in from aol.com to one of the realservers, all subsequent connections from aol.com will also go to the same server):

Lars (who about this time implemented persistence granularity, so this might be from 1999 too).

Let's examine what happens now with SSL sessions coming in from a big proxy, like AOL. Since they are all from the same host, they get forwarded to the same server - *thud*.

Now, SSL carries a "session id" which identifies all requests from a browser. This can be used to separate the multiple SSL sessions, even if coming in from one big proxy, and load balance them.

(unknown)

SSL connections will not come from the same port, since the clients open many of them at once, just like with normal http. So would we be able to differentiate all the people coming from aol by the port number?

No. A client may open multiple SSL connections at once, which obviously will not come from the same port - but I think they will come in with the same SSL id.

But like I said: really hard to get working, and even harder to get right ;-)

Wensong

No, not really! As far as I know, the PCC (Persistent Client Connection) scheduling in the LVS patch for kernel 2.2 can solve the connection affinity problem in SSL.

When an SSL connection is made (encrypted with the server's public key), port 443 for secure web servers and port 465 for a secure mail server, a key (session id) must be generated and exchanged between the server and the client. Later connections from the same client are granted by the server in the life span of the SSL key.

So, the PCC scheduling can make sure that once the SSL "session id" is exchanged between the server and the client, later connections from the same client will be directed to the same server in the life span of the SSL key.

However, I haven't tested it myself. I will download ApacheSSL and test it sometime. Anyone who has tested or is going to test it, please let me know the result, no matter whether it is good or bad. :-)

(a bit later)

I tested LVS with servers running Apache-SSL. LVS uses the VS patch for kernel 2.2.9, and uses the PCC scheduling. It worked without any problem.

SSL is a little bit different.

In use, the client will send a connection request to the server. The server will return a signed digital certificate. The client then authenticates the certificate using the digital signature and the public key of the CA.

If the certificate is not authentic the connection is dropped. If it is authentic then the client sends a session key (such as a) and encrypts the data using the server's public key. This ensures only the server can read it, since decrypting requires knowing the server's private key. The server sends its session key (such as b) encrypted with its private key; the client decrypts it with the server's public key and gets b.

Since both the client and the server get a and b, they can generate the same session key based on a and b. Once they have the session key, they can use this to encrypt and decrypt data in communication. Since the data sent between the client and server is encrypted, it can't be read by anyone else.

Since the key exchange and generation is very time-consuming, for performance reasons, once the SSL session key is exchanged and generated in a TCP connection, other TCP connections between the client and the server can also use this session key in the life-span of the key.

So, we have to make sure that connections from the same client are sent to the same server in the life-span of the key. That's why the PCC scheduling is used here.

21.20. About longer timeouts

felix k sheng felix (at) deasil (dot) com and Ted Pavlic

2. The PCC feature... can I set the persistent connection timeout to something other than the default value (I need to maintain the client on the same server for 30 minutes at maximum)?

If people connecting to your application will contact your web server at least once every five minutes, setting that value to five minutes is fine. If you expect people to be idle for up to thirty minutes before contacting the server again, then feel free to change it to thirty minutes. Basically remember that the clock is reset every time they contact the server again. Persistence lasts for as long as it's needed. It only dies after the amount of seconds in that value passes without a connection from that address.

So if you really want to change it to thirty minutes, check out ip_vs_pcc.h -- there should be a constant that defines how many seconds to keep the entry in the table. (I don't have access to a machine with IPVS on it at this location for me to give you anything more precise)

I think this 30 minute idea is a web specific timeout period. That is, default timeouts for cookies are 30 minutes, so many web sites use that value as the length of a given web "session". So if a user hits your site, stops and does nothing for 29 minutes, and then hits your site again, most places will consider that the same session - the same session cookies will still be in place. So it would probably be nice to have them going to the same server.
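
Note
With a current ipvsadm there is no header file to edit; the persistence timeout is just the argument to -p, in seconds. A sketch matching the 30 minute web session discussed above ($VIP and $RIP1 are placeholders):

ipvsadm -A -t $VIP:80 -s wlc -p 1800
ipvsadm -a -t $VIP:80 -r $RIP1:80 -g -w 1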

21.21. passive ftp and persistence

Wensong

Since there are many messages about the passive ftp problem and the sticky connection problem, I'd better send a separate message to make it clear.

In LinuxDirector (by default), we have assumed that each network connection is independent of every other connection, so that each connection can be assigned to a server independently of any past, present or future assignments. However, there are times that two connections from the same client must be assigned to the same server either for functional or for performance reasons.

FTP is an example of a functional requirement for connection affinity. The client establishes two connections to the server: a control connection (port 21) to exchange command information, and a data connection (usually port 20) to transfer bulk data. For active FTP, the client informs the server of the port that it listens on, and the data connection is initiated by the server from the server's port 20 to that client port. LinuxDirector can examine the packets coming from clients for the port that the client listens on, and create an entry in the hash table for the coming data connection. But for passive FTP, the server tells the client the port that it listens on, and the client initiates the data connection by connecting to that port. For LVS-Tunneling and LVS-DRouting, LinuxDirector only sees the client-to-server half of the connection, so it is impossible for LinuxDirector to get the port from the packet that goes directly to the client.

SSL (Secure Socket Layer) is an example of a protocol that has connection affinity between a given client and a particular server. When an SSL connection is made (port 443 for secure web servers and port 465 for secure mail servers), a key for the connection must be chosen and exchanged. Later connections from the same client are accepted by the server within the lifespan of the SSL key.

Our current solution to client affinity is to add persistent client connection (PCC) scheduling to LinuxDirector. With PCC scheduling, when a client first accesses the service, LinuxDirector creates a connection template between the given client and the selected server, then creates an entry for the connection in the hash table. The template expires after a configurable time, and won't expire while it still has connections. Connections for any port from that client will be sent to the same server until the template expires. Although PCC scheduling may cause slight load imbalance among the servers, it is a good solution to connection affinity.

The configuration example of PCC scheduling is as follows:

director:/etc/lvs# ipvsadm -A -t <VIP>:0 -s pcc
director:/etc/lvs# ipvsadm -a -t <VIP>:0 -R <your server>

BTW, PCC should not be considered a scheduling algorithm as such; it should be a feature of the virtual service port, i.e. whether the port is persistent or not. I will write some code later to let the user specify whether a port is persistent or not.
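
(This is what later happened: in current ipvsadm, persistence is the -p flag on the virtual service and any scheduler can be used. A sketch of the same all-port persistent service in current syntax, assuming the wlc scheduler and LVS-DR forwarding:)

director:/etc/lvs# ipvsadm -A -t <VIP>:0 -s wlc -p
director:/etc/lvs# ipvsadm -a -t <VIP>:0 -r <your server> -g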

21.22. The Persistence Template (about port 0)

Note
Information about a persistent connection is stored in the Persistence Template. The essential difference between the Persistence Template and a regular connection entry in the hash table is that the source port from the client is set to 0 (i.e. CIP:0). Thanks to Karl Kopper for off-line discussions which got me started here.

Horms 06 May 2004

A persistence template is just like a connection entry. It uses the same data structure and is stored in the same hash table. The only difference is that the source port is set to 0 (i.e. CIP:0) so that the entry can be identified as a persistence template. This means that it will never match a hash-table lookup for a connection entry, and a connection entry will never match the lookup for a persistence template which is made in the scheduling code.

The purpose of a persistence template is, in a nutshell, to effect persistence. When a connection is started for a persistent virtual service, the persistence template is looked up. If it exists then it is used - that is, the connection will be forwarded to the same realserver as the previous corresponding connection. Otherwise the connection is scheduled, just like a connection for a non-persistent virtual service, and the persistence template is created.

Like connection entries, persistence templates have timeouts. Again, this is handled by the same code. The only difference is that for persistence templates the timeout is set to the persistence timeout configured using ipvsadm, whereas for connection entries the timeout depends on the connection's state.

Let's say I make VIP:https persistent. There will be an entry in the hash table for VIP:https and another entry for VIP:0 in the persistence template.

No. When a connection from an end-user with CIP1 comes in, a persistence template will be created for CIP1:0 (port 0, a port number not used by real connections). A connection entry will also be made for CIP1:ephemeral_source_port (i.e. the real port >1023 that the client is coming from).

How does ipvsadm know to make only VIP:https persistent?

I am not sure that I understand what you are getting at here. When you configure a virtual service using ipvsadm you can mark it as persistent. This sets a flag in the kernel for the virtual service, which is checked by the LVS scheduler so that it knows whether to treat the connection as persistent or not.

What happens if you use ipvsadm to enable persistence on all ports (i.e. VIP:0)? Now do you have a connection table entry with VIP:0 and a persistence template entry with CIP:0 and VIP:0?

I could check the code, but I don't think there is a problem at all. The virtual service entry may have VIP:0, but the connection entry (and persistence template) that is created will have VIP:XXX, where XXX is the destination port of the connection.

All the connection entries and persistence templates are stored in a hash table. To retrieve an entry from the hash table, first the hash key is generated (how is not particularly relevant here) and then that bucket is searched for the matching entry. A match checks various values, including the source port. As the following property holds, a persistence template can never be confused with a connection entry:

Persistence Template: Source Port = 0
Connection Entry:     0 < Source Port < 2^16

In other words, if you are looking for a persistence template then your search will always be for something with Source Port = 0, but if you are looking for a connection entry you will always be looking for something with Source Port != 0.

Thus there is no ambiguity, despite both types of entries using the same data structure and being stored in the same hash table.

New Connection comes in
look up Virtual Service
is Virtual Service persistent?
   No  -> call Scheduler to allocate a RealServer for the Connection.
          use result from Scheduler to create Connection Entry
   Yes -> does Persistence Template exist?
          No  -> call Scheduler to allocate a RealServer for the Connection.
                 use result from Scheduler to create Persistence Template.
          (either way, a Persistence Template now exists)
          use Persistence Template to create Connection Entry
forward packet using Connection Entry

Just think of persistence templates as special connection entries: entries that affect which realserver subsequent connections from a given end-user are allocated to, rather than affecting which realserver packets for a current connection are sent to.

Timeouts just control how long a connection entry or persistence template is valid for; otherwise they would live in the kernel forever.
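
(You can watch the persistence templates, and their timeouts counting down, alongside the connection entries with ipvsadm -Lcn; the templates are the entries with a client source port of 0. A sketch for picking them out of the listing, assuming the usual pro/expire/state/source/virtual/destination column layout:)

director:/etc/lvs# ipvsadm -Lcn | awk 'NR<=2 || $4 ~ /:0$/'    # headers plus persistence templates only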

Guy Waugh, Nov 18, 2003

In my LVS-NAT system (IPVS-1.0.9 + ldirectord), I have an Oracle server on the inside (web-db1) that primarily services the two realservers within the LVS. However, I also have a webserver (www1) on the VIP side of the network whose apache processes make Oracle connections through to the Oracle server on the inside of the LVS. To allow this, I have the Oracle listener service (port 1521) as an LVS service, with persistence set to 25200 seconds (7 hours).

I'm noticing a couple of different types of connections from www1 to the Oracle listener port on the VIP: one with a source port of 0, and one with a random source port, like so (the VIP is 'learn'):

[root@lvs1 gwaugh]# ipvsadm -Lc
IPVS connection entries
pro expire state       source    virtual     destination
TCP 419:41 NONE        www1:0    learn:1521  web-db1:1521
TCP 01:38  TIME_WAIT   www1:2509 learn:1521  web-db1:1521
TCP 01:43  TIME_WAIT   www1:2560 learn:1521  web-db1:1521

Connections with a source port of 0 take on the persistence of 25200 seconds (as I have specified in ldirectord.cf), but connections out of a non-zero source port take on a persistence of 15 minutes (900 seconds). I see from http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.persistent_connection.html that:

  • For LVS persistence, the client is recognised by its IP (CIP) or in recent versions of ip_vs, by CIP:dst_port (i.e. by the CIP and the port being forwarded by the LVS). If only the CIP is used to schedule persistence, then the entries in the output of ipvsadm will be of the form VIP:0 (i.e. with port=0), otherwise the output of ipvsadm will be of the form VIP:port.

Can anyone tell me why I get both types of connections (source port 0 and source port non-zero)? Perhaps the 'source port 0' connection is some sort of 'master' connection, and the 'source port non-zero' connections are some sort of 'slave' connections?

What I'm really wondering is if it is possible to effectively make the persistence for this connection infinite? Perhaps I shouldn't use LVS to do this, but should use iptables instead...?

The problem underlying all this is that some apache processes on www1 seem to lose their Oracle connection over time, so any client hitting www1 who happens to get serviced by an apache process that has lost its Oracle connection gets Oracle connection errors all over the page. I see from http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.services.single-port.html#tcpip_idle_timeout that one can set TCP idle timeouts for connections with ipvsadm - perhaps this is what I should be doing?

Horms 18 Nov 2003

I think that you are confusing the concepts of persistence and connection timeouts. Persistence affects which realserver LVS will choose for new connections. If persistence is in effect and the persistence timeout has not expired, then the same realserver will be used for subsequent connections from the same CIP. But in your case you only have one realserver, so persistence is a moot point.

You are correct in asserting that the CIP:0 entry you see is a master entry. In the code it is actually referred to as a template. When a new connection comes in, LVS looks for VIP:Vport+CIP:0. If it is present then it will use the attached RIP:Rport. If not, it just chooses one of the available realservers as per the scheduling algorithm that is in effect. But again this is a moot point, as you only have one realserver.

The CIP:0 entry does not actually represent a connection at all, just a template for creating new connections. Its timeout should be set to the persistence timeout each time the template is used to create a new connection.

The other entries are the connections themselves. Their timeouts are set by the various timeouts that can be manipulated through /proc/sys/net/ipv4/vs/timeout_*. This is where the value of 900 seconds comes from. __It has nothing to do with persistence.__

As per the HOWTO entry you listed above, some of these values can also be manipulated using ipvsadm --set.
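
(A sketch: ipvsadm --set takes the tcp, tcpfin and udp idle timeouts in seconds, so to raise the 900 second established-TCP idle timeout mentioned above to 2 hours - the tcpfin and udp values here are just placeholders:)

director:/etc/lvs# ipvsadm --set 7200 120 300    # tcp tcpfin udp idle timeouts in seconds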

21.23. persistent clients behind a proxy or nat box

Jan Bruvoll 15 Aug 2005

I need to take a realserver out of one of my VS configs temporarily, and I have tried doing this by setting the weight of that particular realserver to 0. However, nothing is happening - the server is still receiving connections as it did before I made the adjustment. In fact, since I issued the command the number of active connections has actually increased, much in line with the distribution across the other servers in the same group, leading me to think that the config hasn't kicked in at all (the number of connections still looks very much as if a weight of 5 were active). I've tried setting the persistence for this particular group to 5 seconds, without any noticeable effect.

Joe

Once they're connected they're connected, it doesn't matter what the timeout is. Still the number of connections shouldn't increase after you've changed the weight to 0.

Hm ok - my reasoning for doing that is that these clients are relatively long-lived SSL-based connections from an in-house application to our server park, and that by setting the persistence to 5 seconds, only connections that "come back" within 5 seconds of disconnecting from this particular server (for whatever reason - Apache timeout, client disconnection, network problems, etc.) would be directed to the same server; if not, they would hopefully be directed to one of the servers with weight>0.

Horms

This is a fairly simple problem, that is unfortunately difficult to explain. Let me try:

When you set a realserver to be quiescent (weight=0), no new connections will be allocated to that realserver by the scheduler. However, if you have persistence in effect (which you do), and a new connection is received from an end-user that recently made a connection, then that connection will be allocated to the same realserver as the previous connection. The trick is, this process bypasses the scheduler, and thus bypasses quiescence.

So, for a persistent service a new connection is processed a bit like this:

if (same end-user as a recent connection)
        use the real-server of that connection
else
        choose a non-quiescent real-server using the scheduler

Obviously this is a bit of a problem, for the reason you describe in your email. In implementation terms the problem is that when a connection for a persistent service is scheduled, a persistence template is created with a timeout of the persistence timeout. This template is then used to select the realserver for subsequent connections from the same end-user. It stays in effect until its timeout expires, and its timeout is renewed every time a packet is received for an associated connection. Which means, in the case of quiescence, that as long as end-users with active persistence templates keep connecting or sending packets within the persistence timeout, the realserver will keep getting connections.

The solution to this is quite simple. The patch at the URL below, which has been included in recent kernel versions, adds expire_quiescent_template to proc. By default it is set to 0, which gives the behaviour described above, the historical behaviour of LVS (which, I might add, can be desirable in some situations). However, if you set it to 1, then connection templates associated with a quiesced realserver are expired at lookup time. Which, in a nutshell, means that the "if" condition above will always fall through to the "else" clause, and thus quiescence is not bypassed.

To effect this change just run the following as root

echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template

The change affects new connections immediately.

Or, on systems that have sysctl (see http://archive.linuxvirtualserver.org/html/lvs-users/2004-02/msg00224.html), add the following line to /etc/sysctl.conf

net.ipv4.vs.expire_quiescent_template = 1

and then run

sysctl -p

This will also take effect immediately, and has the advantage that the change will persist across reboots.
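
(Putting the pieces together, a sketch of quiescing a realserver so that persistent clients drain; the VIP, port, RIP and LVS-DR forwarding are placeholders, not from the original exchange:)

echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template   # expire templates pointing at quiesced realservers
ipvsadm -e -t <VIP>:443 -r <RIP> -g -w 0                   # set the realserver's weight to 0 (quiesce it)
watch -n 1 ipvsadm -Lcn                                    # watch the entries for <RIP> expire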

Jan

Let me just check if I understand this correctly (using our current set-up):

Our original persistence was set to 360 seconds, intended to be longer than the expected recurring request frequency of our application, which checks in with our server cluster every 300 seconds ("ish"). If I keep the original persistence, any client already known to the cluster requesting data from the cluster again -before- the 360 seconds expire for that particular client "id" will trigger a persistence counter reset for that client.

Horms

Yes. The request could be opening a fresh connection, or it could be the end-user sending (for any LVS forwarding mechanism) or receiving (in the case of LVS-NAT) data on an existing connection. You can see the persistence templates, and the progress of their timeouts, amongst the other connection entries if you run ipvsadm -Lcn. The persistence entries are the ones with a client port of 0.

Jan

However, if the weight for a particular real-server is set to 0, no -new- clients should be allocated to this realserver

Horms

Yes, where new clients are ones without a persistence template entry.

Jan

and any clients not "coming back" within the 360 seconds should be removed from the persistence map, and any new requests from same clients after being removed should be allocated to one of the other realservers

Horms

Yes. Though "going away" basically means no packets for existing connections, and no attempt to open a new connection.

Jan

Since writing, I have tried resetting the persistence manually to 5 seconds, in order to try to flush the persistence "map" more quickly. This hasn't had any perceivable effect: the number of connections to this server, as I write now, still reflects the original weight (some 18 hours after setting the weight to 0).

Horms

OK, obviously waiting 18 hours for connections to flush is impractical. What you are seeing is probably the result of either a bug in LVS, or very enthusiastic end-users (I know they are just programs, but hey) that send or receive packets from the virtual service (and thus the realservers) at least once every 5 seconds. Some examination of what is happening should shed some light on this: watch -n 1 ipvsadm -Lcn

Jan

how can it be that the number of active connections actually increases on the realserver whose weight is 0?

Horms

This is quite possible if a known end-user (i.e. one that has a persistence template because it has sent or received data within the last 5 seconds, your timeout) opens a second connection and no connections are closed. I'm not sure if this is actually what is happening; again, ipvsadm -Lcn may help to show what is going on.

What you are seeing is a bit strange, and hopefully you can diagnose exactly what is going on. But please consider setting /proc/sys/net/ipv4/vs/expire_quiescent_template to 1, as it should give behaviour that better suits your needs.

Jan

Dawning suspicion here - if a connection some time ago triggered the creation of a persistence template with the 360 second timeout, that template would actually stick around for as long as this client keeps coming back to access the cluster - i.e. if I change the persistence of the virtualserver to, say, 5 seconds, that would only apply to -new- connections from clients previously "unknown" to the cluster, and the already existing template would only expire if the client goes away for more than 360 seconds, the original timeout?

Actually, I've given this some thought and I think I understand why this number can increase - the template is IP-address specific only, so if new clients appear from behind an IP that is currently "active", i.e. has a persistence template allocated to it, they will also be allocated to the already quiesced server. So, although I have assigned a weight of 0, the persistence templates won't expire until all traffic from that IP subsides and stays away for longer than $persistence seconds after the last connection closed. Cumbersome, but at least I can understand what's going on.

Joe

a bunch of different clients are coming out of a NAT box? like the AOL proxy farm problem?

Well - similar, but not as acute a scale. In this case we're talking about a client talking HTTPS to our servers, and several users behind the same broadband connection (typically at home), or several users behind the same leased line at work, could be connecting - and thus get caught in the same "bucket".

Horms

Good point, yes I am pretty sure that is how it works.

While I am at it: this seems a little odd, given that I have never set anything but persistence timeouts of either 360 seconds or 5 seconds:

app-2 ~ # ipvsadm -Lcn|grep 10.42.0.202|grep x.y.z.w
TCP 01:02  NONE        x.y.z.w:0   ext-ip:443 10.42.0.202:443
TCP 10:32  ESTABLISHED x.y.z.w:4254 ext-ip:443 10.42.0.202:443

How should I interpret that ~10 minute expiry timeout? (I have "worse" ones too, all the way up to close to 20 minutes.)

Horms

I'm not sure, but it's probably not a problem, as once the connection changes out of the ESTABLISHED state, a fresh timeout will be assigned.

21.24. Rogue clients hidden by persistence

Leon Keijser errtu (at) gmx (dot) net 14 Dec 2005

This morning when I did an 'ipvsadm -ln' I saw something weird:

rpzlvs01 root # ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.50.10:3389 wlc persistent 43200
  -> 192.168.50.12:3389           Route   1      912        0         
  -> 192.168.50.15:3389           Route   1      22         0         
  -> 192.168.50.13:3389           Route   1      22         0         
  -> 192.168.50.16:3389           Route   1      21         0         
  -> 192.168.50.11:3389           Route   1      20         1         
  -> 192.168.50.18:3389           Route   1      21         0         
  -> 192.168.50.17:3389           Route   1      624        0         
TCP  192.168.50.120:1494 wlc persistent 43200
  -> 192.168.50.121:1494          Route   1      2          0         
  -> 192.168.50.122:1494          Route   1      3          0         
TCP  192.168.51.202:22 wlc
  -> 127.0.0.1:22                 Local   1      0          0        

912 and 624 connections? When I check on the realservers, everything seems normal. I have 8 clients on one server, 30 on another; maybe this is because LVS thinks it already has 900+ connections there, and shouldn't route anyone to those servers anymore. The logfiles don't show anything abnormal either.

Graeme Fowler graeme (at) graemef (dot) net

You're using persistence, which is probably a clue... What does ipvsadm -Lnc tell you? That'll list the connections, so you should be able to see which clients are causing you the problem. You can grep the output for "ESTABLISHED" and/or "NONE" to see the active and persistent entries respectively.
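
For example (a sketch; the grep patterns just match the state column of the ipvsadm -Lnc output):

ipvsadm -Lnc | grep ESTABLISHED    # active connections
ipvsadm -Lnc | grep NONE           # persistence template entries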

Yep. I saw 2 IP's that occur several times (okay, several hundred times)

Bear in mind that they may not be *your* clients. This could, in theory at least, be caused by something rogue.

Unfortunately they are my clients, and they are the linux-based thin clients I deployed as a side project. It turned out that they were hardcoded to use one of our domain controllers (which died last night), and they kept trying to connect to the cluster.

I'd guess you have a machine (or more than one) in your client base which is broken in some way. Which way I'll leave to you to find, but as these are RDP connections and the most likely clients are Windows machines...

Found them. Fixed them.

21.25. Long (1 day) persistence to windows terminal servers

Here I summarise a lengthy exchange with Joseph T. Duncan duncan (at) engr (dot) orst (dot) edu starting at http://marc.theaimsgroup.com/?l=linux-virtual-server&m=115706140606154&w=2 on 30 Aug 2006. Joseph uses SNMP to monitor his LVS.

Joseph's realservers are windows boxes at a university which serve any windows app at all (e.g. statistics, word processing). Undergraduate students mostly work interactively doing homework, while faculty and grad students start lengthy jobs and return later.

The user can

  • logout of the realserver: the realserver then closes all of their applications, stops their desktop session and correctly does a fin-ack packet exchange ending the session.
  • 'disconnect' (a windows option) from the realserver: the realserver puts their session into a disconnected state. Their desktop and applications are still running. A fin-ack packet exchange happens, ending that tcp connection. After 10 minutes (adjustable) the realserver auto-logs off any 'disconnected' sessions, shuts the applications down and kills the desktop. This is called a clean disconnect.

    If the client reconnects within this 10 minute window, the director should point them back at their disconnected session. The realserver will then give them their running desktop and applications back.

    If the client reconnects after the 10 minute window, the director should balance them as a new session, since they no longer have a desktop+applications running on any realserver.

  • The client shuts down and/or closes inappropriately: the realserver sits there and leaves the desktop and applications active. Now here is where weirdness happens.

    • In testing, if the client computer is a linux box and is still accessible, the realserver will close out the connection (broken connection tcp handshake attempt? handled correctly?)
    • In testing, if the client computer is any of the windows xp workstations my department maintains, and is still accessible or becomes accessible later (reboot, powered down and then back on later, etc.), the realserver will close out the connection (broken connection tcp handshake attempt? handled correctly?)
    • If the client computer is one of the weird 1/3 of my customers' home/office/whatever computers (ones with who knows what on them, or configured who knows how), the desktop+apps just sit there running on the realserver, with no closing of the connection. There is an idle-session auto-logout setting on the realservers (here 1 day), after which the realserver kills off the active-session (but idle-connection) desktop+applications.

      These clients are the troublesome ones, because if they log in a few times, they will wind up with an active but idle desktop+application session on each realserver. Two bad things happen.

      • applications can be running full bore (think long batch-type jobs that use 100% of a cpu; user processes are limited to a single cpu on the realservers, with 4 cpus available per realserver). This isn't bad in itself, but come finals week, each such user could be eating up a cpu.
      • last write wins: if a client had something open in the orphaned session on a realserver, then gets a new desktop session on a different realserver, makes changes to a document, logs out correctly, and then the orphaned session dies/closes, they might lose work or have their windows profile corrupted.

      The terminal server session/application does not know whether a disconnect is clean or not. All it does is start recording idle time from the last keyboard/mouse input received. Applications running inside the terminal server session (e.g. m$ word) usually have no idea whether they're running on a desktop or a terminal server.

      I could make the active-but-idle timeout on the realservers much lower, but that would lead to unhappy professors who stay logged in overnight.

      It's not easy to look for idle sessions (in order to kill them). I don't know how to test for an idle connection on the director. There is a windows management tool that reports idle time, but I am not aware of a mib/snmp way to export that information. There are some wireless labs behind a nat-proxy, and they all come out of the same 10.x.x.x IP, so you can't test for multiple connections from the same IP.

Microsoft has a built-in "network load balancing" based purely on network traffic between boxes. To make it more robust, they have something called "session directory" that, upon authentication/identification, will redirect an incoming connection to the appropriate box in the "network load balancing" cluster.

This style of load balancing is fine for certain applications, and is closer to an L7 approach, as your session location is determined by your login/id instead of your IP.

This method relies on lots of bandwidth: each box participating in a "network load balancing" setup receives a copy of everything and has to pick out what it's going to process (all participating machines share a fake slaved mac address and IP; the shared mac address being m$'s solution to the arp problem? I dunno).

However for terminal servers it is no good: I need to account for %free cpu and %free memory as important metrics. E.g. with one realserver at 400% load (4 cpus at 100%) and 30kbs of network traffic from 4 users, and another realserver at 80% load and 900kbs of traffic from 10 users, I would want new users to land on the server with the lower cpu load, as I am not really bandwidth-bound until I get above 1gb/sec. The windows way does not take any cpu/memory metrics into account.

When there was no LVS (i.e. a single terminal server), if a user had a dirty exit then either they were killed off after 1 day of idle time, or, if they reconnected, they got their session back. Thus with LVS I need persistence of a little more than 1 day, to make sure clients reconnect to their last realserver.
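
(A sketch of such a setup; the wlc scheduler and LVS-DR forwarding mirror the ipvsadm listing in the rogue clients section above, the VIP and RIP are placeholders, and 90000 seconds is just "a little more than 1 day":)

director:/etc/lvs# ipvsadm -A -t <VIP>:3389 -s wlc -p 90000
director:/etc/lvs# ipvsadm -a -t <VIP>:3389 -r <RIP> -g -w 1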