LVS-Tun allows your realservers to be on different networks from your director, i.e. anywhere, including on different continents.
This section is mostly ideas on geographically distributed serving, not all of it LVS.
Even better would be the ability to determine the closest realserver (by some internet metric) to any particular client. This has been done, at least for http, by Horms (of Ultra Monkey fame) with the Super Sparrow Project. Super Sparrow works differently from, and is incompatible with, LVS; because it already works, it is correspondingly less likely that anyone will go to the pain of developing an LVS-compatible version of geographical load balancing.
Super Sparrow uses zebra to determine the number of AS hops between client and server using BGP4 routing information. Documentation on BGP is hard to find (the early zebra docs had none). Horms suggests "BGP Routing Part 1" by Avi Freedman of Akamai (http://www.netaxs.com/~freedman/bgp/bgp.html). It's somewhat Cisco-centric and there is no part 2 yet, but it is applicable to zebra. This site disappeared in Jul 2002 (look for cached versions), but Avi Freedman has his own webpage with some BGP links and a note that he's writing a book on BGP.
Note: 2004: The documentation for zebra and dynamic routing is much better - see Dynamic Routing to handle loss of routing in directors.
Horms 30 Aug 2004
I have some code which creates a small routing daemon that gets all its data from a list of routes you provide at run time. It's at ssrs (http://cvs.sourceforge.net/viewcvs.py/supersparrow/supersparrow/ssrs/). I also have some code to help generate the list of routes from BGP dumps from (Cisco) routers, at inet_map (http://cvs.sourceforge.net/viewcvs.py/vanessa/scratch/inet_map/). And I have a patch to Bind 9 to add supersparrow support (http://www.supersparrow.org/download/wip/bind9/). All this needs a bit of polish, as apart from hacking on it for my own personal use I haven't done any work on it for a while. It does work, and is actually used for www.linuxvirtualserver.org
Fortunately the format of the /etc/ssrs.routes file is simple. Each line has a prefix followed by an AS number e.g.
212.124.87.0/24 1234
213.216.32.0/19 1235
195.245.215.0/24 7
195.47.255.0/24 7
217.76.64.0/20 1234
193.236.127.0/24 1234
All the prefixes are stored in a red-black tree and the most specific prefix for the request takes precedence. If you have
212.124.87.0/24 1234
212.124.87.1/32 7
Then if you look up the AS for 212.124.87.1 you will get 7. If you look up the AS for 212.124.87.2 you will get 1234.
You can telnet to the ssrsd daemon; it will ask you for a password but doesn't actually check it, so just put in whatever you like - I should probably fix that up too :)
Josh Marshall Aug 09, 2004
I've been looking into implementing something like supersparrow to get high availability / fastest connection for our web servers. We have some servers in Australia, some in the USA and some in Holland. I'm interested in the DNS method of getting the closest server for the connecting client, so that we don't have to do http redirects and have multiple webnames configured. That's a bit further along.
I'm wondering if I need to have a bgp daemon with a public AS number to be able to get the information needed to determine the best path for the client. I have done some tests and read loads of documentation but am not sure how to get the information without having a public AS number. The supersparrow documentation describes what appears to be an internal solution so doesn't show whether this is possible or not.
Horms 14 Oct 2004
The way that supersparrow was designed is that you have access to BGP information for each site that you have servers located at. You do not need a public AS number to get this information, however you do need _read only_ access to your provider's BGP information. Unfortunately this can be difficult to get your hands on.
I guess I'm also wondering whether I should be looking at supersparrow - I know that the software was written a few years ago, but given the idea behind it and the small amount of processing it needs to do, I can imagine it doesn't need to be actively maintained.
Yes it does have that appearance. But I am actually in the process of sprucing it up a lot. Most of what I have so far is in the cvs repository http://sourceforge.net/cvs/?group_id=10726, http://www.vergenet.net/linux/vanessa/cvs.shtml. About the only thing of note still missing is the patch for bind 9 http://www.supersparrow.org/download/wip/bind9/ . But please feel free to play with what is there.
If anyone has any advice as to what I can do, to get the best path information with (or without) bgp without having a public AS number I'd really appreciate it.
I have been toying with a few ideas to cope with not being able to get access to BGP at colocation sites. One of the ideas that I had was to provide a static list of networks and what site they should map to. I implemented this as ssrsd, which is in the CVS tree now. ssrsd understands that for instance 10.0.0.0/25 is part of 10.0.0.0/18 and will choose the site listed for 10.0.0.0/25 over the one for 10.0.0.0/18. Of course you still have to create the list somehow and at this stage it isn't at all dynamic. But it can work quite well.
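A minimal /etc/ssrs.routes sketch of the overlapping-prefix case described above (the AS numbers here are just placeholders for two sites, not from a real setup):

# site A = AS 64600, site B = AS 64601 (hypothetical)
10.0.0.0/18  64600
10.0.0.0/25  64601
# a lookup for 10.0.0.100 matches the more specific /25 and maps to 64601;
# 10.0.32.1 only matches the /18 and maps to 64600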
However, many people have thought about geographically distributed LVSs. For historical completeness, here are some of their musings.
Michael Sparks zathras (at) epsilon3 (dot) mcc (dot) ac (dot) uk 2000-03-08
I'm curious about the physical architecture of a cluster of servers where "the realservers have their own route to the client" (like in LVS-DR and LVS-Tun). How have people achieved this in real life? Does each realserver actually have its own dedicated router and Internet connection? Do you set up groups of realservers where each group shares one line?
Nissim
You could do it this way, or you can share resources. We've got 3 LVS-based clusters, based around LVS-Tun. The reason for this is that one of the clusters is at a different location (about 200 miles from where I'm sitting), and this allows us to configure all the realservers in the same way, thus:
tunl0:1 - IP of LVS balanced cluster1
tunl0:2 - IP of LVS balanced cluster2
tunl0:3 - IP of LVS balanced cluster3 (remote)
The only machines that end up being configured differently are the directors.
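For a realserver in this sort of layout, the tunnel setup might look roughly like the following sketch (the VIPs are placeholders, and the usual LVS-Tun ARP-problem handling still has to be done as well):

# load the IPIP module and bring up the tunnel device on the realserver
modprobe ipip
ifconfig tunl0 up
# one alias per balanced cluster, matching the layout above (hypothetical VIPs)
ifconfig tunl0:1 10.1.1.1 netmask 255.255.255.255 up   # VIP of cluster1
ifconfig tunl0:2 10.1.2.1 netmask 255.255.255.255 up   # VIP of cluster2
ifconfig tunl0:3 10.1.3.1 netmask 255.255.255.255 up   # VIP of cluster3 (remote)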
So whilst machines are nominally in one of the three clusters, if (say) the remote cluster is overloaded, it can take advantage of the extra machines in the other two clusters, which then reply directly back to the client - and vice versa.
In that situation a client in (say) Edinburgh, could request an object via the director at Manchester, and if the machines are overloaded there, have the request forwarded to London, which then requests the object via a network completely separate from the director's and returns the object to the client.
That UK National cache is likely to introduce another node at another location in the country at some point in the near future, which will be very useful. (The key advantage is that at each location we gain X more Mbit/s of bandwidth to utilise, making service better for users.)
Note: The thread here is a bit of a logical mess. The original postings are not in either archive, so I can't straighten it out anymore.
Joe:
How do I get BGP info from a BGP router to the director?
Lars Marowsky-Bree lmb (at) teuto (dot) net 23 Jul 1999
If you telnet to the BGP4 port (port 179) of the router running BGP4
# telnet router bgp
and do a
"sh ip route www.yahoo.com" |
for example, you will get something like this
Routing entry for 204.71.192.0/20, supernet
  Known via "bgp 8925", distance 20, metric 0
  Tag 1270, type external
  Last update from 139.4.18.2 1w5d ago
  Routing Descriptor Blocks:
  * 139.4.18.2, from 139.4.18.2, 1w5d ago
      Route metric is 0, traffic share count is 1
      AS Hops 4
This address is 4 AS hops away from me. You can also find out this information using SNMP if I recall correctly.
The coolest idea would be to actually run a routing daemon on the cluster manager (like gated or Zebra, see www.zebra.org); then we wouldn't even need to telnet to the router but could run fully self-contained using an IBGP feed. Zebra is quite modular and could possibly be made to integrate more tightly with the dispatcher...
Joe
It must have been your other mail where you said that this was simple but not everyone knew about it. I just found out why. My Cisco TCP/IP routing book has 2 pages on BGP. They told me to find a Cisco engineer to "discuss my needs" with if I wanted to know more about BGP.
There is actually some sort of nice introduction hidden on www.cisco.com, search for BGP4. "Internet Routing Architecture" from Cisco Press covers everything you might want to know about BGP4.
_All_ routers participating in global Internet routing hold a full view of the routing table, reflecting their view of the network. They know how many ASes (autonomous systems) are between them and any reachable destination on the network. If they have multiple choices (i.e. multiple connections, so-called multi-homed providers), they select the one with the shortest AS path and install it into their routing table.
Now, one sets up a dispatcher which has BGP4 peerings with all participating clusters. Since the dispatcher only installs the best routes to all targets in its routing table, it is a simple lookup to see which cluster is closest to the client.
If a cluster fails, the peering with the dispatcher is shutdown and the backup routes (the views learned from other clusters) take over.
This is actually very nice and well tested technology, since it is what makes the internet work.
It requires cooperation on the part of the ISP hosting the cluster: they must provide a read-only "IBGP" feed to the cluster manager inside their AS.
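As a rough idea of what that looks like, here is a minimal zebra bgpd.conf sketch of such a read-only IBGP peering (all AS numbers and addresses are placeholders, not a tested configuration):

! bgpd.conf on the cluster manager/dispatcher
hostname dispatcher-bgpd
password zebra
!
router bgp 64512
 bgp router-id 192.0.2.1
 ! IBGP peering with the provider's router (same AS number)
 neighbor 198.51.100.1 remote-as 64512
 neighbor 198.51.100.1 description read-only full feed from the ISP
 ! announce nothing back to the provider
 neighbor 198.51.100.1 route-map DENY-ALL out
!
route-map DENY-ALL deny 10

Once the table has been learned, a lookup like "show ip bgp 203.0.113.45" on bgpd's vty should show the best path (and hence the AS path length) towards that client address.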
BGP4 AS hops may not be directly related to latency. However, it tells you how many backbones are in between, which does have a tight relationship to the latency. And you can use BGP4 route-maps etc. to influence the load balancing in an easy way - if you have one cluster from which a certain part of the Internet is reached via a slow satellite link, you can automatically lower the preference for all routes coming in via that satellite link and not use them.
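A hedged sketch of that route-map idea in zebra syntax (the neighbour address, AS number and preference value are made up):

! prefer other paths over anything learned from the satellite-fed peer
route-map VIA-SATELLITE permit 10
 set local-preference 50
!
router bgp 64512
 neighbor 192.0.2.99 route-map VIA-SATELLITE in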
Ted Pavlic tpavlic (at) netwalk (dot) com 9 Sep 1999:
For now AS hops probably are useful - we have two mirrors on different continents.
Lars
You do NOT and cannot run OSPF here. OSPF is an "IGP" (interior routing protocol) which can't be usefully applied here.
I suppose I figured large networks might all share OSPF information, but I guess that they wouldn't share too much OSPF information between different geographical locations. (And I'm guessing that the latency between the load balancer and the user will PROBABLY be almost exactly the same as the latency between the end-servers and the user... so...) I never claimed that I knew much of anything about BGP or OSPF, but thought that if BGP wasn't very helpful... OSPF might be. :) (It was a shot in the dark - a request for comments, if anything.)
Of course, you not only want to factor in BGP4, but also load and availability. We should investigate what other geographical load balancers do. A lot of them set up large proxying networks.
AFAICT, a lot of geographical load balancing systems seem to use their own means of finding which server is best per end-user. I think, for decent load balancing on that sort of scale, most balancers have to invent their own wheel.
Doing decent geographical load balancing is by no means an easy task. Companies put a good deal of R and D into doing just this, and customers pay a good deal of money for the results.
Worse comes to worst, have a perl script look-up the name of the machine that made the request... grab the first-level domain... figure out which continent it should go on.
DNS has no guaranteed reply times at all.
It wasn't a serious suggestion for production, just a way to divvy out which mirror got which request.
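In that same toy spirit, a shell sketch of the idea (mirror names are made up; definitely not for production):

#!/bin/sh
# map a client IP to a mirror by the top-level domain of its reverse DNS name
CLIENT_IP="$1"
NAME=`host "$CLIENT_IP" | awk '/domain name pointer/ {print $NF}'`
TLD=`echo "$NAME" | sed 's/\.$//' | awk -F. '{print $NF}'`
case "$TLD" in
    au|nz)           echo mirror-au.example.com ;;
    uk|de|nl|fr|ie)  echo mirror-eu.example.com ;;
    *)               echo mirror-us.example.com ;;
esac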
Andy Wettstein awettstein (at) cait (dot) org 15 Jul 2003
I am trying to extend an LVS-DR to a different physical location with the help of an EtherIP bridge. I am using two OpenBSD boxes to do this; if you want to see the details, look almost all the way at the bottom of http://www.openbsd.org/cgi-bin/man.cgi?query=brconfig&sektion=8. I am not using IPSec so that is not causing me any problems.
Anyway, I have all normal LAN traffic working correctly, so I'm sure the EtherIP bridge is working correctly, but if I have a server that is in an LVS cluster, the server never sees the traffic that is being sent to it as part of the cluster.
Ratz
Do you rewrite MAC addresses on the bridge? What does a tcpdump look like on the director, the bridge and the node on the other side? How are the neighbour tables set up?
I don't do any MAC address rewriting on the bridge. This is my test service:
TCP  192.168.0.45:8000 wlc
  -> 192.168.0.48:8000          Route   1      0          0

The openbsd box with the director on its physical LAN is set up like this (all real IPs changed):
vlan0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        address: 00:02:b3:d0:36:0d
        vlan: 57 parent interface: em0
        inet6 fe80::202:b3ff:fed0:360d%vlan0 prefixlen 64 scopeid 0x1a
        inet 192.168.0.1 netmask 0xffffff80 broadcast 192.168.0.127
gif1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1280
        physical address inet 172.20.1.2 --> 172.20.1.3
        inet6 fe80::206:5bff:fefd:ef23%gif1 -> prefixlen 64 scopeid 0x30
bridge0: flags=41<UP,RUNNING>
        Configuration:
                priority 32768 hellotime 2 fwddelay 15 maxage 20
        Interfaces:
                gif1 flags=3<LEARNING,DISCOVER>
                        port 48 ifpriority 128 ifcost 55
                vlan0 flags=3<LEARNING,DISCOVER>
                        port 26 ifpriority 128 ifcost 55

The openbsd box with the member of the cluster (traffic never gets to it):
vlan1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        address: 00:02:b3:d0:32:78
        vlan: 57 parent interface: em0
        inet6 fe80::202:b3ff:fed0:3278%vlan1 prefixlen 64 scopeid 0x1c
gif1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1280
        physical address inet 172.20.1.3 --> 172.20.1.2
        inet6 fe80::206:5bff:fe3e:6d58%gif1 -> prefixlen 64 scopeid 0x31
bridge0: flags=41<UP,RUNNING>
        Configuration:
                priority 32768 hellotime 2 fwddelay 15 maxage 20
        Interfaces:
                gif1 flags=3<LEARNING,DISCOVER>
                        port 49 ifpriority 128 ifcost 55
                vlan1 flags=3<LEARNING,DISCOVER>
                        port 28 ifpriority 128 ifcost 55

The tcpdumps show only packets through bridge0 on the side of the bridge with the director on it. I can't see any traffic on gif1.
192.168.0.0 is subnetted so 192.168.0.143 goes through the openbsd box, which is also our router. That just gave me an idea. Testing from an IP that doesn't need to be routed...Works!!
So going through
192.168.0.143/26 -> 192.168.0.129/26 -> 192.168.0.48/25
                    ^^^ router interface on openbsd box (vlan2)

doesn't work, but going
192.168.0.61/25 -> 192.168.0.48/25

without a route does work.
A little later...
I looked into this a little bit further. The problems I was having were mostly due to the OpenBSD firewall not keeping state on those connections that needed to be routed by that router/EtherIP bridge machine. After I got that fixed, traffic would show up on the cluster node and the node would try to reply, but I would never see the return traffic. After a little further investigation, the tcpdumps showed me that the traffic needed to be fragmented, because on that bridge the MTU is 1280. So I set the MTU to 1280 on the cluster node and everything works.
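The fix described above amounts to something like this on the cluster node (the interface name eth0 is just an assumption):

# match the EtherIP bridge's MTU so replies aren't fragmented
ifconfig eth0 mtu 1280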
So you can add that as another way to geographically extend the LVS. It is a little inefficient since all broadcast LAN traffic gets transmitted across the bridge, but that isn't a problem for me.
Round robin DNS was one of the first attempts at load balancing servers. Horms tried it in the late '90s but was defeated by the caching of DNS information in local servers (you can set TTL=0 but, quite sensibly, not all servers honour TTL=0).
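For illustration only, round robin DNS is just multiple A records for the same name with a low TTL, in the hope that caches expire the answer quickly (hypothetical zone fragment and addresses):

; the nameserver rotates the order of these A records between queries
www    60   IN   A   192.0.2.10
www    60   IN   A   198.51.100.10
www    60   IN   A   203.0.113.10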
Malcolm Turnbull malcolm (at) loadbalancer (dot) org 24 Jun 2004
See the fud section on my site, under the GSLB bit (http://www.loadbalancer.org/fud.html, link dead Feb 2005) which goes to Why DNS Based Global Server Load Balancing (GSLB) Doesn't Work (http://www.tenereillo.com/GSLBPageOfShame.htm).
The summary of the page is that most browsers (Netscape, IE) have their own DNS cache (15-30 min expiration). This, together with caching in local DNS servers, will defeat any attempt to load balance or fail over via DNS.
In a thread where someone suggested load balancing by round robin DNS...
jkreger (at) lwolenczak (dot) net Jun 24, 2004 suggested that you get your routing information from BGP, which is fed a fake table.
Horms
I have some code to do this. Basically it creates a small routing daemon that gets all its data from a list of routes you provide at run time. It's here (http://cvs.sourceforge.net/viewcvs.py/supersparrow/supersparrow/ssrs/). I also have some code to help generate the list of routes from BGP dumps from (Cisco) routers (http://cvs.sourceforge.net/viewcvs.py/vanessa/scratch/inet_map/) and a patch to Bind 9 to add supersparrow support (http://www.supersparrow.org/download/wip/bind9/).
All this needs a bit of polish, as apart from hacking on it for my own personal use I haven't done any work on it for a while. It does work - it is actually used for www.linuxvirtualserver.org
How does a client in Ireland get sent to a server in England, while someone on the east coast of the USA gets a server in New York? The machine name is the same in both cases.
Malcolm lists (at) loadbalancer (dot) org 21 Nov 2006
You either use:
UltraDNS.com do a managed service for this at about $400 per month per DNS entry, which is one of the best ways of doing it.
Josh Marshall josh (at) worldhosting (dot) org 22 Nov 2006
We use the supersparrow software (written by Horms) on our DNS servers and it works really well for sites between our Australian and Holland datacenters. I don't have the co-operation of our uplinks so I fake the BGP and with a few scripts it also handles failover to one site. My employer's site www.worldhosting.org is handled this way.
First you have to run a patched version of bind9 (I have debian packages for anyone who needs them) - get the source from http://www.supersparrow.org/, or add the following to your /etc/apt/sources.list for my supersparrow and patched bind9 packages (woody packages are also available; replace sarge with woody):

deb http://debian.worldhosting.org/supersparrow sarge main
Create in your bind config something like:
zone "www.worldhosting.org" { type master; database "ss --host 127.0.0.1 --route_server ssrs --password XXXX \ --debug --peer 64600=210.18.215.100,64601=193.173.27.8 \ --self 193.173.27.8 --port 7777 --result_count 1\ --soa_host ns.worldhosting.org. --soa_email hostmaster.worldhosting.org.\ --ns ns.worldhosting.org. --ns ns.au.worldhosting.org. --ttl 7 --ns_ttl 60"; \ }; |
This snippet sets www to use 210.18.215.100 if the peer is 64600 and 193.173.27.8 if the peer is 64601; the ttl for the A record is 60 seconds, and --self is the default response for this nameserver (on the secondary nameserver make this the other address). Set the password to the same as in /etc/supersparrow.conf.
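A quick way to see what a given resolver will be handed back is to query the patched nameserver directly (addresses and names from the config above; which answer you actually get depends on where the querying resolver's address falls in /etc/ssrs.routes):

# ask the supersparrow-patched nameserver for the record
dig @ns.worldhosting.org www.worldhosting.org A +short
# a resolver mapping to peer 64600 should see 210.18.215.100,
# one mapping to peer 64601 should see 193.173.27.8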
Create three files to describe the routes in normal and failed modes. In our setup:
$ cat ssrs.routes.AUonly
0.0.0.0/0 64600

$ cat ssrs.routes.NLonly
0.0.0.0/0 64601

$ head ssrs.routes.normal
128.184.0.0/16 64600
128.250.0.0/16 64600
129.78.0.0/16 64600
129.94.0.0/16 64600
129.96.0.0/16 64600
129.127.0.0/16 64600
129.180.0.0/16 64600
130.56.0.0/16 64600
130.95.0.0/16 64600
130.102.0.0/16 64600
The ssrs.routes.normal file contains all the subnets you wish to force to use the respective peer.
Create a script that does an http test periodically (we do it every 5 minutes, as the web servers don't go down frequently). If both sites work, symlink the normal file to /etc/ssrs.routes. If only one works, symlink the file for the site that works (i.e. AUonly or NLonly) to /etc/ssrs.routes. Then check to see if the config has changed and, if so, restart supersparrow. I use the check_http script from the nagios package to do the test. See my script below:
#!/bin/sh
PATH=/sbin:$PATH

# Supersparrow results
SSNORMAL=0
SSAUONLY=1
SSNLONLY=2

AUIP=210.18.215.100
NLIP=193.173.27.8

AUW=0
NLW=0

#ping -c 2 $AUIP >/dev/null && AUP=1
#ping -c 2 $NLIP >/dev/null && NLP=1

/sbin/check_http -H $NLIP -u /index.html -p 80 -t 20 >/dev/null && NLW=1
/sbin/check_http -H $AUIP -u /index.html -p 80 -t 20 >/dev/null && AUW=1

# Do the tests again in case there was a hiccup
/sbin/check_http -H $NLIP -u /index.html -p 80 -t 20 >/dev/null && NLW=1
/sbin/check_http -H $AUIP -u /index.html -p 80 -t 20 >/dev/null && AUW=1

if [ $NLW -eq 1 ]
then
        if [ $AUW -eq 1 ]
        then
                OPMODE="Normal Operation"
                SPARROW=$SSNORMAL
        else
                OPMODE="NL running but AU down"
                SPARROW=$SSNLONLY
        fi
else
        if [ $AUW -eq 1 ]
        then
                OPMODE="AU running but NL down"
                SPARROW=$SSAUONLY
        else
                OPMODE="AU and NL down"
                SPARROW=$SSNORMAL
        fi
fi

if [ $SPARROW -eq $SSNORMAL ]
then
        ln -sf /var/named/supersparrow/ssrs.routes.normal /etc/ssrs.routes
fi
if [ $SPARROW -eq $SSAUONLY ]
then
        ln -sf /var/named/supersparrow/ssrs.routes.AUonly /etc/ssrs.routes
fi
if [ $SPARROW -eq $SSNLONLY ]
then
        ln -sf /var/named/supersparrow/ssrs.routes.NLonly /etc/ssrs.routes
fi

md5sum -c /etc/ssrs.routes.md5sum &>/dev/null && exit
/etc/init.d/supersparrow reload
md5sum /etc/ssrs.routes > /etc/ssrs.routes.md5sum
echo Supersparrow: $OPMODE
With a DNS server at each location, if there is an international routing problem that prevents them from communicating with each other, then the server will set all responses to point www at the local hosting location. Then any sites on the net that can get to that DNS server will use the www that is there (and therefore have a high chance of it working).
Ratz 22 Nov 2006
If we are talking about web services, a nice but not very well known (and sometimes not feasible) approach is proxy.pac URL hash load balancing, best explained at http://naragw.sharp.co.jp/sps/ and http://naragw.sharp.co.jp/sps/sps-e.html .
David Carlson dcarlson (at) culminex (dot) com 11 Dec 2002
We are putting a bid in on a fairly major web site. The client has asked for 24/7/365 reliability. We were initially going to bid a Linux virtual server direct routing solution with main and backup Linux directors in a multihomed data centre. We were proposing the following hardware:
- Linux Director and Backup director to route the requests to Real servers on the LAN
- Real servers 1 and 2 to do the work and route data back to the user
- DB server 1 to provide the data to the realservers.
However, our partner has come up with an interesting wrinkle. They have a second data centre where they can host a mirror of our site. It uses a different company for main internet service, so it is not only geographically removed, but has different power and internet service too.
We are now going back and revisiting our hardware configuration. It would seem that with two physical locations we should use IP tunneling (http://www.linuxvirtualserver.org/VS-IPTunneling.html). In this case, our hardware configuration would be:
- At Main location: Linux director, Real page server 1, DB Server 1
- At alternate location: backup linux director, Real page server 2, DB server 2
We've never done this before. But if it works, it would sure increase our claimed reliability as we can talk about multihomed, geographically separate, entirely redundant systems.
My questions are: what do we do with the Linux director at the main site to have a failover solution? If the internet service to the main site fails, how does the alternate site know to take over receiving requests? Given that it is elsewhere on the WAN, how does the backup site update local routers with the virtual IP? Do we need a backup Linux director at the alternate site? What if the main site's internet is OK but the main Linux director fails - will a backup director at the alternate site take over and still send requests to realserver 1 at the main site?
Horms:
It probably needs a bit of TLC but it is pretty simple, so it should work without too much bother. I'd be quite happy to work with someone to make this so. I'm using it myself on a very small (experimental) site without too much bother. I also have a patch to make it work with bind9 that I'm happy to send to anyone who is interested.
Peter Mueller pmueller (at) sidestep (dot) com: Does the bind9 patch use recursive DNS to geo-locate users and send them to specific locations? (Is this what 3DNS does?)

The bind9 patch works in conjunction with supersparrow to return DNS results based on the source IP address of the DNS request - i.e. the IP address of the DNS server. So yes, it tries to return results based on someone's network/geographic location. I can't really comment on 3DNS as I have not used it, but I believe that it does something similar.
Matthew S. Crocker matthew (at) crocker (dot) com
This is what Akamai does. They use BGP table information to build a network map and then announce DNS information based on the closest server. For example, I have 3 Akamai servers in my network. When I use DNS to look up a.1234.akamai.net I get the IP address of my servers. If I go onto some equipment on a different network, the same name gives me different IPs. It is always the IP closest to me, network-wise.
Go to www.msnbc.com, view source and search for akamai.net. You'll see the reference a799.g.akamai.net. Traceroute to that name; from home it is 63.240.15.152, 12 hops away. From work, it is one of my IPs and 2 hops away :). Pretty cool actually.
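If you want to repeat the experiment from your own network, something along these lines (using the name from the example above) shows which address your location is handed and how far away it is:

# answers differ depending on which network you ask from
dig +short a799.g.akamai.net
traceroute a799.g.akamai.net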
Use DNS to load balance between clusters based on BGP network data, then have each cluster set up as an LVS HA cluster to load balance locally.