Note | |
---|---|
see also Julian's layer 4 LVS-NAT setup (http://www.ssi.bg/~ja/L4-NAT-HOWTO.txt). |
LVS-NAT is based on Cisco's LocalDirector.
This method was used for the first LVS. If you want to set up a test LVS, this requires no modification of the realservers and is still probably the simplest setup.
In a commercial environment, the owners of servers are loath to change the configuration of a tested machine. When they want load balancing, they will clone their server and tell you to put your load balancer in front of their rack of servers. You will not be allowed near any of their servers, thank you very much. In this case you use LVS-NAT.
Ratz Wed, 15 Nov 2006
Most commercial load balancers are no longer set up using triangulation mode (Joe: i.e. LVS-DR), at least in the projects I've been involved in. The load balancer is becoming more and more a router, using well-understood key technologies like VRRP and content processing.
With LVS-NAT, the incoming packets are rewritten by the director changing the dst_addr from the VIP to the address of one of the realservers and then forwarded to the realserver. The replies from the realserver are sent to the director where they are rewritten and returned to the client with the source address changed from the RIP to the VIP.
Unlike the other two methods of forwarding used in an LVS (LVS-DR and LVS-Tun), the realserver only needs a functioning TCP/IP stack (e.g. a networked printer), i.e. the realserver can have any operating system and no modifications are made to the configuration of the realservers (except setting their route tables).
Sep 2006: Various problems have surfaced in the 2.6.x LVS-NAT code, all relating to routing (netfilter) on the side of the director facing the internet. People using LVS-NAT on a director which isn't a firewall and which only has a single default gw aren't having any problems.
It seems the 2.4.x code was working correctly: Farid Sarwari had it working for IPSec at least. The source routing problem has been identified by three people, who've all submitted functionally equivalent patches. While we're delighted to have contributions from so many people, we regret that we weren't fast enough to recognise the problem and save the last two people all their work. One of the problems (we think) is that not many people are using LVS-NAT and when a weird problem is reported on the mailing list we say "well 1000's of people have been using LVS-NAT for years without this problem, this guy must not know what he's talking about". We're now taking the approach that maybe not too many people are using LVS-NAT.
Here are the problems which have surfaced so far with LVS-NAT. They either have been solved or will be in a future release of LVS.
Note | |
---|---|
If the VIP and the RIPs are on the same network you need the One Network LVS-NAT |
Here the client is on the same network as the VIP (in a production LVS, the client will be coming in from an external network via a router). (The director can have 1 or 2 NICs - two NICs will allow higher throughput of packets, since the traffic on the realserver network will be separated from the traffic on the client network).
machine                      IP
client                       CIP=192.168.1.254
director VIP                 VIP=192.168.1.110 (the IP for the LVS)
director internal interface  DIP=10.1.1.9 (director interface on the LVS-NAT network)
realserver1                  RIP1=10.1.1.2
realserver2                  RIP2=10.1.1.3
realserver3                  RIP3=10.1.1.4
.
.
realserverN                  RIPn=10.1.1.n+1
                        ________
                       |        |
                       | client |
                       |________|
                     CIP=192.168.1.254
                           |
                        (router)
                           |
                        __________
                       |          |
                       |          | VIP=192.168.1.110 (eth0:110)
                       | director |---|
                       |__________|
                           | DIP=10.1.1.9 (eth0:9)
                           |
          -----------------------------------
          |                |                |
          |                |                |
   RIP1=10.1.1.2    RIP2=10.1.1.3    RIP3=10.1.1.4 (all eth0)
   _____________    _____________    _____________
  |             |  |             |  |             |
  | realserver  |  | realserver  |  | realserver  |
  |_____________|  |_____________|  |_____________|
Here's the lvs_nat.conf for this setup:
LVS_TYPE=VS_NAT
INITIAL_STATE=on
VIP=eth0:110 lvs 255.255.255.0 192.168.1.255
DIP=eth0 dip 192.168.1.0 255.255.255.0 192.168.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr realserver1:telnet realserver2:telnet realserver3:telnet
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=dip
#----------end lvs_nat.conf------------------------------------
The VIP is the only IP known to the client. The RIPs here are on a different network to the VIP (although with only 1 NIC on the director, the VIP and the RIPs are on the same wire).
In normal NAT, masquerading is the rewriting of packets originating behind the NAT box. With LVS-NAT, the incoming packet (src=CIP,dst=VIP, abbreviated to CIP->VIP) is rewritten by the director (becoming CIP->RIP). The action of the LVS director is called demasquerading. The demasqueraded packet is forwarded to the realserver. The reply packet (RIP->CIP) is generated by the realserver.
For LVS-NAT to work, the reply packets from the realservers must be routed back through the director, where they are masqueraded to have source address VIP. Forgetting to set this up is the single most common cause of failure when setting up a LVS-NAT LVS.
The original (and the simplest from the point of view of setup) way is to make the DIP (on the director) the default gw for the packets from the realserver. The documentation here all assumes you'll be using this method. (Any IP on the director will do, but in the case where you have two directors in active/backup failover, you have an IP that is moved to the active director and this is called the DIP). Any method of making the return packets go through the director will do. With the arrival of the Policy Routing tools, you can route packets according to any parameter in the packet header (e.g. src_addr, src_port, dest_addr..) Here's an example of ip rules on the realserver to route packets from the RIP to an IP on the director. This avoids having to route these packets via a default gw.
Neil Prockter prockter (at) lse (dot) ac (dot) uk 30 Mar 2004
you can avoid using the director as the default gw by
realserver# echo 80 lvs >> /etc/iproute2/rt_tables
realserver# ip route add default via <address on director, e.g. DIP> table lvs
realserver# ip rule add from <RIP> table lvs

For the IPs in Virtual Server via NAT (http://www.linuxvirtualserver.org/VS-NAT.html):
echo 80 lvs >> /etc/iproute2/rt_tables
ip route add default via 172.16.0.1 table lvs
ip rule add from 172.16.0.2 table lvs

I do this with LVS and with Cisco CSS units.
Here Neil is routing packets from RIP to 0/0 via DIP. You can be more restrictive and route packets from RIP:port (where port is the LVS'ed service) to 0/0 via DIP. Packets from RIP:other_ports can be routed via other rules.
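For example, one way to do that (a sketch, not from the HOWTO; it assumes a 2.4+ kernel with iptables and iproute2 on the realserver, and that 80 is the LVS'ed port) is to mark the replies from that port and route on the mark:

realserver# iptables -t mangle -A OUTPUT -p tcp --sport 80 -j MARK --set-mark 1
realserver# echo 81 lvsport >> /etc/iproute2/rt_tables
realserver# ip route add default via <DIP> table lvsport
realserver# ip rule add fwmark 1 table lvsport

Packets from RIP:80 then return via the director, while traffic from other ports follows the realserver's normal routes.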
For a 2 NIC director (with different physical networks for the realservers and the clients), it is enough for the default gw of the realservers to be the director. For a 1 NIC, two network setup (where the two networks are using the same link layer), in addition, the realservers must only have routes to the director. For a 1 NIC, 1 network setup, ICMP redirects must be turned off on the director (see One Network LVS-NAT) (the configure script does this for you).
In a normal server farm, the default gw of the realserver would be the router to the internet and the packet RIP->CIP would be sent directly to the client. In a LVS-NAT LVS, the default gw of the realservers must be the director. The director masquerades the packet from the realserver (rewrites it to VIP->CIP) and the client receives a rewritten packet with the expected source IP of the VIP.
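As a minimal sketch (using the DIP=10.1.1.9 from the example network above), on each realserver:

realserver# ip route del default                 # remove any existing default route
realserver# ip route add default via 10.1.1.9    # the DIP on the director
realserver# ip route show                        # expect: default via 10.1.1.9 dev eth0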
Note | |
---|---|
The packet must be routed via the director; there must be no other path to the client. A packet arriving at the client directly from the realserver, rather than going through the director, will not be seen as a reply to the client's request and the connection will hang. If the director is not the default gw for the realservers, then if you use tcpdump on the director to watch an attempt to telnet from the client to the VIP (run tcpdump with `tcpdump port telnet`), you will see the request packet (CIP->VIP), the rewritten packet (CIP->RIP) and the reply packet (RIP->CIP). You will not see the rewritten reply packet (VIP->CIP). (Remember that if you have a switch on the realserver's network, rather than a hub, each node only sees the packets to/from it: tcpdump won't see packets between other nodes on the same network.) |
Part of the setup of LVS-NAT then is to make sure that the reply packet goes via the director, where it will be rewritten to have the addresses (VIP->CIP). In some cases (e.g. 1 net LVS-NAT) icmp redirects have to be turned off on the director so that the realserver doesn't get a redirect telling it to forward packets directly to the client.
In a production system, a router would prevent a machine on the outside exchanging packets with machines on the RIP network. As well, the realservers will be on a private network (eg 192.168.x.x/24) and replies will not be routable.
In a test setup (no router), these safeguards don't exist. All machines (client, director, realservers) are on the same piece of wire and if routing information is added to the hosts, the client can connect to the realservers independently of the LVS. This will stop LVS-NAT from working (your connection will hang), or it may appear to work (you'll be connecting directly to the realserver).
In a test setup, traceroute from the realserver to the client should go through the director (2 hops in the above diagram). The configure script will test that the director's gw is 2 hops from the realserver and that the route to the director's gw is via the director, preventing this error.
(Thanks to James Treleaven jametrel (at) enoreo (dot) on (dot) ca 28 Feb 2002, for clarifying the write up on the ping tests here.)
In a test setup with the client connected directly to the director (in the setup above with 1 or 2 NICs, or the one NIC, one network LVS-NAT setup), you can ping between the client and realservers. However in production, with the client out on internet land, and the realservers with unroutable IPs, you should not be able to ping between the realservers and the client. The realservers should not know about any other network than their own (here 10.1.1.0). The connection from the realservers to the client is through ipchains (for 2.2.x kernels) and LVS-NAT tables setup by the director.
In my first attempt at LVS-NAT setup, I had all machines on a 192.168.1.0 network and added a 10.1.1.0 private network for the realservers/director, without removing the 192.168.1.0 network on the realservers. All replies from the servers were routed onto the 192.168.1.0 network rather than back through LVS-NAT and the client didn't get any packets back.
Here's the general setup I use for testing. The client (192.168.2.254) connects to the VIP on the director. (The VIP on the realserver is present only for LVS-DR and LVS-Tun.) For LVS-DR, the default gw for the realservers is 192.168.1.254. For LVS-NAT, the default gw for the realservers is 192.168.1.9.
        ____________
       |            |192.168.1.254 (eth1)
       |  client    |----------------------
       |____________|                     |
    CIP=192.168.2.254 (eth0)              |
            |                             |
            |                             |
    VIP=192.168.2.110 (eth0)              |
        ____________                      |
       |            |                     |
       |  director  |                     |
       |____________|                     |
    DIP=192.168.1.9 (eth1, arps)          |
            |                             |
            |                             |
        (switch)---------------------------
            |
     RIP=192.168.1.2 (eth0)
     VIP=192.168.2.110 (for LVS-DR, lo:0, no_arp)
       _____________
      |             |
      | realserver  |
      |_____________|
This setup works for both LVS-NAT and LVS-DR.
Here's the routing table for one of the realservers as in the LVS-NAT setup.
realserver:# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U        40 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
0.0.0.0         192.168.1.9     0.0.0.0         UG       40 0          0 eth0
Here's a traceroute from the realserver to the client showing 2 hops.
traceroute to client2.mack.net (192.168.2.254), 30 hops max, 40 byte packets
 1  director.mack.net (192.168.1.9)  1.089 ms  1.046 ms  0.799 ms
 2  client2.mack.net (192.168.2.254)  1.019 ms  1.16 ms  1.135 ms
Note the traceroute from the client box to the realserver only has one hop.
On the director, icmp redirects are on, but the director doesn't issue a redirect (see icmp redirects), because the packet RIP->CIP from the realserver leaves on a different NIC on the director than it arrived on (and with a different source IP). The client machine doesn't send a redirect since it is not forwarding packets; it's the endpoint of the connection.
Use lvs_nat.conf as a template (sample here will setup LVS-NAT in the diagram above assuming the realservers are already on the network and using the DIP as the default gw).
#--------------lvs_nat.conf----------------------
LVS_TYPE=VS_NAT
INITIAL_STATE=on
#director setup:
VIP=eth0:110 192.168.1.110 255.255.255.0 192.168.1.255
DIP=eth0:10 10.1.1.10 10.1.1.0 255.255.255.0 10.1.1.255
#Services on realservers:
#telnet to 10.1.1.2
SERVICE=t telnet wlc 10.1.1.2:telnet
#http to 10.1.1.2 (with weight 2) and to high port on 10.1.1.3
SERVICE=t 80 wlc 10.1.1.2:http,2 10.1.1.3:8080 10.1.1.4
#realserver setup (nothing to be done for LVS-NAT)
#----------end lvs_nat.conf------------------------------------
The output is a commented rc.lvs_nat file. Run the rc.lvs_nat file on the director and then the realservers (the script knows whether it is running on a director or realserver).
The configure script will set up masquerading and forwarding on the director, and the default gw for the realservers.
The packets coming in from the client are being demasqueraded by the director.
In 2.2.x you need to masquerade the replies. Here's the masquerading code in rc.lvs_nat, that runs on the director (produced by the configure script).
echo "turning on masquerading " #setup masquerading echo "1" >/proc/sys/net/ipv4/ip_forward echo "installing ipchain rules" /sbin/ipchains -A forward -j MASQ -s 10.1.1.2 http -d 0.0.0.0/0 #repeated for each realserver and service .. .. echo "ipchain rules " /sbin/ipchains -L |
In this example, replies from the realserver's http service are masqueraded by the director, allowing the realserver to answer the http requests which were demasqueraded by the director as part of the 2.2.x LVS code.
In 2.4.x, masquerading of LVS'ed services is done explicitly by the LVS code and no extra masquerading (iptables) commands need be run.
One of the features of LVS-NAT is that you can rewrite/re-map the ports. Thus the client can connect to VIP:http, while the realserver can be listening on some other port (!http). You set this up with ipvsadm.
Here the client connects to VIP:http, the director rewrites the packet header so that dst_addr=RIP:9999 and forwards the packet to the realserver, where the httpd is listening on RIP:9999.
director:/# /sbin/ipvsadm -a -t VIP:http -r RIP:9999 -m -w 1 |
For each realserver (i.e. each RIP) you can rewrite the ports differently: each realserver could have the httpd listening on its own particular port (e.g. RIP1:9999, RIP2:80, RIP3:xxxx).
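For example (a sketch using the example VIP and RIPs above; 8080 stands in for the unspecified RIP3 port):

director# /sbin/ipvsadm -A -t 192.168.1.110:80 -s rr
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:9999 -m -w 1
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.3:80 -m -w 1
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.4:8080 -m -w 1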
Although port re-mapping is not possible with LVS-DR or LVS-Tun, it's possible to use iptables to do Re-mapping ports with LVS-DR (and LVS-Tun) on the realserver, producing the same result.
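(For reference, the realserver-side trick under LVS-DR is usually a local REDIRECT rule; this is only a sketch, assuming iptables on the realserver and an httpd listening on port 8000:)

realserver# iptables -t nat -A PREROUTING -p tcp -d $VIP --dport 80 -j REDIRECT --to-ports 8000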
For the earlier versions of LVS-NAT (with 2.0.36 kernels) the timeouts were set by linux/include/net/ip_masq.h, the default values of masquerading timeouts are:
#define MASQUERADE_EXPIRE_TCP     15*16*HZ
#define MASQUERADE_EXPIRE_TCP_FIN  2*16*HZ
#define MASQUERADE_EXPIRE_UDP      5*16*HZ
Julian has his latest fool-proof setup doc at Julian's software page. Here's the version at the time I wrote this entry.
Q.1 Can the realserver ping the client?

    rs# ping -n client

    A.1 Yes => good
    A.2 No  => bad

    Some settings for the director:

    Linux 2.2/2.4:  ipchains -A forward -s RIP -j MASQ
    Linux 2.4:      iptables -t nat -A POSTROUTING -s RIP -j MASQUERADE

Q.2 Does a traceroute to the client go through the LVS box and reach the client?

    traceroute -n -s RIP CLIENT_IP

    A.1 Yes => good
    A.2 No  => bad
        same ipchains command as in Q.1
        For client and server on same physical media use these in the director:

        echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
        echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects

Q.3 Is the traffic forwarded from the LVS box, in both directions?

    For all interfaces on the director:

    tcpdump -ln host CLIENT_IP

    The right sequence, i.e. the IP addresses and ports on each step
    (the reversed ones for the in->out direction are not shown):

    CLIENT
       |    CIP:CPORT -> VIP:VPORT
       |          ||
       |          \/
    out|    CIP:CPORT -> VIP:VPORT
       ||   LVS box
       \/
     in|    CIP:CPORT -> RIP:RPORT
       |          ||
       |          \/
       |    CIP:CPORT -> RIP:RPORT
       + REAL SERVER

    A.1 Yes, in both directions => good (for Layer 4, probably not for L7)
    A.2 The packets from the realserver are dropped => bad:
        - rp_filter protection on the incoming interface, probably hit from a
          local client (for more info on rp_filter, see the section on the
          proc filesystem)
        - firewall rules drop the replies
    A.3 The packets from the realservers leave the director unchanged
        - missing -j MASQ ipchains rule in the LVS box
        For client and server on same physical media: the packets simply do
        not reach the director. The realserver is ICMP redirected to the
        client. In the director:

        echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
        echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects

    A.4 All packets from the client are dropped
        - the requests are received on the wrong interface with rp_filter
          protection
        - firewall rules drop the requests
    A.5 The client connections are refused or are served from a service in
        the LVS box
        - client and LVS are on the same host => not valid
        - the packets are not marked by the firewall and don't hit a
          firewall-mark based virtual service

Q.4 Is the traffic replied from the realserver?

    For the outgoing interface on the realserver:

    tcpdump -ln host CLIENT_IP

    A.1 Yes, SYN+ACK => good
    A.2 TCP RST => bad, no listening real service
    A.3 ICMP message => bad, blocked by firewall/no listening service
    A.4 The same request packet leaves the realserver => missing accept rules
        or RIP is not defined
    A.5 No reply => realserver problem:
        - the rp_filter protection drops the packets
        - the firewall drops the request packets
        - the firewall drops the replies
    A.6 Replies go through another device or don't go to the LVS box => bad
        - the route to the client is direct and so doesn't pass through the
          LVS box, for example:
            - client on the LAN
            - client and realserver on the same host
        - a wrong route to the LVS box is used => use another

        Check the route:

        rs# ip route get CLIENT_IP from RIP

        Then start the following tests:

        rs# tcpdump -ln host CIP
        rs# traceroute -n -s RIP CIP
        lvs# tcpdump -ln host CIP
        client# tcpdump -ln host CIP

For deeper problems use tcpdump -len, i.e. sometimes the link layer addresses
help a bit.

For FTP:

    VS-NAT in Linux 2.2 requires:
        - modprobe ip_masq_ftp (before 2.2.19)
        - modprobe ip_masq_ftp in_ports=21 (2.2.19+)
    VS-NAT in Linux 2.4 requires:
        - ip_vs_ftp
    VS-DR/TUN require the persistent flag

    FTP reports with debug mode enabled are useful:

    # ftp
    ftp> debug
    ftp> open my.virtual.ftp.service
    ftp> ...
    ftp> dir
    ftp> passive
    ftp> dir

There are reports that sometimes the status strings reported by the FTP
realservers are not matched with the string constants encoded in the kernel
FTP support. For example, Linux 2.2.19 matches
"227 Entering Passive Mode (xxx,xxx,xxx,xxx,ppp,ppp)"

Julian Anastasov
On the director, ipvsadm does the following:
#setup connection for telnet, using round robin
director:/etc/lvs# /sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
#connections to x.x.x.110:telnet are sent to
#realserver 10.1.1.2:telnet
#using LVS-NAT (the -m) with weight 1
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.2:23 -m -w 1
#and to realserver 10.1.1.3
#using LVS-NAT with weight 2
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.3:23 -m -w 2
(if the service was http instead of telnet, the webserver on the realserver could be listening on port 8000 instead of 80)
Turn on ip_forwarding (so that the packets can be forwarded to the realservers)
director:/etc/lvs# echo "1" > /proc/sys/net/ipv4/ip_forward |
Example: client requests a connection to 192.168.1.110:23
director chooses realserver 10.1.1.2:23, updates connection tables, then
packet                   source      dest
incoming                 CIP:3456    VIP:23
inbound rewriting        CIP:3456    RIP1:23
reply (routed to DIP)    RIP1:23     CIP:3456
outbound rewriting       VIP:23      CIP:3456
The client gets back a packet with the source_address = VIP.
For the verbally oriented...
The request packet is sent to the VIP. The director looks up its tables and sends the connection to realserver1. The packet is rewritten with a new destination (in this case with the same port, but the port could be changed too) and sent to RIP1. The realserver replies, sending back a packet to the client. The default gw for the realserver is the director. The director accepts the packet and rewrites the packet to have source=VIP and sends the rewritten packet to the client.
Why isn't the source of the incoming packet rewritten to be the DIP or VIP?
Wensong
...changing the source of the packet to the VIP sounds good too, it doesn't require that default route rule, but requires additional code to handle it.
Note | |
---|---|
This was written for 2.0.x and 2.2.x kernel LVSs, which were based on the masquerading code. With 2.4.x, LVS is based on netfilter and there were initially some problems getting LVS-NAT to work with 2.4.x. What happens here for 2.4.x, I don't know. |
Joe
In normal NAT, where a bunch of machines are sitting behind a NAT box, all outward going packets are given the IP on the outside of the NAT box. What if there are several IPs facing the outside world? For NAT it doesn't really matter as long as the same IP is used for all packets. The default value is usually the first interface address (eg eth0). With LVS-NAT you want the outgoing packets to have the source of the VIP (probably on eth0:1) rather than the IP on the main device on the director (eth0).
With a single realserver LVS-NAT LVS serving telnet, the incoming packet does this,
CIP:high_port -> VIP:telnet      #client sends a packet
CIP:high_port -> RIP:telnet      #director demasquerades packet, forwards to realserver
RIP:telnet    -> CIP:high_port   #realserver replies
The reply arrives on the director (being sent there because the director is the default gw for the realserver). To get the packet from the director to the client, you have to reverse the masquerading done by the LVS. To do this (in 2.2 kernels), on the director you add an ipchains rule
director:# ipchains -A forward -p tcp -j MASQ -s realserver1 telnet -d 0.0.0.0/0 |
If the director has multiple IPs facing the outside world (eg eth0=192.168.2.1, the regular IP for the director, and eth0:1=192.168.2.110, the VIP), the masquerading code has to choose the correct IP for the outgoing packet. Only the packet with src_addr=VIP will be accepted by the client. A packet with any other src_addr will be dropped. The normal default for masquerading (eth0) should not be used in this case. The required m_addr (masquerade address) is the VIP.
Does LVS fiddle with the ipchains tables to do this?
Julian Anastasov ja (at) ssi (dot) bg 01 May 2001
No, ipchains only delivers packets to the masquerading code. It doesn't matter how the packets are selected in the ipchains rule.
The m_addr (masqueraded_address) is assigned when the first packet is seen (the connect request from the client to the VIP). LVS sees the first packet in the LOCAL_IN chain when it comes from the client. LVS assigns the VIP as maddr.
The MASQ code sees the first packet in the FORWARD chain when there is a -j MASQ target in the ipchains rule. The routing selects the m_addr. If the connection already exists the packets are masqueraded.
The LVS can see packets in the FORWARD chain but they are for already created connections, so no m_addr is assigned and the packets are masqueraded with the address saved in the connections structure (the VIP) when it was created.
There are 3 common cases:
- The connection is created in response to a packet.
- The connection is created in response to a packet belonging to another connection.
- The connection is already created.
Case (1) can happen in the plain masquerading case where the in->out packets hit the masquerading rule. In this case, when nothing dictates the s_addr for the packets going to the external side of the MASQ, the masq code uses the routing to select the m_addr for this new connection. This address is not always the DIP; it can be the preferred source address for the route used, for example an address from another device.
Case (1) happens also for LVS but in this case we know:
- the client address/port (from the received datagram)
- the virtual server address/port (from the received datagram)
- the realserver address/port (from the LVS scheduler)
But this is an out->in packet and we are talking about in->out packets.
Case (2) happens for related connections where the new connection can be created when all addresses and ports are known or when the protocol requires some wildcard address/port matching, for example, ftp. In this case we expect the first packet for the connection after some period of time.
It seems you are interested how case (3) works. The answer is that the NAT code remembers all these addresses and ports in a connection structure with these components
- external address/port (LVS: client)
- masquerading address/port (LVS: virtual server)
- internal address/port (LVS: realserver)
- protocol
- etc
LVS and the masquerading code simply hook into the packet path and perform the header/data mangling. In this process they use the information from the connection table(s). The rule is simple: when a packet belongs to an already established connection we must remember all addresses and ports and always use the same values when mangling the packet header. If we select different addresses or ports each time we simply break the connection. After the packet is mangled the routing is called to select the next hop. Of course, you can expect problems if there are fatal route changes.
So, the short answer is: the LVS knows what m_addr to use when a packet from the realserver is received, because the connection is already created and we know what addresses to use. Only in the plain masquerading case (where LVS is not involved) can connections be created and a masquerading address selected without a rule specifying it. In all other cases there is a rule that determines what addresses are used at creation time. After creation the same values are used.
Wayne wayne (at) compute-aid (dot) com 26 Apr 2000
Any web server behind the LVS box using LVS-NAT can initiate communication to the Internet. However, it is not using the farm IP address; rather it is using the masquerading IP address -- the actual IP address of the interface. Is there an easy way to let a server in NAT mode go out as the farm IP address?
Lars
No. This is a limitation in the 2.2 masquerading code. It will always use the first address on the interface.
We tried and it works! We put the VIP on eth0, and the RIP on eth0:1 in NAT mode, and it works fine. We just need to figure out how to do it during reboot, since this is done by playing with the ifconfig command. Once we swap them around, the outgoing IP address is the VIP address. But if the LVS box reboots, you have to redo it again.
Joe:
! :-) I didn't realise you were in VS-NAT mode, therefore not having the VIP on the realservers. I thought you must be in VS-DR.
The disadvantage of the 2 network LVS-NAT is that the realservers are not able to connect to machines in the network of the VIP. You couldn't make a LVS-NAT setup out of machines already on your LAN, which were also required for other purposes to stay on the LAN network.
Here's a one network LVS-NAT LVS.
                        ________
                       |        |
                       | client |
                       |________|
                     CIP=192.168.1.254
                           |
                           |
                        __________
                       |          |
                       |          | VIP=192.168.1.110 (eth0:110)
                       | director |---|
                       |__________|
                           | DIP=192.168.1.9 (eth0:9)
                           |
         ------------------------------------
         |                  |                 |
         |                  |                 |
  RIP1=192.168.1.2   RIP2=192.168.1.3   RIP3=192.168.1.4 (all eth0)
   _____________      _____________      _____________
  |             |    |             |    |             |
  | realserver  |    | realserver  |    | realserver  |
  |_____________|    |_____________|    |_____________|
The problem:
A return packet from the realserver (with address RIP->CIP) will be sent to the realserver's default gw (the director). What you want is for the director to accept the packet and to demasquerade it, sending it on to the client as a packet with address (VIP->CIP). With ICMP redirects on, the director will realise that there is a better route for this packet, i.e. directly from the realserver to the client and will send an ICMP redirect to the realserver, informing it of the better route. As a result, the realserver will send subsequent packets directly to the client and the reply packet will not be demasqueraded by the director. The client will get a reply from the RIP rather than the VIP and the connection will hang.
The cure:
Thanks to michael_e_brown (at) dell (dot) com and Julian ja (at) ssi (dot) bg for help sorting this out.
To get a LVS-NAT LVS to work on one network -
On the director, turn off icmp redirects on the NIC that is the default gw for the realservers. (Note: eth0 may be eth1 etc, on your machine).
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
Make the director the default and only route for outgoing packets.
You will probably have set the routing on the realserver up like this
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
Note the route to 192.168.1.0/24. This route allows the realserver to send packets to the client by just putting them out on eth0, where the client will pick them up directly (without being demasqueraded) and the LVS will not work. This route also allows the realservers to talk to each other directly i.e. without routing packets through the director. (As the admin, you might want to telnet from one realserver to another, or you might have ntp running, sending ntp packets between realservers.)
Remove the route to 192.168.1.0/24.
realserver:/etc/lvs#route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0 |
This will leave you with
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
Now packets RIP->CIP have to go via the director and will be demasqueraded. The LVS-NAT LVS now works. If LVS is forwarding telnet, you can telnet from the client to the VIP and connect to the realserver. As a side effect, packets between the realservers are also routed via the director, rather than going directly (note: all packets now go via the director). (You can live with that.)
You can ping from the client to the realserver.
You can also connect _directly_ to services on the realserver _NOT_ being forwarded by LVS (in this case e.g. ftp).
You can no longer connect directly to the realserver for services being forwarded by the LVS. (In the example here, telnet ports are not being rewritten by the LVS, i.e. telnet->telnet).
client:~# telnet realserver
Trying 192.168.1.11...
^C
(i.e. connection hangs)
Here's tcpdump on the director. Since the network is switched the director can't see packets between the client and realserver. The client initiates telnet. `netstat -a` on the client shows a SYN_SENT from port 4121.
director:/etc/lvs# tcpdump
tcpdump: listening on eth0
16:37:04.655036 realserver.telnet > client.4121: S 354934654:354934654(0) ack 1183118745 win 32120 <mss 1460,sackOK,timestamp 111425176[|tcp]> (DF)
16:37:04.655284 director > realserver: icmp: client tcp port 4121 unreachable [tos 0xc0]
(repeats every second until I kill telnet on client)
The director doesn't see the connect request from client->realserver. The first packet seen is the ack from the realserver, which will be forwarded via the director. The director will rewrite the ack to be from the director. The client will not accept an ack to port 4121 from director:telnet.
Julian 2001-01-12
The redirects are handled in net/ipv4/route.c:ip_route_input_slow(), i.e. from the routing and before reaching LVS (in LOCAL_IN):
	if (out_dev == in_dev && err && !(flags&(RTCF_NAT|RTCF_MASQ)) &&
	    (IN_DEV_SHARED_MEDIA(out_dev)
	     || inet_addr_onlink(out_dev, saddr, FIB_RES_GW(res))))
		flags |= RTCF_DOREDIRECT;
Here RTCF_NAT and RTCF_MASQ are flags used by the dumb NAT code, but the masquerading defined with ipchains -j MASQ does not set these flags. The result: the redirect is sent according to conf/{all,<device>}/send_redirects from ip_rt_send_redirect() and ip_forward() in net/ipv4/ip_forward.c. So, the meaning is: if we are going to forward a packet and the in_dev is the same as the out_dev, we redirect the sender to the directly connected destination which is on the same shared media. The ipchains code in the FORWARD chain is reached too late to avoid sending these redirects; they have already been sent by the time the -j MASQ is detected.
If all/send_redirects is 1 every <device>/send_redirects is ignored. So, if we leave it 1 redirects are sent. To stop them we need all=0 && <device>=0. default/send_redirects is the value that will be inherited from each new interface that is created.
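A quick way to check the values actually in effect (eth0 is just an example device name); both "all" and the per-device entry must read 0 for redirects to stop:

director# cat /proc/sys/net/ipv4/conf/all/send_redirects
director# cat /proc/sys/net/ipv4/conf/default/send_redirects
director# cat /proc/sys/net/ipv4/conf/eth0/send_redirects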
The logical operation between conf/all/<var> and conf/<device>/<var> is different for each var. The used operation is specified in /usr/src/linux/include/linux/inetdevice.h
For send_redirects it is '||'. For others, for example conf/{all,<device>}/hidden, it is '&&'.
So, for the two logical operations we have:
For &&:
        all     <dev>   result
        ------------------------------
        0       0       0
        0       1       0
        1       0       0
        1       1       1

For ||:
        all     <dev>   result
        ------------------------------
        0       0       0
        0       1       1
        1       0       1
        1       1       1
When a new interface is created we have two choices:
1. to set conf/default/<var> to the value that we want each new created interface to inherit
2. to create the interface in this way:
ifconfig eth0 0.0.0.0 up |
and then to set the value before assigning the address:
echo <val> > conf/eth0/<var> ifconfig eth0 192.168.0.1 up |
but this is risky especially for the tunnel devices, for example, if you want to play with var rp_filter.
For the other devices this is a safe method if there is no problem with the default value before assigning the IP address. The first method can be the safest one but you have to be very careful.
Joe Stump joe (at) joestump (dot) net 2002-09-04
The problem is you have one network that has your realservers, directors, and clients all together on the same class C. For this example we will say they all sit on 192.168.1.*. Here is a simple layout.
  ~~~~~~~~~~~~~
  {  Internet  }------------------------+
  ~~~~~~~~~~~~~                         |
        |                               |
        | IP: 192.168.1.1               | External IP: 166.23.3.4
        |                               |
        |                               |
  +---------------+               +---------+
  |   Director    |---------------| Gateway |
  +---------------+               +---------+
        |
        | Internal IP: 192.168.1.25
        |
        |
        +----------+
        |          |
        |          | IP: 192.168.1.200
        |          |
        |      +--------+
        |      | Client |
  +---------------+ +--------+
  |  Real Server  |
  +---------------+
  IP: 192.168.1.34
Everything looks like it should work just fine right? Wrong. The problem is that in reality all of these machines are able to talk to one another because they all reside on the same physical network. So here is the problem: clients outside of the internal network get expected output from the load balancer, but clients on the internal network hang when connecting to the load balancers.
So what is causing this problem? The routing tables on the directors and the realservers are causing your client to become confused and hang the connection. If you look at the routing table on your realserver you will notice that the gateway for the route to your internal network is 0.0.0.0. Your director will have a similar route. These routes tell your directors and realservers that packets for machines on that network should be sent directly to those machines. So when a request comes to the director, the director routes it to the realserver, but the realserver sends the response directly back to the client instead of routing it back through the director as it should. The same thing happens when you try to connect to the director's outside IP from an internal client IP, only this time the director mistakenly sends directly to the internal client IP. The internal client is expecting the return packets from the director's external IP, not the director's internal IP.
The solution is simple. Delete the default routes on your directors and real servers to the internal network.
route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0 |
The above line should do the trick. One thing to note is that you will not be able to connect to these machines once you have deleted these routes. You might just want to use the director as a terminal server, since you can connect from there to the realservers.
Also, if your realservers connect to DBs and NFS servers on the internal network, you will have to add direct routes to those hosts. You do this by typing:
route add -host $SERVER dev eth0 |
I added these routes to a startup script so it kills my internal routes and adds the needed direct routes to my NFS and DB server during startup.
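Such a script might look like this (a hypothetical sketch; the NFS and DB addresses are made-up examples):

#!/bin/sh
# kill the direct route to the internal network, then add host routes
# to the boxes the realserver still needs to reach directly
route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
route add -host 192.168.1.40 dev eth0   # NFS server (example address)
route add -host 192.168.1.41 dev eth0   # DB server (example address)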
Here's an untested solution from Julian for a one network LVS-NAT (I assume this is old, maybe 1999, because I don't have a date on it).
Put the client in the external logical network. This way the client, the director and the realserver(s) are on the same physical network but the client can't be on the masqueraded logical network. So, change the client from 192.168.1.80 to 166.84.192.80 (or something else). Don't add it through the DIP (I don't see such an IP for the director). Why in your setup is DIP==VIP? If you add a DIP (166.84.192.33 for example) on the director you can later add a path for 192.168.1.0/24 through 166.84.192.33. There is no need to use masquerading with 2 NICs. Just remove the client from the internal logical network used by the LVS cluster.
A different working solution from Ray Bellis rpb (at) community (dot) net (dot) uk
...the same *logical* subnet. I still have a dual-ethernet box acting as a director, and the VIP is installed as an alias interface on the external side of the director, even though its IP address is in fact assigned from the same subnet as the RIPs.
Ray Bellis rpb (at) community (dot) net (dot) uk has used a 2 NIC director to have the RIPs on the same logical network as the VIP (ie RIP and VIP numbers are from the same subnet), although they are in different physical networks.
For LVS-NAT, the packet headers are re-written (from the VIP to the RIP and back again). At no extra overhead, anything else in the header can be rewritten at the same time; in particular, LVS-NAT can rewrite the ports. Thus a request to VIP:80 received on the director can be sent to RIP:8000 on the realserver.
In the 2.0.x and 2.2.x series of IPVS, rewriting the packet headers is slow on machines from that era (60usec/packet on a pentium classic) and limits the throughput of LVS-NAT (for 536byte packets, this is 72Mbit/sec or about 100BaseT). While LVS-NAT throughput does not scale well with the packet rate (after you run out of CPU), the advantage of LVS-NAT is that realservers can have any OS, no modifications are needed to the realserver to run it in an LVS, and the realserver can have services not found on Linux boxes.
Note | |
---|---|
For Local Node, headers are not rewritten. |
The LVS-NAT code for 2.4 was rewritten as netfilter modules and is not detectably slower than LVS-DR or LVS-Tun. (The IPVS code for the early 2.4.x kernels in 2001 was buggy during the changeover, but that has all been fixed.)
from Horms, Jul 2005
With LVS-DR or LVS-Tun, the packet arrives on the realserver with dst_addr=VIP:port. Thus even if you set up two RIPs on the realserver you cannot have two instances of the service demon, because they would both have to be listening on VIP:port. With LVS-NAT you can, since the director rewrites the destination to RIP:port: each demon instance can listen on its own RIP (or port) and be added to the virtual service as a separate realserver entry.
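A sketch of what that looks like with ipvsadm (10.1.1.12 is a hypothetical second IP on the same realserver, each IP with its own demon instance listening on port 80; the VIP is from the earlier example):

director# /sbin/ipvsadm -A -t 192.168.1.110:80 -s rr
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:80 -m -w 1
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.12:80 -m -w 1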
Horms
All things are relative. LVS-NAT is actually pretty fast. I have seen it do well over 600Mbit/s. But in theory LVS-DR is always going to be faster because it does less work. If you only have 100Mbit/s on your LAN then either will be fine. If you have gigabit then LVS-NAT will still probably be fine. Beyond that... I am not sure if anyone has tested that to see what will happen. In terms of number of connections, there is a limit with LVS-NAT that relates to the number of ports. But in practice you probably won't reach that limit anyway.
With the slower machines around in the early days of LVS, the throughput of LVS-NAT was limited by the time taken by the director to rewrite a packet. The limit for a pentium classic 75MHz is about 80Mbit/sec (100baseT). Since the director is the limiting step, increasing the number of realservers does not increase the throughput.
The performance page shows a slightly higher latency with LVS-NAT compared to LVS-DR or LVS-Tun, but the same maximum throughput. The load average on the director is high (>5) at maximum throughput, and the keyboard and mouse are quite sluggish. The same director box operating at the same throughput under LVS-DR or LVS-Tun has no perceptible load as measured by top or by mouse/keyboard responsiveness.
Wayne
NAT takes some CPU and memory copying. With a slower CPU, it will be slower.
Julian Anastasov ja (at) ssi (dot) bg 19 Jul 2001
This is a myth from the 2.2 age. In 2.2 there are 2 input route calls for the out->in traffic and this reduces the performance. By default, in 2.2 (and 2.4 too) the data is not copied when the IP header is changed. Updating the checksum in the IP header does not cost too much time compared to the total packet handling time.
To check the difference between the NAT and DR forwarding method in out->in direction you can use testlvs from http://www.ssi.bg/~ja/ and to flood a 2.4 director in 2 setups: DR and NAT. My tests show that I can't see a visible difference. We are talking about 110,000 SYN packets/sec with 10 pseudo clients and same cpu idle during the tests (there is not enough client power in my setup for full test), 2 CPUx 866MHz, 2 100mbit internal i82557/i82558 NICs, switched hub:
3 testlvs client hosts -> NIC1-LVS-NIC2 -> packets/sec.
I use small number of clients because I don't want to spend time in routing cache or LVS table lookups.
Of course, NAT involves the in->out traffic as well, and this can halve the performance if the CPU or the PCI bus is not powerful enough to handle the traffic in both directions. This is the real reason the NAT method looks so slow in 2.4. IMO, the overhead from the TUN encapsulation or from the NAT processing is negligible.
Here come the surprises:
The basic setup: 1 CPU PIII 866MHz, 2 NICs (1 IN and 1 OUT), LVS-NAT, SYN flood using testlvs with 10 pseudo clients, no ipchains rules. Kernels: 2.2.19 and 2.4.7pre7.
Linux 2.2 (with ipchains support, with modified demasq path to use one input routing call, something like LVS uses in 2.4 but without dst cache usage):
In  80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 99% (strange)
In 110,000 SYNs/sec, Out 88,000 SYNs/sec, CPU idle: 0%
Linux 2.4 (with ipchains support; the first line is with 3-4 ipchains rules):
In  80,000 SYNs/sec, Out 55,000 SYNs/sec, CPU idle: 0%
In  80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 0%
In 110,000 SYNs/sec, Out 63,000 SYNs/sec (strange), CPU idle: 0%
Linux 2.4 (without ipchains support):
In  80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 20%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 2%
Linux 2.4, 2 CPU (with ipchains support):
In  80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 30%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 0%
Linux 2.4, 2 CPU (without ipchains support):
In  80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 45%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 15%, 30000 ctxswitches/sec
What I see is that:
modified 2.2 and 2.4 UP look equal on 80,000P/s
limits: 2.2=88,000P/s, 2.4=96,000P/s, i.e. 8% difference
1 and 2 CPU in 2.4 look equal: 110,000->96,000 (100mbit or PCI bottleneck?); maybe we can't send more than 96,000P/s through a 100mbit NIC?
I performed other tests, testlvs with a UDP flood. The packet rate is lower; the cpu idle time on the LVS box increased dramatically but the client hosts show 0% cpu idle, so maybe more testlvs client hosts are needed.
Julian Anastasov ja (at) ssi (dot) bg 16 Jan 2002
Many people think that the packet mangling is the evil in the NAT processing. The picture is different: the NAT processing in 2.2 uses 2 input routing calls instead of 1 and this totally kills the forwarding of packets from/to many destinations. Such problems are mostly caused by the bad hash function used in the routing code and because the routing cache has a hard limit on entries. Of course, a NAT setup handles more traffic than the other forwarding methods (both the forward and reply directions), a good reason to avoid LVS-NAT with a low-power director. In 2.4 the difference between the DR and NAT processing in the out->in direction cannot be noticed (at least in my tests) because only one route call is used for all methods.
Matthew S. Crocker Jul 26, 2001
DR is faster, less resource intensive but has issues with configuration because of the age old 'arp problem'
Horms horms (at) vergenet (dot) net
LVS-NAT is still fast enough for many applications and is IMHO considerably easier to set up. While I think LVS-DR is great, I don't think people should be under the impression that LVS-NAT will intrinsically be a limiting factor for them.
Don Hinshaw dwh (at) openrecording (dot) com 04 Aug 2001
Cisco, Alteon and F5 solutions are all NAT based. The real limiting factor as I understand it is the capacity of the netcard, which these three deal with by using gigabit interfaces.
Julian Anastasov ja (at) ssi (dot) bg 05 Mar 2002 in discussion with Michael McConnell
Note that I used a modified demasq path which uses one input route for NAT but it is wrong. It only proves that 2.2 can reach the same speed as 2.4 if there was use_dst analog in 2.2. Without such feature the difference is 8%. OTOH, there is a right way to implement one input route call as in 2.4 but it includes rewriting of the 2.2 input processing.
Michael McConnell
From what I see here, it looks as though the 2.2 kernel handles a higher number of SYNs better than the 2.4 kernel. Am I to assume that for the 110,000 SYNs/sec in the 2.4 kernel, only 63,000 SYNs/sec were answered? The rest failed?
In this test 2.4 has firewall rules, while 2.2 has only ipchains enabled.
Is the 2.2 kernel better at answering a higher number of requests?
No. Note also that the testlvs test was only in one direction, no replies, only client->director->realserver
has anyone compared iptables/ipchains, via 2.2/2.4?
Here are my results. There is some magic in these tests; at one point I don't know why netfilter shows such bad results. Maybe someone can point me to the problem.
This originally described how I debugged setting up a one-net LVS-NAT LVS using the output of route. Since it is more about networking tools than LVS-NAT it has been moved to the section on Policy Routing.
If you connect directly to the realserver in a LVS-NAT LVS, the reply packet will be routed through the director, which will attempt to masquerade it. This packet will not be part of an established connection and will be dropped by the director, which will issue an ICMP error.
Paul Wouters paul (at) xtdnet (dot) nl 30 Nov 2001
I would like to reach all LVS'ed services on the realservers directly, i.e. without going through the LVS-NAT director, say from a local client not on the internet.
Connecting from client to a RIP should just completely bypass all the lvs code, but it seems that the lvs code is confused, and thinks a RIP->client answer should be part of its NAT structure.
tcpdump running on internal interface of the director shows a packet from the client received on the RIP; the RIP replies (never reaches the client, the director drops it). The director then sends out a port unreachable:
Julian
The code that replies with an ICMP error can be removed, but then you still have the problem of reusing connections. The local_client can select a port for a direct connection with the RIP, but if that port was used some seconds before for a CIP->VIP connection, it is possible for LVS to catch these replies as part of the previous connection. LVS does not inspect the TCP headers and does not accurately keep the TCP state. So, it is possible that LVS will not detect that the local_client and the realserver have established a new connection with the same addresses and ports that are still known as a NAT connection. Even stateful conntracking can't notice it because the local_clientIP->RIP packets are not subject to NAT processing. When LVS sees the replies from RIP to local_clientIP it will SNAT them and this will be fatal, because the new connection is between the local_clientIP and RIP directly, not from CIP->VIP->RIP. The other thing is that the CIP doesn't even know that it is connecting from the same port to the same server. It thinks there are 2 connections from the same CPORT: to VIP and to RIP, so they can even live at the same time.
But a proper TCP/IP stack on a client will not re-use the same port that quickly, unless it is REALLY loaded with connections right? And a client won't (can't?) use the same source port to different destinations (VIP and RIP) right? So, the problem becomes almost theoretical?
This setup is dangerous. As for the ICMP replies, they are only for anti-DoS purposes but maybe they are going to die soon. There is still not enough reason to remove that code (it was not first priority).
Or make it switchable as an #ifdef or /proc sysctl?
Wensong
Just comment out the whole block, for example,
#if 0
	if (ip_vs_lookup_real_service(iph->protocol,
				      iph->saddr, h.portp[0])) {
		/*
		 * Notify the realserver: there is no existing
		 * entry if it is not RST packet or not TCP packet.
		 */
		if (!h.th->rst || iph->protocol != IPPROTO_TCP) {
			icmp_send(skb, ICMP_DEST_UNREACH,
				  ICMP_PORT_UNREACH, 0);
			kfree_skb(skb);
			return NF_STOLEN;
		}
	}
#endif
This works fine. Thanks
The topic came up again. Here's another similar reply.
I've set up a small LVS_NAT-based http load balancer but can't seem to connect to the realservers behind them via IP on port 80. Trying to connect directly to the realservers on port 80, though, translates everything correctly, but generates an ICMP port unreach.
Ben North ben (at) antefacto (dot) com 06 Dec 2001
The problem is that LVS takes an interest in all packets with a source IP:port of a Real Service's IP:port, as they're passing through the FORWARD block. This is of course necessary --- normally such packets would exist because of a connection between some client and the Virtual Service, mapped by LVS to some Real Service. The packets then have their source address altered so that they're addressed VIP:VPort -> CIP:CPort.
However, if some route exists for a client to make connections directly to the Real Service, then the packets from the Real Service to the client will not be matched with any existing LVS connection (because there isn't one). At this point, the LVS NAT code will steal the packet and send the "Port unreachable" message you've observed back to the Real Server. A fix to the problem is to #ifdef out this code --- it's in ip_vs_out() in the file ip_vs_core.c.
You might want a client on the realserver (i.e. a process unrelated to the services being LVS'ed) e.g. telnet, to connect to the outside world. See clients on realservers.
The LVS-mini-HOWTO states that the lvs client cannot be on the director or any of the realservers, i.e. that you need an outside client. This restriction can be relaxed under some conditions.
This came from a posting by Jacob Reif Jacob (dot) Rief (at) Tiscover (dot) com 25 Apr 2003.
It is common to run multiple websites (Jacob has 100s) on the same IP, using name-based virtual hosting to differentiate the websites. Sometimes webdesigners use some kind of include-function to include content from one website into another, by means of server-side includes (see http://www.php.net/manual/en/function.require.php) using http-subrequests. The include requires a client process running on the webserver, which makes a request to a different website on the same IP. If the website is running on an LVS, then the realservers need to be able to make a request to the VIP. For LVS-DR and LVS-Tun this is no problem: the realserver has the VIP (and the services presented on that IP), so requests by http clients running on the realserver to the VIP will be answered locally.
For LVS-NAT, the services are all running on the RIP (remember, there is no IP with the VIP on realservers for LVS-NAT). Here's what happens when the client on the realserver requests a page at VIP:80
realserver_1 makes a request to VIP:80, which goes to the director. The director demasquerades (rewrites) dst_addr from VIP to RIP_2. realserver_2 then services the request and fires off a reply packet with src_addr=RIP_2, dst_addr=RIP_1. This goes to realserver_1 directly (rather than being masqueraded through the director), but realserver_1 refuses the packet because it expected a reply from VIP and not from RIP_2.
        +-------------+
        |     VIP     |
        |  director   |
        +-------------+
          ^         |
          |         |
          |req      |req
          |         v
  +-------------+       +-------------+
  |    RIP_1    | <---  |    RIP_2    |
  | Realserver  |  ans  | Realserver  |
  |  = client   |  wer  |  = server   |
  +-------------+       +-------------+
Here are the current attempts at solutions to the problem, or you can go straight to Jacob's solution
Julian's solution removes the local routing (as done for one network LVS-NAT) and forces every packet to pass through the director. The director therefore masquerades (rewrites) src_addr=RIP_2 to VIP and realserver_1 accepts the request. This puts extra netload onto the director.
        +-------------+
        |    <vip>    |
        |  director   |
        +-------------+
          |^        |^
       ans||     req||ans
          v|req     v|
  +-------------+  +-------------+
  |   <rip1>    |  |   <rip2>    |
  | Realserver  |  | Realserver  |
  |  = client   |  |  = server   |
  +-------------+  +-------------+
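Julian's approach boils down to the same route surgery used for the one network LVS-NAT above. A sketch on each realserver, reusing the example addresses (realserver network 10.1.1.0/24, DIP=10.1.1.9):

# keep a direct path to the director itself (the DIP, already the default gw)
realserver# route add -host 10.1.1.9 dev eth0
# remove the connected-network route so RIP->RIP traffic also goes via the director
realserver# route del -net 10.1.1.0 netmask 255.255.255.0 dev eth0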
Jacob's solution: The solution proposed here does not put that extra load onto the director. However each realserver always contacts itself (which isn't a problem). Put the following entry into each realserver. Now the realservers can access the httpd on RIP as if it were on VIP.
realserver# iptables -t nat -A OUTPUT -p tcp -d $VIP --dport 80 -j DNAT --to ${RIP}:80 |
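A quick way to check that the rule is doing its job (a sketch; wget is just a convenient client):

realserver# iptables -t nat -L OUTPUT -n -v        # the packet counters increase when the rule matches
realserver# wget -O /dev/null http://$VIP/         # now answered by the local httpd on $RIP:80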
Carlos Lozano clozano (at) andago (dot) com 02 Jul 2004
We have a machine that must be both a client and director. The two problems to solve are
I have written an ip_vs_core.c.diff (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ip_vs_core.c.diff) patch for 2.4.26 using IPVS-NAT. It works correctly in my testcase. The schema is:
External client ---> IPVS:443 --> Local:443 ---> IPVS:80 ---> RealServer |
The problem happens when Local:443 goes to localIPVS:80, because the packet is discarded by the next lines in ip_vs_core.c:
	if (skb->pkt_type != PACKET_HOST || skb->dev == &loopback_dev) {
		IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
			  skb->pkt_type,
			  iph->protocol,
			  NIPQUAD(iph->daddr));
		return NF_ACCEPT;
	}
Ratz
Why do you need this? Seems like a replication of mod_proxy/mod_rewrite. Your patch obviously makes it work but I wonder if such a functionality is really needed.
We are using it like an ssl accelerator. The first ipvs (443) sends the request to localhost:443 or to a different director, and the second ipvs(80), distributes the traffic in the realservers.
Ext. client --> IPVS:443 --> Local:443 --> IPVS:80 --> RealServer1
                   |-> Director2:443          |-> RealServer2
In the first case, it is an "external machine client+director" scheme, but in the second case it is "client+director on the same machine". This part of the patch only handles the outgoing packet; the return is handled by the second part of the patch (which is really a bad hack).
For a mini-HOWTO on using this patch see https_on_localnode. Matt Venn has tested it, it works using the local IP of the director, but not 127.0.0.1.
Note | |
---|---|
Graeme came up with the original idea, Rob Wilson proposed a solution that didn't quite work, Graeme fixed it and then Judd saw an easier solution for the case of only one VIP. I've somewhat mashed the history in my write-up (sorry). |
Graeme Fowler is looking for a solution for realservers that can't use iptables
Graeme Fowler graeme (at) graemef (dot) net 11 Feb 2005
After a long day spent tracing packets through the LVS and netfilter trail whilst trying to do cleverness with policy routing using the iproute2 package, I can condense quite a lot of reading (and trial, mainly followed by error!) down as follows:
Conclusions: mixing policy routing and LVS sounds like a great idea, and probably is if you're using LVS-DR or LVS-TUN. Just with LVS-NAT, it's a no-go (for me at the moment, anyway).
Graeme Fowler graeme (at) graemef (dot) net 2005/03/11
Solved... was Re: LVS-NAT: realserver as client (new thread, same subject!)
I've solved it - in as far as a proof of concept goes in testing. It's yet to be used under load though; however I can't see any specific problems ahead once I move it into production.
The solution of type "4" above involves a "classic" LVS-NAT cluster as follows. Nomenclature after DIP/RIP/VIP classification is "e" for external (ie. public address space), "i" for internal (ie. RFC1918 address space) and numbers to delimit machines.
Director:
  External NIC eth0 - DIPe, VIP1e
  Internal NIC eth1 - DIPi
Realserver 1:
  Internal NIC eth1 - RIP1
Realserver 2:
  Internal NIC eth1 - RIP2
In normal (or "classic" as referred to above) LVS-NAT, the director has a virtual server configured on VIP1e to NAT requests into RIP1 and RIP2. Under these circumstances, as discussed in great length in several threads in Jan/Feb (and many times before), a request from a realserver to a VIP will not work, because:
src            dst
RIP1  SYN  ->  VIP1e
RIP1  SYN  ->  RIP2   (or RIP1, doesn't matter)
RIP2  ACK  ->  RIP1
at this point the connection never completes because the ACK comes from an unexpected source (RIP2 rather than VIP1e), so RIP1 drops the packet and continues sending SYN packets until the application times out. We need a way to "catch" this part of the connection and make sure that the packets don't get dropped. As it turns out, the hypothesis I put forward a month ago works well (rather to my surprise!), and involves both netfilter (iptables) to mangle the "client" packets with an fwmark, and the use of LVS-DR to process them.
What I now have (simplified somewhat, this assumes a single service is being load balanced in a very small cluster):
Director:
  External NIC eth0 - DIPe, VIP1e
  Internal NIC eth1 - DIPi
Realserver 1:
  Internal NIC eth1 - RIP1
  Loopback adapter lo:0 - VIP1e
Realserver 2:
  Internal NIC eth1 - RIP2
  Loopback adapter lo:0 - VIP1e
Then on the director:
/sbin/iptables -t mangle -I PREROUTING -p tcp -i eth1 \
    -s $RIP_NETWORK_PREFIX -d $VIP1e --dport $PORT \
    -j MARK --set-mark $MARKVALUE
and we need a corresponding entry in the LVS tables for this. I'm using keepalived to manage it; yours may be different, but in a nutshell you need a virtual server on $MARKVALUE rather than an IP, using LVS-DR, pointing back to RIP1 and RIP2. Instead of me spamming configs, here's the ipvsadm -Ln output:
director# ipvsadm -Ln
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  92 wlc
  -> $RIP1:$PORT                  Route   100    0          0
  -> $RIP2:$PORT                  Route   100    0          0
(empty connection table right now)
...and believe it or not, that's it. Obviously the more VIPs you have, the more complex it gets but it's all about repeating the appropriate config with different RIP/VIP/mark values.
For ease of use I make the hexadecimal mark value match the last octet of the IP address on the VIP; it makes for easier reading when tracking stats and so on.
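For anyone not using keepalived, roughly the same fwmark virtual service could be created by hand with ipvsadm; a minimal sketch, reusing the mark value 92 and the $RIP placeholders from the listing above:

# virtual service keyed on the fwmark set by the mangle rule above
director# ipvsadm -A -f 92 -s wlc
# realservers added with -g (gatewaying, i.e. LVS-DR), as in the listing
director# ipvsadm -a -f 92 -r $RIP1 -g -w 100
director# ipvsadm -a -f 92 -r $RIP2 -g -w 100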
I've not addressed any random ARP problems yet because they haven't occurred in testing; and one major bonus is that if a connection is attempted from (ooh, let's say, without giving too much away) a server-side include on a virtual host on a realserver to another virtualhost on the same VIP, then it'll get handled locally as long as Apache (in my case) is configured appropriately.
An interesting, and useful, side-effect of this scheme is that when a realserver wants to connect to a VIP which it is handling, it'll connect to itself - which reduces greatly the amount of traffic traversing the RS -> Director -> RS network and means that the amount of actual load-balancing is reduced too.
Rob Wilson rewilson () gmail ! com 2005-08-09
We have an LVS server for testing which is handling 2 VIPs through LVS-NAT (using keepalived). Each of the VIPs currently points to 1 real server - it's a one realserver LVS - just in testing phase at the moment. Both real-servers are on the same internal network.
VIP1 -> Realserver1
VIP2 -> Realserver2
We'd now like Realserver2 to be able to connect to Realserver1 via VIP1. I was able to accomplish this following the solution provided by Graeme Fowler: http://www.in-addr.de/pipermail/lvs-users/2005-March/013517.html However, external connections to VIP1 no longer work while that solution is in place. Dropping the lo:0 interface assigned to VIP1 on Realserver1 fixes this, but then breaks Realserver2 from connecting.
Graeme Fowler graeme () graemef ! net 2005-08-10
Are you doing your testing from clients on the same LAN as the VIP, by any chance? Have you set the netmask on the lo:0 VIP address on the realservers to 255.255.255.255? I can see that making it a /24 mask - 255.255.255.0 - might result in the realservers thinking that the client is actually local to them, thus dropping the packets.
Rob Wilson rewilson () gmail ! com 2005-08-10
That's exactly it. I was hoping it was something daft I misconfigured, so.. wish granted :) It works perfectly now. Thanks for your help (and coming up with the idea in the first place!).
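In other words, the VIP on the realservers' loopback alias wants a host (/32) netmask; a minimal sketch, using the placeholder names from above:

# a /32 netmask stops the realserver treating the whole client network as local
realserver# ifconfig lo:0 $VIP1e netmask 255.255.255.255 up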
Judd Bourgeois simishag (at) gmail (dot) com 19 Jan 2006
I am running LVS-NAT, where the director has two NICs (and two networks). The VIP is on the inside of the director (in the RIP network) (Joe - this functions as a two network LVS-NAT). Some of my web sites proxy to "themselves" within a page (proxy, PRPC, includes, etc.). The symptom is that the proxy functionality breaks. The real server does a DNS lookup for the remote site, gets back the VIP, and hangs waiting for a response.
Previously I solved this problem by putting the site names and 127.0.0.1 in /etc/hosts (as mentioned in this section and in indexing), but after reading the FAQ more carefully tonight, I solved it by simply adding the VIP as a dummy interface on all of the realservers. This appears to be addressed in Graeme's solution, but he runs an extra iptables command on the director. Is this really necessary? Won't any packets originating on the real servers and destined for the VIP be handled by the dummy interface on the real server, without being put on the wire?
It all appears to work fine and has the added nice effect of forcing each realserver to proxy to itself when necessary.
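A sketch of what Judd describes, i.e. putting the VIP on a dummy interface on each realserver so that requests for the VIP are answered locally (the interface name dummy0 and the iproute2 commands are my assumption of how this would be done):

# load the dummy interface and give it the VIP as a host (/32) address
realserver# modprobe dummy
realserver# ip addr add $VIP/32 dev dummy0
realserver# ip link set dummy0 up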
Graeme Fowler graeme (at) graemef (dot) net 1/20/06
What you've suggested is the "single VIP" case of the above idea. It worked for me, it seems to have worked for Rob Wilson, so casting aside the fact that you might have multiple VIPs frontending multiple realserver clusters (as is my case) I can't see any reason why you shouldn't just go for it.
Judd Bourgeois simishag (at) gmail (dot) com 20 Jan 2006
Right. In fact, after reading your solution again, I think your solution is the more useful general case, where there may be an arbitrary number of VIPs, RIPs, and groupings of real servers (which I don't need right now, but I've realized I will need it down the road). I have some Alteons that call these real server groups, not sure what the LVS equivalent is, but here's a short illustration.
Assume 1 director, 3 VIPs, 4 RIPs on 4 real servers. Assume we have real server groups (RG) RG1 (RIP1-2), RG2 (RIP3-4), RG3 (RIP1-4). VIP1 goes to RG1, VIP2 goes to RG2, VIP3 goes to RG3.
In my solution, servers in RG1 can simply put VIP1 and VIP3 on dummy interfaces, but for proxy requests they will only be able to talk to themselves. They will not be able to talk to VIP2. All servers should be able to talk to VIP3. Your solution solves this by using fwmark.
This is a fairly common problem with NAT in general that I have to deal with a lot. Basically, the NAT box will not apply NAT rules to traffic originating and terminating on the NAT box itself. I recall that one workaround for this is to use the OUTPUT chain; I can't find the rules at present, but it seemed to work OK.
Ratz 21 Jan 2006
There is no LVS equivalent of "real server groups". But I think Alteon (Nortel) only has this feature for administrative reasons, so you can assign a group by its identifier to a VIP. What I would love to see with LVS is the VSR approach and a proper, working implementation of VRRP or CARP. I've just recently set up a 2208 switch using one VSR and 2 VIRs, doing failover when either the link or the DGW is no longer reachable. The sexy thing about this setup is that you don't need to fiddle around with arp problems and you don't need NAT, so balancing schedulers can get meaningful L7 information. Alteon's groups are just an administrative layer with an identifier. We could add such a layer in ipvsadm and the IPVS code, however what benefit do you see in such an approach?
One problem I see with the Alteon approach is that if you add an RS to a group, as far as I know it can only belong to one RG. This is a bit suboptimal if you want to use RSs as spillover servers on top of their normal functionality. Regarding your example, I'd like to be able to say that RG1 is a spillover group for RG3. You can specify (IIRC) a spare server for each RG in Alteon OS, however not across RGs. Correct me if I'm wrong, please.
Judd
Graeme's solution solves this by using fwmark.
Yes, fwmark solves almost all problems
Graeme Fowler graeme (at) graemef (dot) net 21 Jan 2006
Judd doesn't need fwmark, because in a single VIP LVS-NAT, with that VIP assigned locally on the realservers on a dummy interface (or loopback alias), the realservers will always answer requests for the VIP locally.
In a two-VIP case (the simplest multiple), if you have two "groups" [0] of realservers, then the director becomes involved by virtue of it being the default gateway for the realservers. At the point the director gets involved you need some way of determining which interface your traffic is on, and segregation via fwmark seems the most elegant way to achieve this (given the known and predictable failure of realservers as clients in LVS-NAT). I know I struggled for months before realising that I could, in effect, combine the use of NAT via an external interface for my real clients, and DR via an internal interface for my "realservers as clients".
[0] I use the word groups in quotes and advisedly, since it appears that Alteon use that in their setup terminology from previous posts.
A NAT router rewrites the source IP (and possibly the port) of packets coming from machines on the inside network. With an LVS-NAT director, the connection originates on the internet and terminates on the realserver (de-masquerading). The replies (from the realserver to the LVS client) are masqueraded. In both cases (NAT router, LVS-NAT director), to the machine on the internet, the connection appears to be coming from the box doing the NAT'ing. However the NAT box has no connection (e.g. with netstat -an) to the box on the internet. It is just routing packets (and rewriting them).
Horms 17 May 2004
There is no connection as such. Or more specifically, the connection is routed, not terminated, by the kernel. However, there is a proc entry that you can inspect to see the NAT'ed connections.
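The proc entry Horms refers to is presumably the IPVS connection table; a quick way to look at the NAT'ed connections (a sketch):

# the kernel's table of LVS connections (including the NAT'ed ones)
director# cat /proc/net/ip_vs_conn
# the same information via ipvsadm (list connections, numeric output)
director# ipvsadm -L -c -n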
Tao Zhao taozhao (at) cs (dot) nyu (dot) edu 01 May 2002
LVS-NAT assumes that all servers are behind the director, so the director only needs to change the destination IP when a request comes in and forward it to the scheduled realserver. When the reply packets go through the director, it changes the source IP. This limits the deployment of LVS using NAT: the director must be the outgoing gateway for all servers.
I am wondering if I can change the code so that both source and destination IPs are changed in both directions. For example, CIP: client IP; DIP: director IP; SIP: server IP (public IPs):
Client->Director->Server: address pair (CIP, DIP) is changed to (DIP, SIP)
Server->Director->Client: address pair (SIP, DIP) is changed to (DIP, CIP).
Lars
Not very efficient; but this can actually already be done by using the port-forwarding feature AFAIK, or by a userspace application level gateway. I doubt its efficiency, since the director would _still_ need to be in between all servers and the client both ways. Direct routing and/or tunneling make more sense. As well, the realservers would not know where the connection originally came from, making the logs on them nearly useless; filtering by client IP and establishing a session back to the client (ie, ftp or some multimedia protocols) also becomes very difficult.
Wayne wayne (at) compute-aid (dot) com 01 May 2002
The client IP address is very important for traffic analysis by the marketing people. Getting rid of the CIP means the web server has no way to log where the traffic is coming from, leaving the marketing people totally blind, which is very undesirable for many uses. Do you have to allocate a table for tracking these changes, too? That will further slow down the director.
Of course, the director needs to allocate a new port number and change the source port number to it when it forwards the packet to the server. This local port number should then be enough for the director to distinguish the different connections. This way, there is no limitation on where the servers are (the tunneling solution needs changes on the server: setting up tunneling).
Joe
I talked to Wensong about this in the early days of LVS, but I remember thinking that keeping track of the CIP would have been a lot of work. I think I mentioned it in the HOWTO for a while. However I'd be happy to use the code if someone else wrote it :-)
Some commercial load balancers seem to have some NAT-like scheme where the packets can return directly to the CIP without going through the director. Does anyone know how it works? (Actually I don't know whether it's NAT-like or not, I think there's some scheme out there that isn't VS-DR which returns packets directly from the realservers to the clients - this is called "direct server return" in the commercial world).
Wayne wayne (at) compute-aid (dot) com
I think those are switch-like load balancers. They don't take any IP addresses. But I think it could be done even with NAT, as long as the server has two NICs, one talking to the load balancer, the other talking to the switch/hub in front of the load balancer. The load balancer has to rewrite the packet so that it doesn't have its own IP in it, so there is no need to NAT back to the public packet. The server sets its default gateway out the other NIC to send the packets out.
frederic (dot) defferrard (at) ansf (dot) alcatel (dot) fr
Would it be possible to use LVS-NAT to load-balance virtual IPs to ssh-forwarded real IPs? Ssh can be used to create a local access point that is forwarded to a remote service through the ssh protocol. For example you can use ssh to securely map a local access to a remote POP server:
local:localport ==> local:ssh ~~~~~ ssh port forwarding ~~~~~ remote:ssh ==> remote:pop

When you connect to local:localport you are transparently/securely connected to remote:pop. The main idea is to allow realservers in different LANs, with realservers that are non-Linux (precluding LVS-Tun). Example:
                                 - VS:81 ---- ssh ---- RS:80
                                /
INTERNET - - - - > VS:80 (NAT)-- - VS:82 ---- ssh ---- RS:80
                                \
                                 - VS:83 ---- ssh ---- RS:80
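For concreteness, the kind of per-realserver ssh forward being described could be set up roughly like this (a sketch; the port numbers, hostname and user are illustrative, and a bind address or -g may be needed so the tunnel entrance is reachable on more than loopback):

# forward port 81 on the director over ssh to port 80 on realserver rs1;
# repeat with 82/rs2 and 83/rs3 for the other realservers
director# ssh -N -L 81:localhost:80 user@rs1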
Wensong
You can use a VPN (or CIPE) to map some external realservers into your private cluster network. If you use LVS-NAT, make sure the routing on the realserver is configured properly so that the response packets go back through the load balancer to the clients.
I think it isn't necessary to have the default route pointing to the load balancer when using ssh, because the RS address is the same as the VS address (different ports).
With the NAT method, your example won't work because LVS/NAT treats the packets as local ones and forwards them to the upper layers without any change.
However, your example gives me an idea: we could dynamically redirect port 80 to ports 81, 82 and 83 respectively for different connections, then your example would work. However, the performance won't be good, because a lot of work is done at the application level, and the overhead of copying between kernel and user space is high.
Another thought is that we might be able to set up LVS/DR with realservers in different LANs by using the CIPE/VPN stuff. For example, we use CIPE to establish tunnels from the load balancer to the realservers like
                 10.0.0.1================10.0.1.1 realserver1
                 10.0.0.2================10.0.1.2 realserver2
Load Balancer ---10.0.0.3================10.0.1.3 realserver3
                 10.0.0.4================10.0.1.4 realserver4
                 10.0.0.5================10.0.1.5 realserver5
Then, you can add LVS-DR configuration commands as:
ipvsadm -A -t VIP:www
ipvsadm -a -t VIP:www -r 10.0.1.1 -g
ipvsadm -a -t VIP:www -r 10.0.1.2 -g
ipvsadm -a -t VIP:www -r 10.0.1.3 -g
ipvsadm -a -t VIP:www -r 10.0.1.4 -g
ipvsadm -a -t VIP:www -r 10.0.1.5 -g
I haven't tested it. Please let me know the result if anyone tests this configuration.
Lucas 23 Apr 2004
Is it possible to use the cluster as a NAT router? What I'm saying is: I have a private LAN and I want to share my internet connection, doing NAT, firewalling and QoS. The realservers are actually routers and don't serve any service. Is there a way to use the VIP as the private LAN gateway, or to pass the traffic through the director to the "realservers" (really routers) even when it is not destined to a specific port on the server?
Horms 21 May 2004
I think that should work, as long as you are only wanting to route IPv4 TCP, UDP and related ICMP. You probably want to use a fwmark virtual service so that you can forward all ports to the realservers (routers). That said I haven't tried it, so I can't be sure.
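A rough sketch of what such a setup could look like, under my own assumptions (the private LAN is 192.168.1.0/24 on eth1, the realserver/routers are 10.1.1.2 and 10.1.1.3, and the mark value 1 is arbitrary):

# mark all traffic arriving from the private LAN
director# iptables -t mangle -A PREROUTING -i eth1 -s 192.168.1.0/24 -j MARK --set-mark 1
# balance the marked traffic (all TCP/UDP ports) across the realserver routers with LVS-NAT
director# ipvsadm -A -f 1 -s rr
director# ipvsadm -a -f 1 -r 10.1.1.2 -m
director# ipvsadm -a -f 1 -r 10.1.1.3 -m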
Note | |
---|---|
Mar 2006: This will be in the next release of LVS. |
Ken Brownfield found that ipvs routes packets leaving the director via the default route (0/0) (i.e. with LVS-NAT, or LVS-DR with the forward-shared patch). The packets from ipvs should use the routing table, but they don't. Ken had a director with two external NICs. He wanted the packets to return via the NIC on which they arrived. When he tried LVS-NAT with his own installed routing table (which works when tested with traceroute), the reply packets from ip_vs were sent to the default gw, apparently ignoring his routing table. It should be none of ip_vs's business where the packets are routed.
Here's Ken's ip_vs_source_route.patch.gz patch.
Here's Ken's take on the matter
I need to support VIPs on the director that live on two separate external subnets:
      |        |
      | eth0   | eth1         eth0 = ISP1_IP on ISP1_SUBNET
----------------------        eth1 = ISP2_IP on ISP2_SUBNET
|      Director      |
----------------------
  internal |
           |
The default gateway is on ISP1_SUBNET/eth0, and I have source routes set up as follows for eth1:
# cat /etc/SuSE-release
SuSE Linux 9.0 (i586)
VERSION = 9.0
# uname -a
Linux lvs0 2.4.21-303-smp4G #1 SMP Tue Dec 6 12:33:10 UTC 2005 i686 i686 i386 GNU/Linux
# ip -V
ip utility, iproute2-ss020116
# ip rule list
0:      from all lookup local
32765:  from ISP2_SUBNET lookup 136
32766:  from all lookup main
32767:  from all lookup default
# ip route show table 136
ISP2_SUBNET dev eth1 scope link src ISP2_IP
default via ISP2_GW dev eth1
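For reference, the rule and routes shown above would typically have been created with something like the following (a sketch using the same placeholders):

# policy rule: traffic sourced from ISP2's subnet consults table 136
ip rule add from ISP2_SUBNET table 136 prio 32765
# table 136: link route for the subnet, plus a default via ISP2's gateway
ip route add ISP2_SUBNET dev eth1 scope link src ISP2_IP table 136
ip route add default via ISP2_GW dev eth1 table 136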
If I perform an mtr/traceroute on the director bind()ed to the ISP2_IP interface, outgoing traceroutes traverse the proper ISP2_GW, and the same for the ISP1_IP interface. I'm pretty sure the source-route behavior is correct, since I can revert from the proper behavior by dropping table 136.
For a single web service, I'm defining identical VIPs but for each of the ISPs:
-A -t ISP1_VIP:80 -s wlc
-a -t ISP1_VIP:80 -r 10.10.10.10:80 -m -w 1000
-a -t ISP1_VIP:80 -r 10.10.10.11:80 -m -w 800
-A -t ISP2_VIP:80 -s wlc
-a -t ISP2_VIP:80 -r 10.10.10.10:80 -m -w 1000
-a -t ISP2_VIP:80 -r 10.10.10.11:80 -m -w 800
Incoming packets come in via the proper gateway, but LVS always emits response packets through the default gateway, seemingly ignoring the source-route rules.
I've seen Henrick's general fwmark state tracking described. Reading this, it seems like this patch isn't exactly approved or even obviously available. And the article is from 2002. :)
I'm also not sure why this seems like such a difficult problem. If LVS honored routes, there would be no complicated hacks required. Unless LVS overrides routes, in which case it might be nice to have a switch to turn off that optimization.
I understand that routes are a subset of the problem fixed by the patch, and I can see the value of the patch. But for the basic route case it seems odd for LVS to just dump all outgoing packets to the default gw. I mean, it could cache the routing table instead of just a single gw?
From what I can tell, the SH scheduler decides which realserver will receive an incoming request based on the external source IP in the request. I can see four problems with this.
The docs state "Multiple gateway setups can be solved with routing and a solution is planned for LVS." Which seems to imply that source routing is a fix but sort of not... :(
Scanning the nfct patch and looking at the icmp handling, I'm pretty sure the problem is that ip_vs_out() is sending out the packet with a route calculated from the real server's IP. Since ip_vs_out() is reputedly only called for masq return traffic, I think this is just plain incorrect behavior.
I pulled out the route_me_harder() mod and created the attached patch. My only concern would be performance, but it seems netfilter's NAT uses this.
First, I need to correct the stated provenance of this patch. It is a small tweaked subset of an antefacto patch posted to integrate netfilter's connection tracking into LVS, not the nfct patches as I said. Lots of Googling, not enough brain cells. This patch applies to v1.0.10, but appears to be portable to 2.6.
During a maintenance window this morning, I had the opportunity to test the patch.
The first time I ever loaded the patched module, shockingly, it worked perfectly -- outbound traffic from masq VIPs now follows source-routes and chooses the correct outbound gateway. No side effects so far, no obvious increased load.
I also poked around the 2.6 LVS source a bit to see if this issue had been resolved in later versions, and noticed uses of ip_route_output_key, but the source address was always set to 0 instead of something more specific. I'd say it might be worth a review of the LVS code to make sure source addresses are set usefully, and routes are recalculated where necessary.
In any case, if anyone has a similar problem with VIPs spanning multiple external IP spaces and gateways, this has been working like a charm for me in significant production load. So far. *knock*on*wood* I'll update if it crashes and/or burns.
Joe
any idea what would happen if there were multiple VIPs or the packets coming into the director from the outside world were arriving at the LVS code via a fwmark?
To my understanding, Henrick's fwmark patch allows LVS to route traffic based on fwmarks set by an admin in iptables/iproute2. I can imagine certain complex situations where this functionality could be useful and even crucial, but setup and maintenance of fwmarks requires specifically coded fwmark behavior in each of netfilter, iproute2, and ip_vs.
Source routes are essentially a standard feature these days, and are critical for proper routing on gateways and routers (which is essentially what a director is in Masq mode). Having LVS properly observe the routing table is a "missing feature", I believe. The patch I created requires no changes for an admin to make (no fwmarks to set up in ip_vs, netfilter, *and* iproute2), basically just properly and transparently observing routes set by iproute2 (which the rest of the director's traffic already obeys).
So short answer: Henrick's patch allows VIP routing based on fwmarks specifically created/handled by an admin for that purpose, whereas mine is a minor correction to existing code to properly recalculate the routes of outbound VS/NAT VIP traffic after mangling/masquerading of the source IP. A little end-result crossover, but really quite different. My (borrowed :) patch is essentially a one-liner, so the code complexity is very small and the behavior easily confirmable at a glance. The fwmark code is more invasive, seemingly.
Technically, I could have used fwmarks, but until someone needs that specific functionality, I suspect proper source-routing covers 90% of the alternate use cases. And it's the cleaner, more specific solution to my problem. But that's just me. :)
Your summary of SH matches my understanding -- it's hash-based persistence calculated from the client's source IP (vs destination in DH). It probably generates a good random, persistent distribution, which I can see being useful in a cluster environment where persistence is rewarded by caching/sessions/etc. WLC with persistence is probably a better bet for a load-balancer config, since it actually balances load. Without something like wackamole on the real servers, rr/sh/dh are happy to send traffic to dead servers, AFAICT.
Ken Brownfield krb (at) irridia (dot) com 22 Mar 2006
I'm attaching ip_vs_source_route.patch.gz, which is the patch itself. It patches ip_vs_core.c, adding a function call at the end of ip_vs_out() that recalculates the route for an outgoing packet after mangling/masquerading has occurred.
ip_vs_out(), according to the comments in the source (and my brief perusal of the code) is "used only for VS/NAT." There should be no effect on DR/TUN functionality as far as I can tell. This type of route recalc might be correct behavior in some TUN or DR circumstances, but I have no experience in a DR/TUN setup. So yes, I believe this patch is orthogonal to DR/TUN functionality and should be silent with regard to DR/TUN.
The only concern a user should have after applying this patch is that they make sure they are aware of existing source routes before using the patch. Users may be unknowingly relying on the fact that LVS always routes traffic based on the real server's source IP instead of the VIP IP, and applying the patch could change the behavior of their system. I suspect that will be a very rare concern.
As long as the source routes on the system are correct, where the source IP == the VIP IP, packets from LVS will be routed as the system itself routes packets. Routes confirmed with a traceroute (bound to a specific IP on the director) will no longer be ignored for traffic outbound from a NAT VIP.
Joe: next Farid Sarwari stepped in
Farid Sarwari fsarwari (at) exchangesolutions (dot) com 25 Jul 2006
I'm having some issues with IPVS and IPSec. When a HTTP client requests a page, I can see the traffic come all the way to the webserver (ws1,ws2). However, the return traffic gets to the load balancer but does not make it through the ipsec tunnel. When doing a tcpdump I can see that the packets get SNATed by ipvs. I know there is a problem with ipsec2.6 and SNAT, and I've upgraded my kernel and iptables so now SNAT with iptables works. But it looks like ipvs is doing its own SNAT which doesn't pass through the ipsec tunnel.
My setup: HTTP Clients ------- | \ -- Ipsec tunnel / | +------------+ |LoadBalancer| | ipsec2.6 | | ipvs | +------------+ | /\ / \ / \ +-----+ +-----+ | ws1 | | ws2 | +-----+ +-----+ Ldirector.conf: virtual=x.x.x.x:80 #<public ip> real=y.y.y.1:80 masq real=y.y.y.2:80 masq checktype=negotiate fallback=127.0.0.1:80 masq service=http request="/" receive=" " scheduler=wlc protocol=tcp ------------------ ipvsadm -ln output: P Virtual Server version 1.2.1 (size=4096) Prot LocalAddress:Port Scheduler Flags -> RemoteAddress:Port Forward Weight ActiveConn InActConn TCP x.x.x.x:80 wlc -> y.y.y.1:80 Masq 1 0 0 -> y.y.y.1:80 Masq 1 0 0 ------------------ Software Version #s: ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0) Linux Kernel 2.6.16 iptables v1.3.5 ldirectord version 1.131 |
The Brownfield patch is for an older version of ipvs. When I was applying the patch, hunk #3 failed. I was able to apply the third hunk manually. When I compile, it gives errors for the code from the first hunk of the patch.
Finally got it to work! I can access load balanced pages through ipsec. Ken Brownfield's patch seems to have been for an older version of the kernel/ipvs. If you look in the patch, there is a function called ip_vs_route_me_harder which is an exact copy of ip_route_me_harder from netfilter.c. I'm not sure what version of ipvs/kernel Brownfield's patch is for. I couldn't get ipvs to compile with his patch, so I just used his idea and copied the new code from the netfilter source. I've modified his patch by copying the new ip_route_me_harder function from net/ipv4/netfilter.c (2.6.16). Below is the patch for kernel 2.6.16 (kernel sources from FC4).
IPVS Version: $Id: ip_vs_core.c,v 1.34 2003/05/10 03:05:23 wensong Exp ------snip-------- --- ip_vs_core.c.orig 2006-03-20 00:53:29.000000000 -0500 +++ ip_vs_core.c 2006-07-27 14:31:14.000000000 -0400 @@ -43,6 +43,7 @@ #include <net/ip_vs.h> +#include <net/xfrm.h> EXPORT_SYMBOL(register_ip_vs_scheduler); EXPORT_SYMBOL(unregister_ip_vs_scheduler); @@ -516,6 +517,76 @@ return NF_DROP; } +/* This code stolen from net/ipv4/netfilter.c */ + +int ip_vs_route_me_harder(struct sk_buff **pskb) +{ + struct iphdr *iph = (*pskb)->nh.iph; + struct rtable *rt; + struct flowi fl = {}; + struct dst_entry *odst; + unsigned int hh_len; + + /* some non-standard hacks like ipt_REJECT.c:send_reset() can cause + * packets with foreign saddr to appear on the NF_IP_LOCAL_OUT hook. + */ + if (inet_addr_type(iph->saddr) == RTN_LOCAL) { + fl.nl_u.ip4_u.daddr = iph->daddr; + fl.nl_u.ip4_u.saddr = iph->saddr; + fl.nl_u.ip4_u.tos = RT_TOS(iph->tos); + fl.oif = (*pskb)->sk ? (*pskb)->sk->sk_bound_dev_if : 0; +#ifdef CONFIG_IP_ROUTE_FWMARK + fl.nl_u.ip4_u.fwmark = (*pskb)->nfmark; +#endif + if (ip_route_output_key(&rt, &fl) != 0) + return -1; + + /* Drop old route. */ + dst_release((*pskb)->dst); + (*pskb)->dst = &rt->u.dst; + } else { + /* non-local src, find valid iif to satisfy + * rp-filter when calling ip_route_input. */ + fl.nl_u.ip4_u.daddr = iph->saddr; + if (ip_route_output_key(&rt, &fl) != 0) + return -1; + + odst = (*pskb)->dst; + if (ip_route_input(*pskb, iph->daddr, iph->saddr, + RT_TOS(iph->tos), rt->u.dst.dev) != 0) { + dst_release(&rt->u.dst); + return -1; + } + dst_release(&rt->u.dst); + dst_release(odst); + } + + if ((*pskb)->dst->error) + return -1; + +#ifdef CONFIG_XFRM + if (!(IPCB(*pskb)->flags & IPSKB_XFRM_TRANSFORMED) && + xfrm_decode_session(*pskb, &fl, AF_INET) == 0) + if (xfrm_lookup(&(*pskb)->dst, &fl, (*pskb)->sk, 0)) + return -1; +#endif + + /* Change in oif may mean change in hh_len. */ + hh_len = (*pskb)->dst->dev->hard_header_len; + if (skb_headroom(*pskb) < hh_len) { + struct sk_buff *nskb; + + nskb = skb_realloc_headroom(*pskb, hh_len); + if (!nskb) + return -1; + if ((*pskb)->sk) + skb_set_owner_w(nskb, (*pskb)->sk); + kfree_skb(*pskb); + *pskb = nskb; + } + + return 0; +} /* * It is hooked before NF_IP_PRI_NAT_SRC at the NF_IP_POST_ROUTING @@ -734,6 +805,7 @@ struct ip_vs_protocol *pp; struct ip_vs_conn *cp; int ihl; + int retval; EnterFunction(11); @@ -821,8 +893,20 @@ skb->ipvs_property = 1; - LeaveFunction(11); - return NF_ACCEPT; + /* For policy routing, packets originating from this + * machine itself may be routed differently to packets + * passing through. We want this packet to be routed as + * if it came from this machine itself. So re-compute + * the routing information. + */ + if (ip_vs_route_me_harder(pskb) == 0) + retval = NF_ACCEPT; + else + /* No route available; what can we do? */ + retval = NF_DROP; + + LeaveFunction(11); + return retval; drop: ip_vs_conn_put(cp); ------snip-------- |
Joe
Can you do IPSec with LVS-DR? (the director would only decrypt and the realservers encrypt)
I haven't tried it, but I don't see why it shouldn't work. It's probably easier to get working than LVS-NAT with IPSec :) You can think of IPSec as just another interface, except that with kernel 2.6 there is no more ipsec0 interface. So as long as routing is set up correctly, LVS-DR should work with IPSec.
so you have an ipsec0 interface and you can put an IP on it and route to/from it just like with eth0? Can you use iproute2 tools on ipsec0?
With Kernel 2.6 there is no more ipsec0 interface, but you can use iproute2 to alter the routing table. You wouldn't want to modify the routes to the tunnel because ipsec takes care of that, but you can modify routes for traffic that is coming through the tunnel destined for LVS-DR.
Ken Brownfield krb (at) irridia (dot) com 28 Jul 2006
At first glance, that's exactly what had to be ported, and I'm glad someone with enough 2.6 fu did it. Now, if someone could have it conditional on a proc/sysctl, it would seem like more of a no-brainer for inclusion. ;)
Joe: next David Black stepped in
David Black dave (at) jamsoft (dot) com 28 Jul 2006
I applied the following patch to a stock 2.6.17.7 kernel, and enabled the source routing hook via /proc/sys/net/ipv4/vs/snat_reroute: http://www.ssi.bg/~ja/nfct/ipvs-nfct-2.6.16-1.diff LVS-NAT connections now appear to obey policy routing - yay!
Referring to an older version of the NFCT patch, Ken Brownfield says in the LVS HOWTO: "I pulled out the route_me_harder() mod and created the attached patch." So the Brownfield patch is a derivative of the NFCT patch in the first place.
And here's a comment from the NFCT patch I used:
/* For policy routing, packets originating from this * machine itself may be routed differently to packets * passing through. We want this packet to be routed as * if it came from this machine itself. So re-compute * the routing information. |
For a patched kernel, that functionality is enabled by
echo 1 > /proc/sys/net/ipv4/vs/snat_reroute |
Farid Sarwari fsarwari (at) exchangesolutions (dot) com 31 Jul 2006
The problem I was having with ipvs was that I couldn't access it through ipsec on kernel 2.6. I remember accessing ipvs through ipsec on 2.4 a few years ago, and I don't remember running into this problem. Correct me if I'm wrong, but prior to kernel 2.6.16, SNAT (netfilter) didn't work properly with ipsec. When troubleshooting my problem it looked like the NAT'ing was happening after the routing decision had been made. This is why I was under the assumption that only code from kernel 2.6.16+ would fix my problem. If the nfct patch works with ipsec, I would much rather use that.
Joe
If Julian's patch had been part of the kernel ipvs code, would anyone have had source routing/iproute2 problems with LVS-NAT?
Ken 9 Aug 2006
I don't believe so -- the source-routing behavior appears to be a (happy) side-effect of working NFCT functionality. I think the NFCT and source-routing patches' intentions are to supply a feature and a bug-fix, respectively, but NFCT is an "accidental" superset.
Stephen Milton smmilton (at) gmail (dot) com 12/17/05
This may be old hat to many of you on this list, but I had a lot of problems deciphering all the issues around FTP in load balanced NAT. So I wrote up the howto on how I got my configuration to work. I was specifically trying to setup for high availability, load balanced, FTP and HTTP with failover and firewalling on the load balancer nodes. Here is a permanent link to the article: load_balanced_ftp_server (http://sacrifunk.milton.com/b2evolution/blogs/index.php/2005/12/17/load_balanced_ftp_server)
Michael Green mishagreen (at) gmail (dot) com
Is it possible to make Apache's IP based vhosts work under LVS-NAT?
Graeme Fowler graeme (at) graemef (dot) net 14 Dec 2005
If, by that, you mean Apache vhosts whereby a single vhost lives on a single IP then the answer is definitely "yes", although it may seem counter-intuitive at first.
If you're using IP based virtual hosting, you have a single IP address for *each and every* virtual host. In the 'classic' sense this means your server has one, two, a hundred, a thousand IP addresses configured (as aliases) on its interface which faces the internet, and a different vhost listens on each address.
In the clearest case of LVS-NAT, you'd have your public interface on the director handle the one, two, a hundred, a thousand _public_ IP addresses and present those to the internet (or your clients, be those as they are). Assuming you have N realservers, you then require N*(one, two, a hundred, a thousand) private IP addresses and you configure up (one, two, a hundred, a thousand) aliases per virtual server. You then setup LVS-NAT to take each specific public IP and NAT it inbound to N private IPs on the realservers.
Still with me? Good.
This is a network management nightmare. Imagine you had 256 Virtual IPs, each with 32 servers in a pool. You immediately need to manage an entire /19 worth of space behind your director. That's a lot of address space (8192 addresses to be precise) for you to be keeping up with, and it's a *lot* of entries in your ipvsadm table.
There is, however, a trick you can use to massively simplify your addressing:
Put all your IP based vhosts on the same IP but a *different port* on each realserver. Suddenly you go from 8192 realserver addresses (aliases) to, well, 32 addresses (aliases) with 256 ports in use on each one. Much easier to manage.
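As a sketch of the trick (the addresses and ports are placeholders of my own), each public vhost address becomes a virtual service that maps to the same realserver addresses but a distinct port:

# vhost 1 -> port 8001 on every realserver, vhost 2 -> port 8002, and so on
ipvsadm -A -t $PUB_VHOST1:80 -s wlc
ipvsadm -a -t $PUB_VHOST1:80 -r $RIP1:8001 -m
ipvsadm -a -t $PUB_VHOST1:80 -r $RIP2:8001 -m
ipvsadm -A -t $PUB_VHOST2:80 -s wlc
ipvsadm -a -t $PUB_VHOST2:80 -r $RIP1:8002 -m
ipvsadm -a -t $PUB_VHOST2:80 -r $RIP2:8002 -m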
For even more trickery you could probably make use of some of keepalived's config tricks to "pool" your realservers and make your configuration even more simple, but if you only have a small environment you may want to get used to using ipvsadm by hand first until you're happy with it.
These are musings from the mailing list about a type of functionality we don't have in LVS.
Paulo F. Andrade pfca (at) mega (dot) ist (dot) utl (dot) pt 11 Jul 2006
What I want is the following:
LVS-NAT only does DNAT, meaning CIP->VIP changes to CIP->RIP and the response from RIP->CIP to VIP->CIP. The problem is that after LVS changes the VIP to RIP for inbound connections, it seems that packets don't traverse the POSTROUTING chain to get SNAT'ed. Is there a workaround for this?
Graeme Fowler graeme (at) graemef (dot) net
Surely what you're asking for is a proxy rather than a director?
malcolm lists (at) netpbx (dot) org 13 Jul 2006
I think this is what F5 calls SNAT mode (which confuses LVS people). It's really nifty and flexible and I don't see why it can't be done at layer 4 with LVS... but LVS would need to be moved to the FORWARD chain rather than the INPUT one. It's pretty similar to LVS-NAT, so not really a proxy (it's not looking at the packet contents). I think it would be a massive improvement... I'd even consider sponsoring someone to do it.
F5 will also check the response of the real server and if it fails re-send the commands from the cache to another server... nice but definitely layer7 proxy stuff...
You want this because:
I think Kemp technologies have managed to do it with their LVS implementation... I get a lot of customers who have their real servers so locked down they can't modify them at all.
Joe - Here's an example of someone wanting to use F5-SNAT
Hoffman, Jon Jon (at) Hoffman (at) acs-inc (dot) com 11 Oct 2006
I have two networks that are physically located in different places (let's say city X and city Y). In city X we have our web servers, run by our team there. In city Y we have our load balancer, which we are trying to set up as a demo to show how LVS works. We can not set the default gateway of our web servers to be the load balancer because we are trying to test LVS and can not take our web servers out of production to test a new load balancer. And we want to see the load balancing working with our present servers. What is happening is our client makes a request to our director, the director sends the request to our web server and the web server responds directly back to the client, which has no idea why that server is sending the packet to it.
It does not make sense as to why I can not masquerade the request to the real server. For example, to really strip things down, say I have the following
Client: 192.168.10.10
director: 192.168.10.250
realserver: 172.18.1.200, 172.18.1.201, 172.18.1.202

The client makes a request to the director, which then makes the request to one of the realservers, but (according to my tcpdump) the request appears to come from the client (192.168.10.10), therefore the realserver tries to send the response directly back to the client. Is there a way to make the request to the realserver appear to come from the director, so the realserver sends the response back to the director (without changing the default gw on the realserver) rather than to the client? It just seems like there should be a way to do this.
Malcolm lists (at) loadbalancer (dot) org 11 Oct 2006
Unfortunately the answer is no. Packets can't be SNAT'd after being LVS'd. In my limited understanding this is because LVS bypasses netfilter after it has grabbed the packets from the INPUT chain
Joe - ip_vs would have to be in the FORWARD chain for this to work.
Nicklas Bondesson nicklas (dot) bondesson (at) mindping (dot) com 24 Feb 2007
The SNAT rule does not work without the NFCT patch - this is why I got my hands on the patch in the first place. I have scenarios like this:
Request:  CLIENT -> VIP[with_public_ip_1] -> A_REAL_SERVER[private_ip_1]
Response: A_REAL_SERVER[private_ip_1] -> VIP[with_public_ip_1] -> CLIENT
---
Request:  CLIENT -> VIP[with_public_ip_2] -> A_REAL_SERVER[private_ip_2]
Response: A_REAL_SERVER[private_ip_2] -> VIP[with_public_ip_2] -> CLIENT
I'm not sure if I'm being clear here, but in simple words: the same public IP address that the client uses to connect to the LVS should be used as the source IP in the response to the client. I have multiple public IP addresses that I need to source NAT. The firewall is on the same box as the director. Any pointers?
Julian 24 Feb 2007
Aha, I see why you are using snat_reroute. But I want to note the following things:
OK, but what do you see? What is the real problem? Are packets dropped and never reach the uplink router, or are they not routed properly when you have 2 or more uplinks? Do you have source-based IP rules?
Nicklas Bondesson nicklas (dot) bondesson (at) mindping (dot) com
I am still unable to SNAT traffic leaving the box. I'm running the director and firewall on the same box. This is how I do SNAT:
iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source 11.22.33.44 |
Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 23 Feb 2007
Nicklas, if you mean masquerading of LVS-DR client IPs on their way to the realservers, you can try my approach described below.
When I tried Julian's patch several months ago (I am not sure if this has changed), I found it not suitable for use on a director that would also do masquerading (SNAT) of client IPs. I have learned that when Julian says "SNAT" he means processing of packets coming from an LVS-NAT driven realserver (OUT direction) while forwarding them to clients. IN direction packets never pass through the netfilter nat POSTROUTING hook (nor conntrack POSTROUTING); they are sent out directly by ip_vs_out() with optional ip_vs_conntrack_confirm().
Some time ago I set up an LVS-DR based internet gateway that not only accepts connections from the internet to a VIP and redirects them to several RIPs (typical IPVS usage), but also redirects connections from intranet clients to the internet through several DSL/FrameRelay links, or rather their respective routers acting as realservers (similar to the LVS driven transparent cache cluster case). As I have no control over these routers (they are managed by their respective providers), I have to do masquerading (or SNAT) on the director itself to avoid putting several more boxes in between. In order to achieve this functionality, I started with a "hardware" method of sending IN packets back to the director via several vlans set up over 2 additional network interfaces connected with a crossover cable. Then I created a small patch that affects processing of LVS-DR packets only (and bypass as well), so they are not caught by ip_vs_out() and just travel through all netfilter POSTROUTING hooks, including nat and conntrack. This solution works for me as expected.
In my opinion, Julian's patch is particularly suitable for LVS-NAT, where any other approach would probably not work at all. Furthermore, it looks to me like Julian's way (or maybe any way) of connection tracking might not be applicable to LVS-TUN, where packets leaving the director are encapsulated before they reach ip_vs_out(). But for LVS-DR there are probably at least two good ways: Julian's, but without masquerading, and my own, which I have successfully used for several months now.
My patch applies cleanly against debian linux-source-2.6.18-3 version 2.6.18-7 and is also available at http://www.icnet.pl/download/ip_vs_dr-conntrack.patch
Signed-off-by: Janusz Krzysztofik <[email protected]> ================================================ --- linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_core.c.orig 2006-06-18 03:49:35.000000000 +0200 +++ linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_core.c 2006-10-21 21:38:20.000000000 +0200 @@ -672,6 +672,9 @@ static int ip_vs_out_icmp(struct sk_buff if (!cp) return NF_ACCEPT; + if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_DROUTE) + return NF_ACCEPT; + verdict = NF_DROP; if (IP_VS_FWD_METHOD(cp) != 0) { @@ -801,6 +804,9 @@ ip_vs_out(unsigned int hooknum, struct s return NF_ACCEPT; } + if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_DROUTE) + return NF_ACCEPT; + IP_VS_DBG_PKT(11, pp, skb, 0, "Outgoing packet"); if (!ip_vs_make_skb_writable(pskb, ihl)) --- linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_xmit.c.orig 2006-06-18 03:49:35.000000000 +0200 +++ linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_xmit.c 2006-10-21 21:22:56.000000000 +0200 @@ -127,7 +127,6 @@ ip_vs_dst_reset(struct ip_vs_dest *dest) #define IP_VS_XMIT(skb, rt) \ do { \ - (skb)->ipvs_property = 1; \ (skb)->ip_summed = CHECKSUM_NONE; \ NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, (skb), NULL, \ (rt)->u.dst.dev, dst_output); \ @@ -278,6 +277,7 @@ ip_vs_nat_xmit(struct sk_buff *skb, stru /* Another hack: avoid icmp_send in ip_fragment */ skb->local_df = 1; + skb->ipvs_property = 1; IP_VS_XMIT(skb, rt); LeaveFunction(10); @@ -411,6 +411,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, s /* Another hack: avoid icmp_send in ip_fragment */ skb->local_df = 1; + skb->ipvs_property = 1; IP_VS_XMIT(skb, rt); LeaveFunction(10); @@ -542,6 +543,7 @@ ip_vs_icmp_xmit(struct sk_buff *skb, str /* Another hack: avoid icmp_send in ip_fragment */ skb->local_df = 1; + skb->ipvs_property = 1; IP_VS_XMIT(skb, rt); rc = NF_STOLEN; |