The standard network tools (e.g. ifconfig, route and netstat) aren't capable of setting up some of the features used in newer LVSs, e.g. routing based on src_addr. For this we use iproute2, which allows routing based on almost any of the parameters of a packet (src, dest, proto, tos...). iproute2 is available at iproute2-current.tar.gz. iproute2 implements functionality similar to Cisco's IOS.
If there is only one possible route for packets, then ifconfig and route are just fine. If multiple routes exist then iproute2 is needed.
Presumably routing in Linux and the setup of LVS will move more toward using iproute2. The configure script will use the iproute2 package to do some configuration if you have it installed.
iproute2 uses labels instead of ip_aliases (e.g. eth0:110). ip_tables is based on the same underlying code and also requires labels to recognise ip_aliases. If you want to see the network as ip_tables sees it, you need the iproute2 tools.
iproute2 is not compatible with ifconfig, route and netstat. The entries added by the iproute2 tools are not seen by ifconfig/route etc., so the output of ifconfig/route etc. will be incorrect. You can't tell from looking at the output of ifconfig/route whether iproute2 commands have been run; you just have to know. The iproute2 tools correctly interpret the results of ifconfig/route commands and will give the correct state of the network.
Unfortunately the user interface to iproute2 is not easy.
The documentation is not easy to read (although it was all Julian needed).
Ratz suggested "Policy Routing Using Linux" by Matthew G. Marsh, Pub Sams 2001, ISBN 0-672-32052-5, to get you started (it helped me). (Oct 2002) Ratz has just found that the book is also online
Padraig Brady padraig (at) antefactor (dot) com suggests Linux Advanced Routing and Traffic Control HOWTO.
The output from the commands is difficult to parse (see the comments in the configure script for more details) - i.e. it's not machine readable. If the route is 0/0 then it is not listed in the output and the next output item shifts one field. This means that you have to know the route before you can parse the output. Ratz is developing a wrapper for iproute2 that will give machine readable output. To have a command line utility which is not machine readable is intolerable.
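One way around the shifting-field problem is to look fields up by keyword rather than by position. The following is only a sketch: the function name parse_ip_route is made up for illustration, it assumes ordinary unicast routes whose first field is the destination, and it is not Ratz's wrapper.

```shell
# parse_ip_route (hypothetical helper): read "ip route show"-style lines on
# stdin and print one "dest|gateway|dev|src" record per route.
# A field that is absent from the input line is printed as "-".
parse_ip_route() {
    awk '
    {
        dest = $1; via = "-"; dev = "-"; src = "-"
        # After the destination, find values by their keyword, so a
        # missing field cannot shift the ones that follow it.
        for (i = 2; i < NF; i++) {
            if ($i == "via") via = $(i + 1)
            else if ($i == "dev") dev = $(i + 1)
            else if ($i == "src") src = $(i + 1)
        }
        print dest "|" via "|" dev "|" src
    }'
}
```

Typical use would be "ip route show | parse_ip_route". Route types that put a keyword such as "unreachable" in front of the destination are not handled by this sketch.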
There are other problems
Joe, Dec 2003
The latest on Alexey's ftp site is 2.4.7 from Jan 2002. Is this really the latest?
Alejandro Mery amery (at) geeks (dot) cl 24 Dec 2003
2.4.7-now-ss010824 is the official latest 'stable' but Bert Hubert (ahu) (Bert's website, http://ds9a.nl/) from lartc.org had an 'almost-branch' with some fixes and improvements with the date 2002-10-20. Bert's code is downloadable at http://ds9a.nl/cgi-bin/viewcvs.cgi/iproute2-ahu/iproute2-ahu.tar.gz?tarball=1&only_with_tag=HEAD . Sadly both Bert's and Alexey's code are unmaintained.
Example:
In a normally functioning LVS-DR, with routing setup by "route" the realservers will be sending packets with the following routing
- src_addr=VIP dest_addr=0/0. dest=0/0 - route via default gw
- src_addr=RIP dest_addr=RIP network. dest=RIP network - route to RIP network
In LVS-DR a packet leaving the realserver can exit via the default gw or the director. In the standard setup, packets with dest_addr=RIP network are put onto the local network and all other packets are sent to the default gw.
If instead the routing is setup by "iproute2", packets with src_addr=VIP are sent to the default gw, while packets with src_addr=RIP are put onto the local network. The realservers will be sending packets with the following routing
- src_addr=VIP dest_addr=0/0. src=VIP - route via default gw
- src_addr=RIP dest_addr=RIP network. src=RIP - route to RIP network
The result for a normally working LVS will be the same (i.e. the LVS will still work). However with the standard setup, packets with src_addr=RIP cannot get to the outside world (the director does not have a default route to 0/0). If a process needs this (e.g. the operator needs to telnet out, or the realserver needs DNS), then those packets from the RIP can be NAT'ed out via the director (or you can set up the realservers as if they are part of a 3 Tier LVS). For security, all packets from the VIP have to go out the default gw (including any to, say, the DIP, which will be dropped by rules on the default gw, to prevent spoofing).
- src_addr=VIP dest_addr=RIP network. src=VIP - route via default gw, will be dropped
- src_addr=RIP dest_addr=0/0. src=RIP - route to RIP network. If the director has the correct NAT rules, then these packets can pass to the outside world.
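The source-based half of this setup can be sketched with iproute2 policy routing. Everything concrete here is an assumption for illustration: the addresses, the device, the table number 100 and the helper names are not from the setup above. The commands are only printed, so they can be reviewed before being applied.

```shell
# Illustrative values only; substitute your own.
VIP=192.168.1.110
DEFAULT_GW=192.168.1.1
DEV=eth0
TABLE=100   # any unused routing table number

# emit prints a command instead of running it; pipe the output of
# vip_source_rules to "sh" as root to actually apply the rules.
emit() { echo "$@"; }

vip_source_rules() {
    # table $TABLE holds a single default route via the default gw
    emit ip route add default via "$DEFAULT_GW" dev "$DEV" table "$TABLE"
    # packets whose source address is the VIP consult that table
    # before the main table
    emit ip rule add from "$VIP" table "$TABLE"
}
```

Packets with src_addr=RIP match no rule here and fall through to the main routing table, so where they go (local network, or out via the director if it has NAT rules) is decided by the routes already set up there.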
Lawrence Strydom laurie (at) midafrica (dot) com 26 May 2003
Is it possible to set up heartbeat between a Linux and a Windose box? The MS box will be the master node and the Linux box will provide redundancy. (Don't ask! It is what the client wants.)
Horms
It should be theoretically possible to run heartbeat on Windows. But to my knowledge no one has done this in the past. The heartbeat code is reasonably portable (between different Unix-like operating systems) but it is likely that you will need to do quite a lot of work to get it to compile and work correctly on Windows. I have no experience with using cygwin so I can't comment any further than that.
(with Julian)
(I needed this information to setup a one-net LVS-NAT LVS. However since it is about routing and not LVS specifically, maybe I should move it elsewhere.)
The routes added with route go into the kernel FIB (Forwarding Information Base) route table. The contents are displayed with route (or netstat -r).
Following an icmp redirect, the route updates go into the kernel's route cache (route -C).
You can flush the route cache with
    echo 1 > /proc/sys/net/ipv4/route/flush

or

    ip route flush cache
Here's the route cache on the realserver before any packets are sent.
    realserver:/etc/rc.d# route -C
    Kernel IP routing cache
    Source          Destination     Gateway         Flags Metric Ref    Use Iface
    realserver      director        director              0      1        0 eth0
    director        realserver      realserver      il    0      0        9 lo
With icmp redirects enabled on the director, repeatedly running traceroute to the client shows the routes changing from 2 hops to 1 hop. This indicates that the realserver has received an icmp redirect packet telling it of a better route to the client.
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.932 ms  0.562 ms  0.503 ms
     2  client (192.168.1.254)  1.174 ms  0.597 ms  0.571 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.72 ms  0.581 ms  0.532 ms
     2  client (192.168.1.254)  0.845 ms  0.559 ms  0.5 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  client (192.168.1.254)  0.69 ms  *  0.579 ms
Although route shows no change in the FIB, the route cache has changed. (The new route of interest is bracketed by >< signs in the table below.)
    realserver:/etc/rc.d# route -C
    Kernel IP routing cache
    Source          Destination     Gateway         Flags Metric Ref    Use Iface
    client          realserver      realserver      l     0      0        8 lo
    realserver      realserver      realserver      l     0      0     1038 lo
    realserver      director        director              0      1      138 eth0
   >realserver      client          client                0      0        6 eth0<
    director        realserver      realserver      l     0      0        9 lo
    director        realserver      realserver      l     0      0      168 lo
Packets to the client now go directly to the client instead of via the director (which you don't want).
It takes about 10 minutes for the cached route to the client to expire (experimental result). The timeouts may be in /proc/sys/net/ipv4/route/gc_*, but their location and values are well encrypted in the sources :) (some more info from Alexey at LVS archives)
Here's the route cache after 10 minutes.
    realserver:/etc/rc.d# route -C
    Kernel IP routing cache
    Source          Destination     Gateway         Flags Metric Ref    Use Iface
    realserver      realserver      realserver      l     0      0     1049 lo
    realserver      director        director              0      1      139 eth0
    director        realserver      realserver      l     0      0        0 lo
    director        realserver      realserver      l     0      0      236 lo
There are no routes to the client anymore. Checking with traceroute, shows that 2 hops are initially required to get to the client (i.e. the routing cache has reverted to using the director as the route to the client). After 2 iterations, icmp redirects route the packets directly to the client again.
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.908 ms  0.572 ms  0.537 ms
     2  client (192.168.1.254)  1.179 ms  0.6 ms  0.577 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.695 ms  0.552 ms  0.492 ms
     2  client (192.168.1.254)  0.804 ms  0.55 ms  0.502 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  client (192.168.1.254)  0.686 ms  0.533 ms  *
Now turn off icmp redirects on the director.
    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
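To confirm what the director is now set to, the send_redirects entries can be listed for every device at once. This is a minimal sketch; the function name show_send_redirects is made up, and the optional directory argument exists only so the function can be tried against a copy of the tree.

```shell
# show_send_redirects [conf_dir] (hypothetical helper)
# Print "<name>=<value>" for each send_redirects file under conf_dir,
# which defaults to /proc/sys/net/ipv4/conf (all, default, eth0, ...).
show_send_redirects() {
    conf_dir=${1:-/proc/sys/net/ipv4/conf}
    for f in "$conf_dir"/*/send_redirects; do
        [ -r "$f" ] || continue          # skip unreadable or unmatched entries
        name=${f%/send_redirects}        # strip the file name ...
        name=${name##*/}                 # ... and the leading path
        printf '%s=%s\n' "$name" "$(cat "$f")"
    done
}
```

Run on the director with no argument, every line should end in =0 after the three echo commands above, provided eth0 is the only real interface.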
Checking routes on the realserver -
    realserver:/etc/lvs# netstat -rn
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
    127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
    0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
nothing has changed here.
Flush the kernel route cache and show its contents -
    realserver:/etc/lvs# ip route flush cache
    realserver:/etc/lvs# route -C
    Kernel IP routing cache
    Source          Destination     Gateway         Flags Metric Ref    Use Iface
    realserver      director        director              0      1        0 eth0
    director        realserver      realserver      l     0      0        1 lo
There are now no routes to the client.
Now when you send packets to the client, the route stays via the director, needing 2 hops to get to the client. There are no one-hop packets to the client.
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.951 ms  0.56 ms  0.491 ms
     2  client (192.168.1.254)  0.76 ms  0.599 ms  0.574 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.696 ms  0.562 ms  0.583 ms
     2  client (192.168.1.254)  0.62 ms  0.603 ms  0.576 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.692 ms  *  0.599 ms
     2  client (192.168.1.254)  0.667 ms  0.603 ms  0.579 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.689 ms  0.558 ms  0.487 ms
     2  client (192.168.1.254)  0.61 ms  0.63 ms  0.567 ms
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.705 ms  0.563 ms  0.526 ms
     2  client (192.168.1.254)  0.611 ms  0.595 ms  *
    realserver:/etc/rc.d# traceroute client
    traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
     1  director (192.168.1.9)  0.706 ms  0.558 ms  0.535 ms
     2  client (192.168.1.254)  0.614 ms  0.593 ms  0.573 ms
The kernel route cache
    realserver:/etc/rc.d# route -C
    Kernel IP routing cache
    Source          Destination     Gateway         Flags Metric Ref    Use Iface
    client          realserver      realserver      l     0      0       17 lo
    realserver      realserver      realserver      l     0      0        2 lo
    realserver      director        director              0      1        0 eth0
   >realserver      client          director              0      0       35 eth0<
    director        realserver      realserver      l     0      0       16 lo
    director        realserver      realserver      l     0      0       63 lo
shows that the only route to the client (labelled with ><) is via the director.
For send_redirects, what's the difference between all, default and eth0?
Julian
see the LVS archives
When the kernel needs to check for a feature (e.g. send_redirects) it uses calls like:
    if (IN_DEV_TX_REDIRECTS(in_dev)) ...

These macros are defined in /usr/src/linux/include/linux/inetdevice.h
The macro returns a value using expression from all/<var> and <dev>/<var>. So, these macros check for example for: all/send_redirects || eth0/send_redirects or all/hidden && eth0/hidden.
When you create eth0 for the first time using ifconfig eth0 ... up, the kernel internally copies default/send_redirects to eth0/send_redirects, i.e. default/ contains the initial values the device inherits when it is created. This is the safest way for a device to appear with correct conf/<dev>/ values.
When we put a value in all/<var> you can assume that we set the <var> for all devices in this way:

    all/<var>     the macro returns:

    for && 0      0
    for && 1      the value from <dev>/<var>
    for || 0      the value from <dev>/<var>
    for || 1      1

This scheme allows the different devices to have different values for their vars. e.g. if we set 0 to all/send_redirects, the 3rd line applies, i.e. the result from the macro is the real value in <dev>/send_redirects. If we set 1 to all/send_redirects then, according to the 4th line, the macro always returns 1 regardless of the <dev>/send_redirects.
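The table can be checked with a few lines of shell. This is only an illustration of the combining logic as Julian describes it, not kernel code; the function name effective_conf and its argument order are made up.

```shell
# effective_conf OP ALL DEV (hypothetical helper)
#   OP  - "or" for vars combined with || (e.g. send_redirects),
#         "and" for vars combined with && (e.g. hidden)
#   ALL - the value in all/<var>
#   DEV - the value in <dev>/<var>
# Prints the 0/1 value the macro would return for that device.
effective_conf() {
    op=$1
    all=$2
    dev=$3
    case $op in
        or)
            if [ "$all" -ne 0 ] || [ "$dev" -ne 0 ]; then echo 1; else echo 0; fi
            ;;
        and)
            if [ "$all" -ne 0 ] && [ "$dev" -ne 0 ]; then echo 1; else echo 0; fi
            ;;
        *)
            echo "usage: effective_conf or|and ALL DEV" >&2
            return 2
            ;;
    esac
}
```

For example, "effective_conf or 0 1" prints 1: with all/send_redirects=0, a device whose own send_redirects is 1 still sends redirects, which is why the walkthrough above sets all/, default/ and eth0/ to 0.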
how to debug/understand TCP/IP packets?
Julian
The RFC documents http://www.ietf.cnri.reston.va.us/rfc.html are your friends. The numbers you need:
    793   TRANSMISSION CONTROL PROTOCOL
    1122  Requirements for Internet Hosts -- Communication Layers
    1812  Requirements for IP Version 4 Routers
    826   An Ethernet Address Resolution Protocol

for tcpdump, see man tcpdump.
for Microsoft NT _server_
Steve (dot) Gonczi (at) networkengines (dot) com
there is a uSoft supplied packet capture utility as well.
also -W. Richard Stevens: TCP-IP Illustrated, Vol 1, a good intro into packet layouts and protocol basics. (anything by Stevens is good - Joe).
Ivan Figueredo idf (at) weewannabe (dot) com
for windump - http://netgroup-serv.polito.it/windump/
Packets leaving a LVS-DR realserver can have src_addr=VIP or src_addr=RIP. If the default gw is different for each packet, it would be nice to have a command line testing tool like ping or traceroute to test the route. The normal tools will create packets with src_addr=RIP and you won't be able to test the packets with src_addr=VIP.
Roberto Nibali ratz (at) tac (dot) ch 22 May 2001
maybe hping can help you.
Joe
Ah, the file hping2.8 is the man page i.e. {hping2}.8 - I thought it was v2.8 of hping.
How about:
    ip route get $IP

?
didn't know about "get". yes that works. It's like a -C with iptables. I'd still like to send a packet and see where it goes rather than getting an answer about where it is expected to go.
Julian
Not possible with src interface "lo" but possible with source address configured in "lo". Oh yes, "source interface" for some tools means "get one address from this iface and use it". In most of the cases these tools don't do the Right Thing.
from iproute2
    $ ping -I src dst
    $ arping -I if -s src dst
See Julian's notes and patches to handle the arp problem with iproute2 (this is somewhat developmental).
from Julian
This will look at the routing tables and tell you the route to xxx.xxx.xxx.xxx
    ip route get xxx.xxx.xxx.xxx
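When comparing the routes chosen for src_addr=VIP and src_addr=RIP, it helps to reduce the answer to just the first hop. The helper below is a sketch with a made-up name (first_hop); it assumes the one-line "dst [from src] [via gw] dev ..." layout that ip route get normally prints first. The "from" form in the usage note is an assumption too: it is expected to work only when the source address is configured on the local machine.

```shell
# first_hop (hypothetical helper): read "ip route get" output on stdin and
# print the gateway of the chosen route, or "direct" if there is no "via",
# i.e. the destination is reached on the local network in one hop.
first_hop() {
    awk 'NR == 1 {
        hop = "direct"
        for (i = 1; i < NF; i++)
            if ($i == "via") hop = $(i + 1)
        print hop
    }'
}
```

On a realserver you might run "ip route get $CLIENT from $VIP | first_hop" and "ip route get $CLIENT from $RIP | first_hop" (CLIENT, VIP and RIP being your own addresses) and compare the two answers. This still reports where a packet is expected to go, not where one actually went.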
If you already have a route from A to B, and want to add another, you can't, you have to append the extra route.
dynnema dynnema (at) yahoo (dot) com Mar 22 2002
Let's say I got one RS and two NAT DIRs.

    RS:   RIP1: 192.168.1.2/24 dev eth0
          RIP2  192.168.2.2/24 dev eth0:10
    DIR1: VIP:  x.x.x.69 eth0:110
          DIP   192.168.1.1
    DIR2: VIP:  x.x.x.70 eth0:110
          DIP   192.168.2.1
I add the first route
    ip route add src 192.168.1.2 via 192.168.1.1
but then I can't add the second route:
    ip route add src 192.168.2.2 via 192.168.2.1
    RTNETLINK answers: File exists
Careful reading of IProute mailing list was very useful. It should be
    ip route append src 192.168.2.2 via 192.168.2.1
Ratz 25 Nov 2003
iproute2 is compatible with ifconfig/route/netstat but not vice versa. The two biggest issues people new to iproute2 have to struggle with are:
One of the problems with the iproute2 utils is that the syntax is not machine readable (and difficult for humans too). Ratz has built some wrappers around these utils.
Ratz 25 Nov 2003
If you guys are interested I'll offer my first semi-official release of some of the replacement tools I've written for ifconfig/route. You can download them from Ratz's wrappers http://www.drugphish.ch/~ratz/iproute2/
It's still not really scriptable (I wrote it with really gross bash constructs and by using external tools ;). BUT, it addresses some architectural principles, such as separation of concerns, correctness, flexibility, conceptual integrity, coupling and cohesion! You are given two tools to maintain almost everything network related. (I'm aware that iptables/netfilter and mii-tool, ethtool are also network related.)
ifconfig gives you the (wrong) impression that eth0:0 is an interface, just like the others in the output of ifconfig -a. This is not true. The iproute2 tools correctly display the relationship between aliases/labels and their corresponding physical interface.
Example:
    laphish:~ # ifconfig -a | grep -A2 eth0
    eth0      Link encap:Ethernet  HWaddr 00:20:E0:68:71:3A
              inet addr:172.23.2.131  Bcast:172.23.255.255  Mask:255.255.0.0
              inet6 addr: fe80::220:e0ff:fe68:713a/64 Scope:Link
    --
    eth0:0    Link encap:Ethernet  HWaddr 00:20:E0:68:71:3A
              inet addr:10.98.43.233  Bcast:10.98.43.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
    --
    eth0:foo  Link encap:Ethernet  HWaddr 00:20:E0:68:71:3A
              inet addr:10.23.7.233  Bcast:10.23.7.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
    laphish:~ #
Sure, one could argue that all HWaddr of those "interfaces" are the same and thus something with the interpretation of them being _real_ physical interfaces must be fishy. But it gives you the wrong idea of connection or entity relationship between link and ip layer.
Now let's compare the same output for iproute2:
    laphish:~ # ip addr show dev eth0
    2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
        link/ether 00:20:e0:68:71:3a brd ff:ff:ff:ff:ff:ff
        inet 172.23.2.131/16 brd 172.23.255.255 scope global eth0
        inet 10.23.7.233/24 brd 10.23.7.255 scope global eth0:foo
        inet 10.98.43.233/24 brd 10.98.43.255 scope global eth0:0
        inet 10.239.10.1/24 brd 10.239.10.255 scope global eth0
        inet6 fe80::220:e0ff:fe68:713a/64 scope link
    laphish:~ #
As you can see we have a physical interface (link layer entity) called eth0 and associated with this interface we have 5 (not 4 like with ifconfig) IP addresses. And you can certainly spot the labels, which ifconfig displayed as independent interfaces, at the end of each line starting with inet, right?
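If you want that address-to-label relationship in a form a script can use, the inet lines can be picked out directly. A minimal sketch, assuming the label (or plain device name) is the last field of each inet line as in the output shown here; the function name list_inet_labels is made up.

```shell
# list_inet_labels (hypothetical helper): read "ip addr show" output on
# stdin and print "<address/prefix> <label>" for every IPv4 address.
# inet6 lines carry no label and are skipped.
list_inet_labels() {
    awk '$1 == "inet" { print $2, $NF }'
}
```

Used as "ip addr show dev eth0 | list_inet_labels", an address whose second column is just the device name (eth0) has no alias label at all, which is exactly the case ifconfig fails to show.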
You will also have noted that the second output has one additional address, which was not shown in the ifconfig output but is perfectly routable and _is_ a valid configuration. I simply didn't want to put an alias there.
Tools like ipchains and iptables and their underlying state machine are better off matching on ip addresses and the _one_ physical interface those are attached to than trying to fiddle around with a label that is optional and doesn't give you really valuable information. Additionally, iproute2 gives you a better approach to conceptual integrity, one of the key ingredients of architectures: even if I have multiple addresses on one interface, I still send the packet out through the physical interface and not through a labeled, aliased or virtual interface.
ifconfig is an example of a "hiding complexity" tool. Hiding complexity is a concept the software industry has not yet adopted to the extent that we can trust it, and thus ifconfig is broken by design.
The reasons why people still use those deprecated tools are: