36. LVS: Newer networking tools: Policy Routing

36.1. Introduction

The standard network tools (e.g. ifconfig, route and netstat) aren't capable of setting up some of the features used in newer LVSs, e.g. routing based on src_addr. For this we use iproute2, which allows routing based on almost any of the parameters of a packet (src, dest, proto, tos...). iproute2 is available at iproute2-current.tar.gz. iproute2 implements functionality similar to Cisco's IOS.

If there is only one possible route for packets, then ifconfig and route are just fine. If multiple routes exist then iproute2 is needed.

Presumably routing in Linux and the setup of LVS will move more toward using iproute2. The configure script will use the iproute2 package to do some configuration if you have it installed.

iproute2 uses labels instead of ip_aliases (e.g. eth0:110). ip_tables is based on the same underlying code and also requires labels to recognise ip_aliases. If you want to see the network as ip_tables sees it, you need the iproute2 tools.

iproute2 is not compatible with ifconfig, route and netstat. The entries added by the iproute2 tools are not seen by ifconfig/route etc., and the output of ifconfig/route etc. will be incorrect. You can't tell from looking at the output of ifconfig/route whether iproute2 commands have been run - you just have to know. The iproute2 tools correctly interpret the results of ifconfig/route commands and will give the correct state of the network.

Unfortunately the user interface to iproute2 is not easy.

  • The documentation is not easy to read (although it was all Julian needed).

    Ratz suggested "Policy Routing Using Linux" by Matthew G. Marsh, Pub Sams 2001, ISBN 0-672-32052-5, to get you started (it helped me). (Oct 2002) Ratz has just found that the book is also online.

    Padraig Brady padraig (at) antefactor (dot) com suggests Linux Advanced Routing and Traffic Control HOWTO.

  • The output from the commands is difficult to parse (see the comments in the configure script for more details) - i.e. it's not machine readable. If the route is 0/0 then it is not listed in the output and the next output item shifts one field. This means that you have to know the route before you can parse the output. Ratz is developing a wrapper for iproute2 that will give machine readable output. To have a command line utility which is not machine readable is intolerable.

There are other problems

Joe, Dec 2003

The latest on Alexey's ftp site is 2.4.7 from Jan 2002. Is this really the latest?

Alejandro Mery amery (at) geeks (dot) cl 24 Dec 2003

2.4.7-now-ss010824 is the official latest 'stable', but Bert Hubert (ahu) (Bert's website, http://ds9a.nl/) from lartc.org had an 'almost-branch' with some fixes and improvements, dated 2002-10-20. Bert's code is downloadable at http://ds9a.nl/cgi-bin/viewcvs.cgi/iproute2-ahu/iproute2-ahu.tar.gz?tarball=1&only_with_tag=HEAD . Sadly both Bert's and Alexey's code are unmaintained.

36.2. Policy Routing and ifconfig

Example:

In a normally functioning LVS-DR, with routing set up by "route", the realservers will be sending packets with the following routing

  • src_addr=VIP dest_addr=0/0. dest=0/0 - route via default gw
  • src_addr=RIP dest_addr=RIP network. dest=RIP network - route to RIP network

In LVS-DR a packet leaving the realserver can exit via the default gw or the director. In the standard setup, packets with dst_addr=RIPnetwork are put onto the local network and all other packets are sent to the default gw.

If instead the routing is set up by "iproute2", packets with src_addr=VIP are sent to the default gw, while packets with src_addr=RIP are put onto the local network. The realservers will be sending packets with the following routing

  • src_addr=VIP dest_addr=0/0. src=VIP - route via default gw
  • src_addr=RIP dest_addr=RIP network. src=RIP - route to RIP network

The result for a normally working LVS will be the same (i.e. the LVS will still work). However with the standard setup, packets with src_addr=RIP cannot get to the outside world (the director does not have a default route to 0/0). If a process needs this (e.g. the operator needs to telnet out, or the realserver needs DNS), then those packets from the RIP can be NAT'ed out via the director (or you can set up the realservers as if they are part of a 3 Tier LVS). For security, all packets from the VIP have to go out the default gw (including any to, say, the DIP, which will be dropped by rules on the default gw, to prevent spoofing).

  • src_addr=VIP dest_addr=RIP network. src=VIP - route via default gw, will be dropped
  • src_addr=RIP dest_addr=0/0. src=RIP - route to RIP network. If the director has the correct NAT rules, then these packets can pass to the outside world.
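The source-based routing described above could be sketched with iproute2 roughly as follows. This is only a sketch under stated assumptions, not the setup produced by the configure script: all addresses, the interface name eth0, the table numbers 100 and 101 and the function name print_src_routing are made up for illustration. The function only prints the commands, so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical values for illustration only; substitute your own.
VIP=192.168.1.110          # assumed VIP on the realserver
RIP=192.168.1.2            # assumed RIP
RIP_NET=192.168.1.0/24     # assumed RIP network
DEFAULT_GW=192.168.1.254   # assumed default gw (router)
DIP=192.168.1.1            # assumed DIP on the director

# Print (do not execute) ip commands that route by source address.
print_src_routing() {
    # src_addr=VIP: everything goes out via the default gw (table 100)
    echo "ip rule add from $VIP table 100"
    echo "ip route add default via $DEFAULT_GW dev eth0 table 100"
    # src_addr=RIP: local delivery on the RIP network, the rest via the
    # director, which must have NAT rules for these packets (table 101)
    echo "ip rule add from $RIP table 101"
    echo "ip route add $RIP_NET dev eth0 table 101"
    echo "ip route add default via $DIP dev eth0 table 101"
}
```

Calling print_src_routing lists the five commands; once they look right for your network, "print_src_routing | sh" (as root) would apply them.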

Lawrence Strydom laurie (at) midafrica (dot) com 26 May 2003

Is it possible to set up heartbeat between a Linux and a Windose box? The MS box will be the master node and the Linux box will provide redundancy. (Don't ask! It is what the client wants.)

Horms

It should be theoretically possible to run heartbeat on Windows. But to my knowledge no one has done this in the past. The heartbeat code is reasonably portable (between different Unix-like operating systems) but it is likely that you will need to do quite a lot of work to get it to compile and work correctly on Windows. I have no experience with using cygwin so I can't comment any further than that.

36.3. Various debugging techniques for routes

(with Julian)

(I needed this information to setup a one-net LVS-NAT LVS. However since it is about routing and not LVS specifically, maybe I should move it elsewhere.)

The routes added with route go into the kernel FIB (Forwarding Information Base) route table. The contents are displayed with route (or netstat -r).

Following an icmp redirect, the route updates go into the kernel's route cache (route -C).

You can flush the route cache with

	echo 1 > /proc/sys/net/ipv4/route/flush
or
	ip route flush cache

Here's the route cache on the realserver before any packets are sent.

realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref    Use Iface
realserver      director        director              0      1        0 eth0
director        realserver      realserver      il    0      0        9 lo

With icmp redirects enabled on the director, repeatedly running traceroute to the client shows the routes changing from 2 hops to 1 hop. This indicates that the realserver has received an icmp redirect packet telling it of a better route to the client.

realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.932 ms  0.562 ms  0.503 ms
 2  client (192.168.1.254)  1.174 ms  0.597 ms  0.571 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.72 ms  0.581 ms  0.532 ms
 2  client (192.168.1.254)  0.845 ms  0.559 ms  0.5 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  client (192.168.1.254)  0.69 ms *  0.579 ms

Although route shows no change in the FIB, the route cache has changed. (The new route of interest is bracketed by >< signs in the table below.)

 realserver:/etc/rc.d# route -C
 Kernel IP routing cache
 Source          Destination     Gateway         Flags Metric Ref    Use Iface
 client          realserver      realserver      l     0      0        8 lo
 realserver      realserver      realserver      l     0      0     1038 lo
 realserver      director        director              0      1      138 eth0
>realserver      client          client                0      0        6 eth0<
 director        realserver      realserver      l     0      0        9 lo
 director        realserver      realserver      l     0      0      168 lo

Packets to the client now go directly to the client instead of via the director (which you don't want).

It takes about 10 mins for the cached route to the client to expire (experimental result). The timeouts may be in /proc/sys/net/ipv4/route/gc_*, but their location and values are well encrypted in the sources :) (some more info from Alexey at LVS archives)
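To see what values the gc_* files currently hold, something like the following sketch could be used. The function name show_route_gc is made up here, and the source does not say which of these values (if any) governs the expiry observed above. Note also that the IPv4 route cache was removed in kernel 3.6, so this is only relevant to the older kernels discussed in this section.

```shell
#!/bin/sh
# Print every readable gc_* tunable in a directory as "name = value".
# The directory defaults to the location mentioned in the text.
show_route_gc() {
    dir=${1:-/proc/sys/net/ipv4/route}
    for f in "$dir"/gc_*; do
        # skip unreadable files and the unexpanded pattern (no match)
        [ -r "$f" ] || continue
        printf '%s = %s\n' "$(basename "$f")" "$(cat "$f")"
    done
}
```

Run show_route_gc with no argument on the realserver to list the values before and after experimenting with them.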

Here's the route cache after 10 mins.

realserver:/etc/rc.d# route -C
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref    Use Iface
realserver      realserver      realserver      l     0      0     1049 lo
realserver      director        director              0      1      139 eth0
director        realserver      realserver      l     0      0        0 lo
director        realserver      realserver      l     0      0      236 lo

There are no routes to the client anymore. Checking with traceroute shows that 2 hops are initially required to get to the client (i.e. the routing cache has reverted to using the director as the route to the client). After 2 iterations, icmp redirects route the packets directly to the client again.

realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.908 ms  0.572 ms  0.537 ms
 2  client (192.168.1.254)  1.179 ms  0.6 ms  0.577 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.695 ms  0.552 ms  0.492 ms
 2  client (192.168.1.254)  0.804 ms  0.55 ms  0.502 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  client (192.168.1.254)  0.686 ms  0.533 ms *

Now turn off icmp redirects on the director.

director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects

Checking routes on the realserver -

realserver:/etc/lvs# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0

Nothing has changed here.

Flush the kernel route cache and then display it -

realserver:/etc/lvs# ip route flush cache
realserver:/etc/lvs# route -C
Kernel IP routing cache
Source          Destination     Gateway         Flags Metric Ref    Use Iface
realserver      director        director              0      1        0 eth0
director        realserver      realserver      l     0      0        1 lo

There are now no routes to the client.

Now when you send packets to the client, the route stays via the director, needing 2 hops to get to the client. There are no one hop routes to the client.

realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.951 ms  0.56 ms  0.491 ms
 2  client (192.168.1.254)  0.76 ms  0.599 ms  0.574 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.696 ms  0.562 ms  0.583 ms
 2  client (192.168.1.254)  0.62 ms  0.603 ms  0.576 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.692 ms *  0.599 ms
 2  client (192.168.1.254)  0.667 ms  0.603 ms  0.579 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.689 ms  0.558 ms  0.487 ms
 2  client (192.168.1.254)  0.61 ms  0.63 ms  0.567 ms
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.705 ms  0.563 ms  0.526 ms
 2  client (192.168.1.254)  0.611 ms  0.595 ms *
realserver:/etc/rc.d# traceroute client
traceroute to client (192.168.1.254), 30 hops max, 40 byte packets
 1  director (192.168.1.9)  0.706 ms  0.558 ms  0.535 ms
 2  client (192.168.1.254)  0.614 ms  0.593 ms  0.573 ms

The kernel route cache

 realserver:/etc/rc.d# route -C
 Kernel IP routing cache
 Source          Destination     Gateway         Flags Metric Ref    Use Iface
 client          realserver      realserver      l     0      0       17 lo
 realserver      realserver      realserver      l     0      0        2 lo
 realserver      director        director              0      1        0 eth0
>realserver      client          director              0      0       35 eth0<
 director        realserver      realserver      l     0      0       16 lo
 director        realserver      realserver      l     0      0       63 lo

shows that the only route to the client (labelled with ><) is via the director.

For send_redirects, what's the difference between all, default and eth0?

Julian

see the LVS archives

When the kernel needs to check for a feature (e.g. send_redirects) it uses calls like:

if (IN_DEV_TX_REDIRECTS(in_dev)) ...

These macros are defined in /usr/src/linux/include/linux/inetdevice.h

The macro returns a value using expression from all/<var> and <dev>/<var>. So, these macros check for example for: all/send_redirects || eth0/send_redirects or all/hidden && eth0/hidden.

When you create eth0 for the first time using ifconfig eth0 ... up, default/send_redirects is copied to eth0/send_redirects internally by the kernel, i.e. default/ contains the initial values the device inherits when it is created. This is the safest way for a device to appear with correct conf/<dev>/ values.

When we put a value in all/<var>, you can assume that we set the <var> for all devices in this way:

                all/<var>       the macro returns:
for &&          0               0
for &&          1               the value from <dev>/<var>
for ||          0               the value from <dev>/<var>
for ||          1               1

This scheme allows the different devices to have different values for their vars. e.g. if we set all/send_redirects to 0, the 3rd line applies, i.e. the result from the macro is the real value in <dev>/send_redirects. If we set all/send_redirects to 1, then according to the 4th line the macro always returns 1 regardless of <dev>/send_redirects.
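The table above can be written out as two small shell functions. This is only an illustration of the logic Julian describes, not kernel code; the function names are made up, and which variables use || and which use && has to be looked up in inetdevice.h for your kernel.

```shell
#!/bin/sh
# effective_or ALL DEV: result for an "||" variable (e.g. send_redirects)
effective_or() {
    if [ "$1" -ne 0 ] || [ "$2" -ne 0 ]; then echo 1; else echo 0; fi
}

# effective_and ALL DEV: result for an "&&" variable (e.g. hidden)
effective_and() {
    if [ "$1" -ne 0 ] && [ "$2" -ne 0 ]; then echo 1; else echo 0; fi
}

# Effective send_redirects for device $1, read from /proc.
# Assumes both files exist, i.e. the device is already configured.
send_redirects_effective() {
    conf=/proc/sys/net/ipv4/conf
    effective_or "$(cat "$conf/all/send_redirects")" \
                 "$(cat "$conf/$1/send_redirects")"
}
```

For example, "effective_or 0 1" prints 1: with all/send_redirects=0 the device's own value decides. This is why the transcript above sets all/, default/ and eth0/ to 0: for an || variable, redirects stay on if either all/ or eth0/ is still 1.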

How to debug/understand TCP/IP packets?

Julian

The RFC documents http://www.ietf.cnri.reston.va.us/rfc.html are your friends. The numbers you need:

793     TRANSMISSION CONTROL PROTOCOL
1122    Requirements for Internet Hosts -- Communication Layers
1812    Requirements for IP Version 4 Routers
826     An Ethernet Address Resolution Protocol

For tcpdump, see man tcpdump.

for Microsoft NT _server_

Steve (dot) Gonczi (at) networkengines (dot) com

there is a uSoft supplied packet capture utility as well.

also - W. Richard Stevens: TCP-IP Illustrated, Vol 1, a good intro into packet layouts and protocol basics. (anything by Stevens is good - Joe).

Ivan Figueredo idf (at) weewannabe (dot) com

for windump - http://netgroup-serv.polito.it/windump/

36.4. checking source routed packets

Packets leaving a LVS-DR realserver can have src_addr=VIP or src_addr=RIP. If the default gw is different for each packet, it would be nice to have a command line testing tool like ping or traceroute to test the route. The normal tools will create packets with src_addr=RIP and you won't be able to test the packets with src_addr=VIP.

Roberto Nibali ratz (at) tac (dot) ch 22 May 2001

maybe hping can help you.

Joe

Ah, the file hping2.8 is the man page i.e. {hping2}.8 - I thought it was v2.8 of hping.

How about:

ip route get $IP?
didn't know about "get". yes that works. It's like a -C with iptables. I'd still like to send a packet and see where it goes rather than getting an answer about where it is expected to go.

Julian

Not possible with src interface "lo" but possible with source address configured in "lo". Oh yes, "source interface" for some tools means "get one address from this iface and use it". In most of the cases these tools don't do the Right Thing.

from iproute2

$ ping -I src dst
or
arping -I if -s src dst
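Putting the two suggestions together, a test for one source address could look like the sketch below. The function name check_src_route is made up, and it assumes the source address is configured on the local machine ("ip route get ... from ..." without an input interface is only meaningful for a local address, as far as I know). ip route get only reports where the kernel expects to send the packet; ping -I then sends real packets from that address, which can be watched with tcpdump on the gateways.

```shell
#!/bin/sh
# check_src_route SRC DST
# 1. ask the kernel which route a packet from SRC to DST would take
# 2. send three real echo requests with SRC as the source address
check_src_route() {
    src=$1
    dst=$2
    ip route get "$dst" from "$src" || return 1
    ping -c 3 -I "$src" "$dst"
}
```

On an LVS-DR realserver you would run it once with the VIP and once with the RIP as SRC and compare the gateways reported.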

36.5. handling arp problem with iproute2

See Julian's notes and patches to handle the arp problem with iproute2 (this is somewhat developmental).

36.6. ip commands you mightn't know about

36.6.1. ip route get

from Julian

This will look at the routing tables and tell you the route to xxx.xxx.xxx.xxx

ip route get xxx.xxx.xxx.xxx

36.6.2. ip route append

If you already have a route from A to B and want to add another, you can't do it with ip route add; you have to append the extra route with ip route append.

dynnema dynnema (at) yahoo (dot) com Mar 22 2002

Lets say I got one RS and two NAT DIRs.

 RS:
 RIP1:   192.168.1.2/24 dev eth0
 RIP2    192.168.2.2/24 dev eth0:10

 DIR1:
 VIP:    x.x.x.69        eth0:110
 DIP     192.168.1.1

 DIR2:
 VIP:    x.x.x.70        eth0:110
 DIP     192.168.2.1

I add the first route

ip route add src 192.168.1.2 via 192.168.1.1

but then I can't add the second route:

ip route add src 192.168.2.2 via 192.168.2.1
"RTNETLINK answers: File exists"

Careful reading of IProute mailing list was very useful. It should be

ip route append src 192.168.2.2 via 192.168.2.1
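As far as I understand the ip route syntax, a command with no destination prefix refers to the default route, and src only sets the preferred source address of that route (it does not select packets by their source address; that is what ip rule is for). That would explain why the second add collides with the first. A script that doesn't know whether a route already exists could fall back to append, as in this sketch. The function name is made up, and the fallback is crude: it fires on any failure of add, not just on "File exists".

```shell
#!/bin/sh
# route_add_or_append ARGS...
# Try "ip route add ARGS"; if that fails (e.g. "File exists"),
# retry the same route with "ip route append".
route_add_or_append() {
    ip route add "$@" 2>/dev/null || ip route append "$@"
}
```

With this, "route_add_or_append src 192.168.2.2 via 192.168.2.1" would cover the second route in dynnema's example.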

36.7. Ratz's corrections on common iproute2 misconceptions

Ratz 25 Nov 2003

  • the basic problem with route/netstat -rn is, that they only see the main table, which is rather limited.
  • iproute2 very well knows the notion of ip aliases by using labels just like ifconfig. It's not up to the tool to decide if labels work or not. The misconception people have with ip aliasing is that people think an aliased interface is a logically separated interface while it is _not_. And this is the case since 2.1.128 or so.
  • ipchains doesn't recognize aliases either, because we moved to the iproute2 architecture with the _2.2.x_ kernel, not with 2.4.x as the howto says. Packet filtering on aliases stopped working after the decay of ipfwadm in the old 2.0.x kernel days. Today you can still filter on so-called ip aliases, but as the name implies you specify the IP ADDRESS as a classifier, and if you want to restrict it you add the underlying _physical_ interface definition to the classifying rule.
  • iproute2 is compatible with ifconfig/route/netstat but not vice versa. The two biggest issues people new to iproute2 have to struggle with are:

    • if you add secondary ip addresses without a label (alias interface) ifconfig is confused and doesn't print the information
    • if you add rules for branching into routing tables other than the main routing table, route or netstat -rn will not show you those routes. This is also the case for blackhole, throw, unreachable and prohibit routes.
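To see the routes that route and netstat -rn miss, you can walk through the tables named by the policy rules. The sketch below extracts the table names from "ip rule show" style output; the function name rule_tables is made up, and it assumes each rule line has the form "... lookup TABLE" as printed by the ip versions I have seen.

```shell
#!/bin/sh
# Read "ip rule show" style text on stdin and print each routing
# table that a rule branches into, once, in sorted order.
rule_tables() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "lookup") print $(i + 1) }' |
        LC_ALL=C sort -u
}
```

A typical use would be

ip rule show | rule_tables | while read -r t; do echo "== table $t"; ip route show table "$t"; done

(ip route show table all prints everything in one go, but without the grouping by table.)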

36.8. Ratz's wrappers (for iproute2)

One of the problems with the iproute2 utils is that the syntax is not machine readable (and difficult for humans too). Ratz has built some wrappers around these utils.

Ratz 25 Nov 2003

If you guys are interested I'll offer my first semi-official release of some of the replacement tools I've written for ifconfig/route. You can download them from Ratz's wrappers (http://www.drugphish.ch/~ratz/iproute2/).

It's still not really scriptable (I wrote it with really gross bash constructs and by using external tools ;). BUT, it solves some of architectural principles, such as separation of concern, correctness, flexibility, conceptional integrity, coupling and cohesion! You are given two tools to maintain almost everything network related. (I'm aware that iptables/netfilter and mii-tool, ethtool are also network related)

ifconfig gives you the (wrong) impression that eth0:0 is an interface, just like the others in the output of ifconfig -a. This is not true. The iproute2 tools correctly display the relationship between aliases/labels and their corresponding physical interface.

Example:

laphish:~ # ifconfig -a | grep -A2 eth0
eth0      Link encap:Ethernet  HWaddr 00:20:E0:68:71:3A
           inet addr:172.23.2.131  Bcast:172.23.255.255  Mask:255.255.0.0
           inet6 addr: fe80::220:e0ff:fe68:713a/64 Scope:Link
--
eth0:0    Link encap:Ethernet  HWaddr 00:20:E0:68:71:3A
           inet addr:10.98.43.233  Bcast:10.98.43.255  Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
--
eth0:foo  Link encap:Ethernet  HWaddr 00:20:E0:68:71:3A
           inet addr:10.23.7.233  Bcast:10.23.7.255  Mask:255.255.255.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
laphish:~ #

Sure, one could argue that all HWaddr of those "interfaces" are the same and thus something with the interpretation of them being _real_ physical interfaces must be fishy. But it gives you the wrong idea of connection or entity relationship between link and ip layer.

Now let's compare the same output for iproute2:

laphish:~ # ip addr show dev eth0
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
     link/ether 00:20:e0:68:71:3a brd ff:ff:ff:ff:ff:ff
     inet 172.23.2.131/16 brd 172.23.255.255 scope global eth0
     inet 10.23.7.233/24 brd 10.23.7.255 scope global eth0:foo
     inet 10.98.43.233/24 brd 10.98.43.255 scope global eth0:0
     inet 10.239.10.1/24 brd 10.239.10.255 scope global eth0
     inet6 fe80::220:e0ff:fe68:713a/64 scope link
laphish:~ #

As you can see we have a physical interface (link layer entity) called eth0, and associated with this interface we have 5 (not 4 like with ifconfig) IP addresses. And you can certainly spot the labels, which ifconfig displayed as independent interfaces, at the end of each line starting with inet, right?

Plus, you will certainly have noted that in the second output we have one additional address which was not shown in the ifconfig output but is very well routable and _is_ a valid configuration. I simply didn't want to put an alias there.
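Since the label is the last field of each inet line, the address/label pairs in output of this form can be pulled out with a one-line awk filter. This is only a sketch for the output format shown above (the function name addr_labels is made up); other ip versions may print extra fields or lines, so check it against your own output.

```shell
#!/bin/sh
# Read "ip addr show" style text on stdin and print
# "address/prefix label" for every IPv4 (inet) address.
# An address added without a label shows the plain device name.
addr_labels() {
    awk '$1 == "inet" { print $2, $NF }'
}
```

For example, "ip addr show dev eth0 | addr_labels" on the machine above would list 10.239.10.1/24 with eth0 as its label, the same as the primary address.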

Tools like ipchains and iptables and their underlying state machine are better off matching on ip addresses and the _one_ physical interface those are attached to than trying to fiddle around with a label that is optional and doesn't give you really valuable information. Additionally, with iproute2 you have a better approach to conceptional integrity, which is one of the key ingredients of architectures, in that you say that even if I have multiple addresses for one interface I still send out the packet through the physical interface and not through a labeled, aliased or virtual interface.

ifconfig is an example of a "hiding complexity" tool. Hiding complexity is a concept the software industry has not yet adopted to the extent that we can trust it, and thus ifconfig is broken by design.

The reasons why people still use those deprecated tools are:

  • All other Unices still have those and it worked reliably for 10+ years
  • Most Linux Distributors except (notably) SuSE haven't switched their network setup to the iproute2 concept yet.
  • Documentation was seriously lacking and the tools are ... uhmm to a certain degree complex and incoherent in their syntax and semantics.