At the end of August 2012, a post-secondary school in Alberta, Canada, migrated its infrastructure to an OpenStack cloud. As luck would have it, within the first day or two of it running, one of its servers just disappeared from the network. Blip. Gone.
After restarting the instance, everything was back up and running. We reviewed the logs and saw that at some point, network communication stopped and then everything went idle. We chalked this up to a random occurrence.
A few nights later, it happened again.
We reviewed both sets of logs. The one thing that stood out the most was DHCP. At the time, OpenStack, by default, set the DHCP lease time to one minute (it's now two minutes). This means that every instance regularly contacts the cloud controller (the DHCP server) to renew its fixed IP. For some reason, this instance could not renew its IP. We correlated the instance's logs with the logs on the cloud controller and put together a conversation:
Instance tries to renew IP.
Cloud controller receives the renewal request and sends a response.
Instance "ignores" the response and resends the renewal request.
Cloud controller receives the second request and sends a new response.
Instance begins sending a renewal request to 255.255.255.255 since it hasn't heard back from the cloud controller.
Cloud controller receives the 255.255.255.255 request and sends a third response.
The instance finally gives up.
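Piecing that conversation together was mostly a matter of grepping both syslogs and lining the entries up by timestamp. A rough sketch of the kind of commands involved (log paths and service names are assumptions based on a stock Ubuntu / nova-network setup of that era):

    # On the instance: the dhclient side of the conversation
    grep dhclient /var/log/syslog > instance-dhcp.log

    # On the cloud controller: the dnsmasq (DHCP server) side
    grep dnsmasq /var/log/syslog > controller-dhcp.log

    # Merge the two into one timeline (assumes the clocks are in sync and the
    # usual "Mon DD HH:MM:SS" syslog timestamps)
    sort -k1M -k2n -k3 instance-dhcp.log controller-dhcp.log | less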
With this information in hand, we were sure that the problem had to do with DHCP. We thought that, for some reason, the instance wasn't getting a new IP address, and with no IP, it shut itself off from the network.
A quick Google search turned up this: DHCP lease errors in VLAN mode (https://lists.launchpad.net/openstack/msg11696.html), which further supported our DHCP theory.
An initial idea was to just increase the lease time. If the instance renewed its lease only once a week, the chances of this problem happening would be far smaller than with a renewal every minute. This didn't solve the problem, though; it only covered it up.
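For reference, a sketch of that workaround, assuming the nova-network dhcp_lease_time option of that era (treat the option name and placement as assumptions; the value is in seconds, one week shown):

    # On the cloud controller, in /etc/nova/nova.conf (exact file format varied
    # by release):
    #     dhcp_lease_time=604800
    # Then restart nova-network so dnsmasq is re-spawned with the longer lease:
    sudo service nova-network restart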
We decided to run tcpdump on this instance and see whether we could catch it in action again. Sure enough, we did.
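Roughly the kind of capture involved (the interface name and capture path are illustrative, not the exact commands used):

    # Capture everything on the instance's interface to disk so the trace
    # survives the instance falling off the network
    tcpdump -n -i eth0 -s 0 -w /var/tmp/issue.pcap &

    # Afterwards, pull just the DHCP chatter back out of the capture
    tcpdump -n -r /var/tmp/issue.pcap port 67 or port 68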
The tcpdump looked very, very weird. In short, it looked as though network communication stopped before the instance tried to renew its IP. Since there is so much DHCP chatter from a one-minute lease, it's very hard to confirm, but even with only milliseconds' difference between packets, if one packet arrives first, it arrives first, and if that packet reported network issues, then the failure had to have started before DHCP.
Additionally, the instance in question was responsible for a very, very large backup job each night. While "the Issue" (as we were now calling it) didn't happen exactly when the backup happened, it was close enough (a few hours) that we couldn't ignore it.
More days went by and we caught the Issue in action more and more. We found that dhclient was not running after the Issue happened. Now we were back to thinking it was a DHCP issue. Running /etc/init.d/networking restart brought everything back up and running.
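A rough sketch of the check-and-recover dance on the instance (the dhclient check is illustrative; the restart is the command mentioned above):

    # After the Issue hits, there is no dhclient process left on the instance
    ps aux | grep '[d]hclient'

    # Restarting networking brings the interface (and dhclient) back
    /etc/init.d/networking restart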
Ever have one of those days where all of a sudden you get the Google results you were looking for? Well, that's what happened here. I was looking for information on dhclient and why it dies when it can't renew its lease, and all of a sudden I found a bunch of OpenStack and dnsmasq discussions that were identical to the problem we were seeing!
Problem with Heavy Network IO and Dnsmasq (http://www.gossamer-threads.com/lists/openstack/operators/18197)
instances losing IP address while running, due to No DHCPOFFER (http://www.gossamer-threads.com/lists/openstack/dev/14696)
Seriously, Google.
This bug report was the key to everything: KVM images lose connectivity with bridged network (https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978)
It was funny to read the report. It was full of people who had some strange network problem but didn't quite explain it in the same way.
So it was a QEMU/KVM bug.
At the same time I found the bug report, a co-worker was able to successfully reproduce the Issue! How? He used iperf to spew a ton of bandwidth at an instance. Within 30 minutes, the instance just disappeared from the network.
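The reproduction recipe was roughly this (the address, stream count, and duration are illustrative):

    # On the instance: run an iperf server
    iperf -s

    # From another machine: blast the instance with parallel TCP streams
    iperf -c 10.0.0.5 -P 8 -t 3600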
Armed with a patched QEMU and a way to reproduce, we set out to see if we had finally solved the Issue. After 48 straight hours of hammering the instance with bandwidth, we were confident. The rest is history. You can search the bug report for "joe" to find my comments and actual tests.