Users being able to retrieve console logs from running instances is a boon for support: often they can figure out what's happening inside their instance and fix it themselves without bothering you. Unfortunately, sometimes overzealous logging of failures can cause problems of its own.
A report came in: VMs were launching slowly, or not at all. Cue the standard checks: nothing in Nagios, but there was a spike in network traffic toward the current master of our RabbitMQ cluster. Investigation started, but soon other parts of the queue cluster were leaking memory like a sieve. Then the alert came in: the master RabbitMQ server had gone down. Connections failed over to the slave.
At that time, our control services were hosted by another team, so we had little debugging information about the master and couldn't reboot it ourselves. That team noted that it had failed without any alert, but managed to reboot it. After an hour, the cluster had returned to its normal state, and we went home for the day.
Continuing the diagnosis the next morning was kick-started by another identical failure. We quickly got the message queue running again and tried to work out why Rabbit was suffering from so much network traffic. Enabling debug logging on nova-api quickly brought understanding: a tail -f /var/log/nova/nova-api.log was scrolling by faster than we'd ever seen before. CTRL+C on that, and we could plainly see the contents of a system log spewing failures over and over again, a system log from one of our users' instances.
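For anyone retracing these steps, turning on that debug output is a small config change plus a service restart. A minimal sketch, assuming an ini-style nova.conf with the standard debug option already present (option names and paths vary by release; very early Nova used a flagfile with --debug instead):

    # Enable debug logging for nova-api, restart it, and watch the log.
    # Assumes nova.conf already carries a debug= line and the stock paths.
    sudo sed -i 's/^#* *debug *=.*/debug = True/' /etc/nova/nova.conf
    sudo service nova-api restart
    tail -f /var/log/nova/nova-api.log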
After finding the instance ID, we headed over to /var/lib/nova/instances to find the console.log:

    adm@cc12:/var/lib/nova/instances/instance-00000e05# wc -l console.log
    92890453 console.log
    adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
    5.5G console.log
Sure enough, the user had been periodically refreshing the console log page on the dashboard, and each refresh sent the 5.5 GB file across the RabbitMQ cluster to reach the dashboard.
We called the user and asked her to stop for a while, and she was happy to abandon the horribly broken VM. After that, we started monitoring the size of console logs.
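The check itself is simple enough to run from cron on each compute node. A minimal sketch, where the 50 MB threshold is an arbitrary choice rather than anything prescribed:

    #!/bin/sh
    # Report any instance console log over the size threshold, so an
    # operator hears about it before RabbitMQ does.
    THRESHOLD_MB=50
    find /var/lib/nova/instances -name console.log -size +${THRESHOLD_MB}M \
        -exec ls -sh {} \;

Wiring the output into whatever alerting already exists (the Nagios mentioned earlier is a natural fit) closes the loop.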
To this day, the issue (https://bugs.launchpad.net/nova/+bug/832507) doesn't have a permanent resolution, but we look forward to the discussion at the next summit.