At the end of 2012, Cybera (a nonprofit with a mandate to oversee the development of cyberinfrastructure in Alberta, Canada) deployed an updated OpenStack cloud for their DAIR project (http://www.canarie.ca/en/dair-program/about). A few days into production, a compute node locked up. Upon rebooting the node, I checked to see what instances were hosted on that node so I could boot them on behalf of the customer. Luckily, only one instance.
The nova reboot command wasn't working, so I used virsh, but it immediately came back with an error saying it was unable to find the backing disk. In this case, the backing disk is the glance image that is copied to /var/lib/nova/instances/_base when the image is used for the first time. Why couldn't it find it? I checked the directory, and sure enough it was gone.
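For anyone retracing these steps, here is roughly how to see which backing file a guest expects and whether it still exists in _base (the instance name instance-0000002a is a placeholder, and these commands are just one way to do the check):

# virsh domblklist instance-0000002a
# qemu-img info /var/lib/nova/instances/instance-0000002a/disk | grep backing
# ls /var/lib/nova/instances/_base/

The qemu-img output names the file under _base that the instance's qcow2 disk depends on; if that file is missing from the listing, the guest cannot boot.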
I reviewed the nova database and saw the instance's entry in the nova.instances table. The image that the instance was using matched what virsh was reporting, so no inconsistency there.
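If you want to run the same cross-check, the query amounts to something like this (column names are from the nova schema of that era; the UUID is a placeholder):

mysql> SELECT uuid, image_ref, host, vm_state FROM nova.instances WHERE uuid = '<instance-uuid>';

Here, image_ref holds the ID of the glance image the instance was booted from.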
I checked glance and noticed that this image was a snapshot that the user created. At least that was good news—this user would have been the only user affected.
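Nova normally tags the snapshots it uploads with an image_type property, so confirming this is a one-liner along these lines (placeholder UUID; exact syntax varies by python-glanceclient version):

# glance image-show <image-uuid> | grep image_type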
Finally, I checked StackTach and reviewed the user's events. They had created and deleted several snapshots, most likely experimenting. Although the timestamps didn't match up, my conclusion was that they launched their instance from the snapshot, then deleted the snapshot, and that it was somehow removed from /var/lib/nova/instances/_base as a result. None of that made sense, but it was the best I could come up with.
It turns out the reason this compute node locked up was a hardware issue. We removed it from the DAIR cloud and called Dell to have it serviced. Dell arrived and began working. Somehow or other (or a fat finger), a different compute node was bumped and rebooted. Great.
When this node fully booted, I ran through the same scenario of seeing what instances were running so that I could turn them back on. There were a total of four. Three booted and one gave an error. It was the same error as before: unable to find the backing disk. Seriously, what?
Again, it turns out that the image was a snapshot. The three other instances that successfully started were standard cloud images. Was it a problem with snapshots? That didn't make sense.
A note about DAIR's architecture: /var/lib/nova/instances is a shared NFS mount. This means that all compute nodes have access to it, which includes the _base directory. Another centralized area is /var/log/rsyslog on the cloud controller. This directory collects all OpenStack logs from all compute nodes. I wondered if there were any entries for the file that virsh was reporting.
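Grepping the centralized logs for the base-file name is all it took; the invocation below is a reconstruction (it assumes rsyslog files the logs per host under /var/log/rsyslog):

# cd /var/log/rsyslog && grep 7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10 */nova.log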
dair-ua-c03/nova.log:Dec 19 12:10:59 dair-ua-c03 2012-12-19 12:10:59 INFO nova.virt.libvirt.imagecache [-] Removing base file: /var/lib/nova/instances/_base/7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10
Ah-hah! So OpenStack was deleting it. But why?
A feature was introduced in Essex to periodically check and see whether there were any _base files not in use. If there were, nova would delete them. This idea sounds innocent enough and has some good qualities to it. But how did this feature end up turned on? It was disabled by default in Essex. As it should be. It was decided to enable it in Folsom (https://bugs.launchpad.net/nova/+bug/1029674). I cannot emphasize enough that:

Actions that delete things should not be enabled by default.
Disk space is cheap these days. Data recovery is not.
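For operators who want to make sure this behavior stays off, the relevant nova.conf knobs of that era look roughly like the following (option names are from Folsom-vintage nova; verify against the release you run):

[DEFAULT]
# Do not let the image cache manager delete unused base files
remove_unused_base_images = False
# If removal is enabled, only consider base files untouched for at least this long
remove_unused_original_minimum_age_seconds = 86400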
Secondly, DAIR's shared /var/lib/nova/instances directory contributed to the problem. Since all compute nodes have access to this directory, all compute nodes periodically review the _base directory. If there is only one instance using an image, and the node that the instance is on is down for a few minutes, that node cannot mark the image as still in use. The image therefore appears unused and is deleted. When the compute node comes back online, the instance hosted on that node is unable to start.
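If you run a shared instances directory like this, a quick way to see which base files are actually in use across every node at once is to ask each instance disk on the mount for its backing file; a rough sketch, run from any host that mounts the share:

# for disk in /var/lib/nova/instances/*/disk; do qemu-img info "$disk" | grep '^backing file'; done | sort -u

Anything in _base that shows up in that list is clearly still needed, whatever the cleanup job thinks.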