Although the title of this story is much more dramatic than the actual event, I don't think, or at least hope, that I'll have the opportunity to use "Valentine's Day Massacre" in a title again.
This past Valentine's Day, I received an alert that a compute node was no longer available in the cloud, meaning that nova-manage service list showed this particular node with a status of XXX.
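For reference, the output looked roughly like the following; the hostname and timestamp are made up here, and the exact columns vary between releases, but XXX is what nova-manage prints for a service that has stopped checking in:

$ nova-manage service list
Binary           Host               Zone    Status    State  Updated_At
nova-compute     c100.example.com   nova    enabled   XXX    2014-02-15 01:35:17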
I logged in to the cloud controller and was able to both ping and SSH into the problematic compute node, which seemed very odd. Usually when I receive this type of alert, the compute node has totally locked up and is inaccessible.
After a few minutes of troubleshooting, I saw the following details:
- A user had recently tried launching a CentOS instance on that node
- That user was the only user on the node (it was a new node)
- The load shot up to 8 right before I received the alert
- The bonded 10 Gb network device (bond0) was in a DOWN state
- The 1 Gb NIC was still alive and active
I looked at the status of both NICs in the bonded pair and saw that neither was able to communicate with the switch port. Since each NIC in the bond is connected to a separate switch, I figured the chance of a port dying on both switches at the same time was quite improbable. I concluded that the 10 Gb dual-port NIC had died and needed to be replaced. I created a ticket for the hardware support department at the data center where the node was hosted. I felt lucky that this was a new node and no one else was hosted on it yet.
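For anyone who wants to check the same thing, the bonding status file and ethtool are the quickest way to see it from the node itself; the slave interface name below is just a placeholder for whatever your bond members are called:

$ cat /proc/net/bonding/bond0   # bonding mode plus the MII status of each slave NIC
$ ip link show bond0            # whether the bond itself is administratively or operationally DOWN
$ ethtool eth2                  # "Link detected: no" means the NIC sees no carrier from its switch port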
An hour later I received the same alert, but for another compute node. Crap. OK, now there's definitely a problem going on. Just as with the original node, I was able to log in by SSH. The bond0 NIC was DOWN, but the 1 Gb NIC was active.
And the best part: the same user had just tried creating a CentOS instance. What?
I was totally confused at this point, so I texted our network admin to see if he was available to help. He logged in to both switches and immediately saw the problem: the switches had detected spanning tree packets coming from the two compute nodes and shut down the ports to prevent spanning tree loops:
Feb 15 01:40:18 SW-1 Stp: %SPANTREE-4-BLOCK_BPDUGUARD: Received BPDU packet on Port-Channel35 with BPDU guard enabled. Disabling interface. (source mac fa:16:3e:24:e7:22)
Feb 15 01:40:18 SW-1 Ebra: %ETH-4-ERRDISABLE: bpduguard error detected on Port-Channel35.
Feb 15 01:40:18 SW-1 Mlag: %MLAG-4-INTF_INACTIVE_LOCAL: Local interface Port-Channel35 is link down. MLAG 35 is inactive.
Feb 15 01:40:18 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Port-Channel35 (Server35), changed state to down
Feb 15 01:40:19 SW-1 Stp: %SPANTREE-6-INTERFACE_DEL: Interface Port-Channel35 has been removed from instance MST0
Feb 15 01:40:19 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet35 (Server35), changed state to down
He reenabled the switch ports, and the two compute nodes immediately came back to life.
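For the curious: on this class of switch (the log format is Arista EOS), recovering from a BPDU-guard errdisable essentially means bouncing the affected interface. Something along these lines would do it, though I'm reconstructing the commands rather than quoting exactly what he ran:

SW-1(config)# interface Port-Channel35
SW-1(config-if-Po35)# shutdown
SW-1(config-if-Po35)# no shutdown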
Unfortunately, this story has an open ending... we're still looking into why the CentOS image was sending out spanning tree packets. Further, we're researching a proper way to prevent this from happening. It's a bigger issue than one might think. While it's extremely important for switches to prevent spanning tree loops, it's very problematic to have an entire compute node cut off from the network when this happens. If a compute node is hosting 100 instances and one of them sends a spanning tree packet, that instance has effectively DDoS'd the other 99 instances.
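One idea on the table (only a sketch at this point, not something we've deployed) is to drop BPDUs on the compute node itself before they ever reach the physical NIC, so a misbehaving instance can't trip BPDU guard upstream. With Linux bridge networking, that could be as simple as an ebtables rule matching the STP multicast destination MAC:

$ ebtables -A FORWARD -d 01:80:c2:00:00:00 -j DROP   # drop STP BPDUs bridged from instance tap devices

Whether the host bridge, the hypervisor's virtual switch, or the switch-side configuration is the right place to filter is exactly the part we're still researching.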
This is an ongoing and hot topic in networking circles—especially with the rise of virtualization and virtual switches.