By default, Compute uses the database driver to track node liveness.
In a compute worker, this driver periodically sends a db update command
to the database, saying "I'm OK" with a timestamp. A pre-defined
timeout (service_down_time) determines if a node is dead.
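
The check described above can be sketched roughly as follows. This is a minimal illustration, not the driver's actual code: the function name is_up and its arguments are hypothetical, and the only value taken from Compute is the service_down_time timeout, shown here with its default of 60 seconds.

```python
from datetime import datetime, timedelta, timezone

# Timeout in seconds; 60 is the default value of service_down_time.
SERVICE_DOWN_TIME = 60


def is_up(last_heartbeat, now, service_down_time=SERVICE_DOWN_TIME):
    """Return True if the last "I'm OK" timestamp is within the timeout.

    Hypothetical helper for illustration; both arguments are datetimes.
    """
    elapsed = (now - last_heartbeat).total_seconds()
    return elapsed <= service_down_time


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_up(now - timedelta(seconds=15), now))  # recent heartbeat
print(is_up(now - timedelta(seconds=90), now))  # heartbeat older than timeout
```

A node whose last timestamp is 15 seconds old is treated as alive, while one whose last timestamp is 90 seconds old is treated as dead.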
The driver has limitations that may or may not matter, depending on your setup. The more compute worker nodes you have, the more pressure you put on the database. By default, the timeout is 60 seconds, so it might take some time to detect node failures. You could reduce the timeout value, but then the workers must also update the database more frequently, which again increases the database workload.
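
To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes every worker writes exactly one heartbeat per update interval and ignores all other database traffic; the function name and the node counts and intervals below are hypothetical values chosen only for illustration.

```python
def heartbeat_writes_per_second(num_nodes, update_interval):
    """Approximate database writes per second caused by heartbeats alone.

    Assumes each node sends one update every `update_interval` seconds.
    """
    if update_interval <= 0:
        raise ValueError("update_interval must be positive")
    return num_nodes / update_interval


# (nodes, update interval in seconds): hypothetical deployments
for nodes, interval in [(100, 10), (1000, 10), (1000, 2)]:
    rate = heartbeat_writes_per_second(nodes, interval)
    print(f"{nodes} nodes, one update every {interval}s: {rate:.0f} writes/s")
```

Under these assumptions, growing from 100 to 1000 nodes multiplies the heartbeat load by ten, and shortening the interval from 10 to 2 seconds to allow a shorter timeout multiplies it by five again.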
Fundamentally, the data that describes whether the node is alive is "transient": after a few seconds, this data is obsolete. Other data in the database is persistent, such as the entries that describe who owns which VMs. However, because the liveness data is stored in the same database, it is treated the same way as the persistent data. The ServiceGroup abstraction aims to treat the two kinds of data separately.