By default, Compute uses the database driver to track node liveness. In a compute worker,
this driver periodically sends a db update command to the database,
saying “I'm OK” with a timestamp. Compute uses a pre-defined timeout
(service_down_time) to determine whether a node is dead.
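The liveness check described above can be sketched roughly as follows. This is an illustrative sketch, not Compute's actual implementation: the function name is_up and its arguments are assumptions made for the example, and only the service_down_time timeout comes from the text.

```python
from datetime import datetime, timedelta, timezone

# Timeout in seconds; 60 is the default value of service_down_time.
SERVICE_DOWN_TIME = 60


def is_up(last_heartbeat, now, service_down_time=SERVICE_DOWN_TIME):
    """Return True if the node's last "I'm OK" timestamp is recent enough.

    last_heartbeat and now are timezone-aware datetimes. The name is_up
    and this signature are hypothetical, chosen for illustration only.
    """
    elapsed = (now - last_heartbeat).total_seconds()
    return elapsed <= service_down_time


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
recent = now - timedelta(seconds=30)   # updated 30 seconds ago
stale = now - timedelta(seconds=61)    # updated 61 seconds ago

print(is_up(recent, now))  # True: within the timeout
print(is_up(stale, now))   # False: treated as dead
```

A node is therefore considered dead only once its newest timestamp is older than the timeout, not at the moment it actually fails.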
The database driver has limitations that can become a problem depending on your deployment. Every compute worker node adds its own periodic updates, so the more nodes you have, the more pressure you put on the database. By default, the timeout is 60 seconds, so a node failure can go undetected for up to about a minute. You could reduce the timeout value, but you must then also make each node update the database more frequently, which again increases the database workload.
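To make the trade-off concrete, here is a back-of-the-envelope estimate of the heartbeat write load. It assumes that every node writes exactly once per update period and ignores all other database traffic; the function name and the report_interval parameter name are illustrative choices for this sketch, and the node counts are made-up inputs, not measurements.

```python
def heartbeat_writes_per_second(num_nodes, report_interval):
    """Estimate database writes per second caused by liveness updates.

    num_nodes: number of compute worker nodes sending updates.
    report_interval: seconds between updates from one node
        (illustrative parameter name for this sketch).
    """
    if report_interval <= 0:
        raise ValueError("report_interval must be positive")
    return num_nodes / report_interval


# 1000 nodes, each updating every 10 seconds.
print(heartbeat_writes_per_second(1000, 10))  # 100.0

# Halving the update period to detect failures faster doubles the load.
print(heartbeat_writes_per_second(1000, 5))   # 200.0
```

Under these assumptions the load grows linearly with the number of nodes and inversely with the update period, which is why shortening the timeout does not come for free.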
The database contains data that is both transient (whether the node is alive) and persistent (for example, entries for VM owners). With the ServiceGroup abstraction, Compute can treat each type separately.