public class TimedFailureMonitor extends DefaultFailureMonitor
A FailureMonitor with a time-based policy.
Note that, for safety reasons, this only sets a lower bound on when a task is determined to have failed. During an outage, system clocks can be accidentally misconfigured (for instance, when adding new nodes), so we cannot rely on system time, since we might underestimate the wait. Instead, we must reset our clock from zero whenever the framework restarts. Unfortunately, this means that if the framework is itself restarted frequently, this detector may never trigger. A monotonic clock built on ZooKeeper could solve this by recording each passing second, so that we would only need to rely on the clock advancing at one second per second, rather than on clocks being synchronized across machines.
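The persisted-clock idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `TickStore`, `InMemoryTickStore`, and `PersistedTickClock` are hypothetical names that do not exist in the SDK, and the in-memory store stands in for a ZooKeeper-backed counter.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical persistent counter; a real version might wrap a ZooKeeper node. */
interface TickStore {
    long read();
    void write(long ticks);
}

/** In-memory stand-in for the persistent store, used only for illustration. */
final class InMemoryTickStore implements TickStore {
    private final AtomicLong ticks = new AtomicLong();

    @Override
    public long read() {
        return ticks.get();
    }

    @Override
    public void write(long value) {
        ticks.set(value);
    }
}

/** Counts elapsed seconds across restarts by persisting every tick. */
final class PersistedTickClock {
    private final TickStore store;

    PersistedTickClock(TickStore store) {
        this.store = store;
    }

    /** Intended to be called once per second while the framework is running. */
    void tick() {
        store.write(store.read() + 1);
    }

    /** Total seconds observed so far, including those counted before a restart. */
    long elapsedSeconds() {
        return store.read();
    }
}
```

Because the count lives in the store rather than in process memory, a restarted framework would resume from the last recorded tick instead of from zero. Time that passes while the framework is down is not counted, so the result remains a lower bound on the real elapsed time.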
Constructor and Description |
---|
TimedFailureMonitor(java.time.Duration durationUntilFailed, StateStore stateStore, ConfigStore<ServiceSpec> configStore) Creates a new FailureMonitor that waits for at least a specified duration before deciding that the task has failed. |
Modifier and Type | Method and Description |
---|---|
boolean | hasFailed(TaskInfo terminatedTask) Determines whether the given task has failed, by tracking the time delta between the first observed failure and the current time. |
public TimedFailureMonitor(java.time.Duration durationUntilFailed, StateStore stateStore, ConfigStore<ServiceSpec> configStore)
Creates a new FailureMonitor that waits for at least a specified duration before deciding that the task has failed.
Parameters:
durationUntilFailed - The minimum amount of time which must pass before a stopped Task can be considered failed.
public boolean hasFailed(TaskInfo terminatedTask)
The first time a task is observed to have failed, we record that time in a map keyed by the task's TaskID. We then return true once at least the configured amount of time has passed since that first observation.
Specified by:
hasFailed in interface FailureMonitor
Overrides:
hasFailed in class DefaultFailureMonitor
Parameters:
terminatedTask - The task that stopped and might have failed
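The first-observation bookkeeping described for hasFailed can be illustrated with a minimal sketch. This is not the SDK's implementation: `TimedFailureSketch` is a hypothetical class, task IDs are plain strings rather than Mesos TaskID values, and the clock is injected as a `LongSupplier` (for example `System::nanoTime`) so that the behaviour can be exercised without waiting.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the time-based policy; not the SDK's class. */
final class TimedFailureSketch {
    private final long nanosUntilFailed;
    // Assumed to be a monotonic source such as System::nanoTime.
    private final LongSupplier nanoClock;
    // First time each task was seen stopped. Kept in memory only,
    // so it starts empty again after a restart.
    private final Map<String, Long> firstSeenNanos = new HashMap<>();

    TimedFailureSketch(Duration durationUntilFailed, LongSupplier nanoClock) {
        this.nanosUntilFailed = durationUntilFailed.toNanos();
        this.nanoClock = nanoClock;
    }

    /** Returns true once the task has been seen stopped for at least the configured duration. */
    synchronized boolean hasFailed(String taskId) {
        long now = nanoClock.getAsLong();
        // putIfAbsent returns null when this is the first observation of the task.
        Long previous = firstSeenNanos.putIfAbsent(taskId, now);
        long firstSeen = (previous == null) ? now : previous;
        return now - firstSeen >= nanosUntilFailed;
    }
}
```

Keeping the map in memory matches the reset-from-zero behaviour described in the class notes: after a restart every task is treated as newly observed, so the configured duration acts as a lower bound only.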