Auto-Resuming in Unreliable Network

CloverETL 4.7.0

In version 4.4, we added auto-resuming of suspended nodes.

Timeline of the scenario:

NodeB is suspended after a connection loss.
0s NodeA successfully reestablishes the connection to NodeB.
120s NodeA changes the status of NodeB to "forced_resume".
NodeB attempts to resume itself if the maximum auto-resume count has not been reached.
If the connection is lost again, the cycle repeats. If the maximum auto-resume count is exceeded, the node remains suspended until the counter is reset. This limit prevents endless suspend-resume cycles.
240m The auto-resume counter is reset.
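
The timeline can be read as a small state machine. The following Python sketch is only an illustration of that reading, not CloverETL code: the class `AutoResumeModel` and its method names are hypothetical, and it assumes that the 240-minute reset interval is measured from the most recent resume attempt, which the timeline does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional

MS_PER_MINUTE = 60_000


@dataclass
class AutoResumeModel:
    """Hypothetical model of how NodeA tracks a suspended NodeB."""
    interval_before_autoresume_ms: int = 120_000  # accessibility window
    max_autoresume_count: int = 3                 # allowed resume attempts
    interval_reset_count_min: int = 240           # counter reset interval

    suspended: bool = False
    attempts: int = 0
    reachable_since_ms: Optional[int] = None
    last_attempt_ms: Optional[int] = None

    def connection_lost(self) -> None:
        # The node is suspended as soon as the connection is lost.
        self.suspended = True
        self.reachable_since_ms = None

    def connection_restored(self, now_ms: int) -> None:
        # Start of the accessibility window (time 0s in the timeline).
        self.reachable_since_ms = now_ms

    def poll(self, now_ms: int) -> bool:
        """Return True if a forced resume is issued at now_ms."""
        # Assumption: the counter resets once the reset interval has
        # elapsed since the most recent resume attempt.
        reset_ms = self.interval_reset_count_min * MS_PER_MINUTE
        if (self.last_attempt_ms is not None
                and now_ms - self.last_attempt_ms >= reset_ms):
            self.attempts = 0
            self.last_attempt_ms = None

        if not self.suspended or self.reachable_since_ms is None:
            return False
        if now_ms - self.reachable_since_ms < self.interval_before_autoresume_ms:
            return False  # not accessible for long enough yet
        if self.attempts >= self.max_autoresume_count:
            return False  # stays suspended until the counter is reset

        self.attempts += 1
        self.last_attempt_ms = now_ms
        self.suspended = False
        return True
```

With the default values, a node that loses its connection repeatedly is resumed three times; a fourth attempt is refused until the counter is reset.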

The following configuration properties tune the time intervals and the attempt limit mentioned above:

cluster.node.check.intervalBeforeAutoresume - time (in ms) for which the node has to be accessible before it is forcibly resumed (120000 by default)
cluster.node.check.maxAutoresumeCount - how many times a node may try to auto-resume itself (3 by default)
cluster.node.check.intervalResetAutoresumeCount - time (in minutes) before the auto-resume counter is reset (240 by default)
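
Note that the first property is given in milliseconds and the third in minutes. The Python sketch below shows one way to read the three properties from `key=value` text and normalize them to durations. It is an illustration only: the helper names `load_autoresume_settings` and `as_durations` are hypothetical, the override values in `example` are made up, and 240 is assumed to be the default because it is the value shown in the property list.

```python
from datetime import timedelta

PREFIX = "cluster.node.check."

# Defaults as listed above (240 is assumed to be a default).
DEFAULTS = {
    PREFIX + "intervalBeforeAutoresume": 120000,    # milliseconds
    PREFIX + "maxAutoresumeCount": 3,               # attempts
    PREFIX + "intervalResetAutoresumeCount": 240,   # minutes
}


def load_autoresume_settings(text: str) -> dict:
    """Parse key=value lines, keeping defaults for missing keys."""
    settings = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key in settings:
            settings[key] = int(value.strip())
    return settings


def as_durations(settings: dict) -> dict:
    """Convert the two intervals to timedelta, respecting their units."""
    return {
        "before_autoresume": timedelta(
            milliseconds=settings[PREFIX + "intervalBeforeAutoresume"]),
        "reset_counter": timedelta(
            minutes=settings[PREFIX + "intervalResetAutoresumeCount"]),
    }


# Illustrative overrides: a shorter accessibility window, more attempts.
example = """
cluster.node.check.intervalBeforeAutoresume=60000
cluster.node.check.maxAutoresumeCount=5
"""

settings = load_autoresume_settings(example)
durations = as_durations(settings)
```

Here `durations["before_autoresume"]` is one minute, and `durations["reset_counter"]` stays at the four-hour default because the example does not override it.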
