Jobflow or master execution executing child jobs on another cluster nodes must be notified about status changes of their child jobs. When the asynchronous messaging doesn't work, events from the child jobs aren't delivered, so parent jobs keep running. When the network works again, the child job events may be re-transmitted, so hung parent job may be finished. However the network malfunction may be so long, that the event can't be re-transmitted.
job A running on NodeA executes job B running on NodeB
network between NodeA and NodeB is down from some reason
job B finishes and sends the “finished” event, however it can't be delivered to NodeA – event is stored in the “sent events buffer”
Since the network is down, also heart-beat can't be delivered and maybe HTTP connections can't be established, the cluster reacts as described in the sections above. Even though the nodes may be suspended, parent job A keeps waiting for the event from job B
Network finally starts working and since all undelivered events are in the “sent events buffer”, they are re-transmitted and all of them are finally delivered. Parent job A is notified and proceeds. It may fail later, since some cluster nodes may be suspended.
Network finally starts working, but number of the events sent during the malfunction exceeded “sent events buffer” limit size. So some messages are lost and won't be re-transmitted. Thus the buffer size limit should be higher in the environment with unreliable network. Default buffer size limit is 10000 events. It should be enough for thousands of simple job executions, basically it depends on number of job phases. Each job execution produces at least 3 events (job started, phase finished, job finished). Please note that there are also some other events fired occasionally (configuration changes, suspending, resuming, cache invalidation). Also messaging layer itself stores own messages to the buffer, but it's just tens messages in a hour. Heart-beat is not stored in the buffer.
There is also inbound events buffer used as temporary storage for events, so the events may be delivered in correct order when some events can't be delivered at the moment. When the cluster node is inaccessible, the inbound buffer is released after timeout, which is set to 1 hour by default.
Node B is restarted, so all undelivered events in the buffer are lost.
cluster.jgroups.protocol.NAKACK.gc_lag
– limit size of the sent events buffer;
Please note that each stored message takes 2kB of heap memory (default limit is 10000 events)
cluster.jgroups.protocol.NAKACK.xmit_table_obsolete_member_timeout
– inbound buffer timeout of unaccessible cluster node