This documentation is a quick overview of how to add a new fault-tolerance protocol within
ProActive. A more complete version should be released with the version 3.3. If you wish to get
more informations, please feel free to send a mail to <[email protected]>
.
Fault-tolerance mechanism in ProActive is mainly based on the org.objectweb.proactive.core.body.ft.protocols.FTManager
class. This class contains
several hooks that are called before and after the main actions of an active object, e.g. sending
or receiving a message, serving a request, etc.
For example, with the Pessimistic Message Logging protocol (PML), messages are logged just before the delivery of the message to the active object. Main methods for the FTManager of the PML protocol are then:
/** * Message must be synchronously logged before being delivered. * The LatestRcvdIndex table is updated * @see org.objectweb.proactive.core.body.ft.protocols.FTManager#onDeli\ verReply(org.objectweb.proactive.core.body.reply.Reply) */ public int onDeliverReply(Reply reply) { // if the ao is recovering, message are not logged if (!this.isRecovering) { try { // log the message this.storage.storeReply(this.ownerID, reply); // update latestIndex table this.updateLatestRvdIndexTable(reply); } catch (RemoteException e) { e.printStackTrace(); } } return 0; } /** * Message must be synchronously logged before being delivered. * The LatestRcvdIndex table is updated * @see org.objectweb.proactive.core.body.ft.protocols.FTManager#onRece\ iveRequest(org.objectweb.proactive.core.body.request.Request) */ public int onDeliverRequest(Request request) { // if the ao is recovering, message are not logged if (!this.isRecovering) { try { // log the message this.storage.storeRequest(this.ownerID, request); // update latestIndex table this.updateLatestRvdIndexTable(request); } catch (RemoteException e) { e.printStackTrace(); } } return 0; }
The local variable this.storage
is remote reference to the checkpoint server. The
FTManager class contains a reference to each fault-tolerance server: fault-detector, checkpoint
storage and localization server. Those reference are initialized during the creation of the
active object.
A FTManager must define also a beforeRestartAfterRecovery()
method, which is
called when an active object is recovered. This method usually restore the state of the active
object so as to be consistent with the others active objects of the application.
For example, with the PML protocol, all the messages logged before the failure must be delivered
to the active object. The method beforeRestartAfterRecovery()
thus looks like:
/** * Message logs are contained in the checkpoint info structure. */ public int beforeRestartAfterRecovery(CheckpointInfo ci, int inc) { // recovery mode: received message no longer logged this.isRecovering = true; //get messages List replies = ci.getReplyLog(); List request = ci.getRequestLog(); // add messages in the body context Iterator itRequest = request.iterator(); BlockingRequestQueue queue = owner.getRequestQueue(); // requests while (itRequest.hasNext()) { queue.add((Request) (itRequest.next())); } // replies Iterator itReplies = replies.iterator(); FuturePool fp = owner.getFuturePool(); try { while (itReplies.hasNext()) { Reply current = (Reply) (itReplies.next()); fp.receiveFutureValue(current.getSequenceNumber(), current.getSourceBodyID(), current.getResult(), current\ ); } } catch (IOException e) { e.printStackTrace(); } // normal mode this.isRecovering = false; // enable communication this.owner.acceptCommunication(); try { // update servers this.location.updateLocation(ownerID, owner.getRemoteAdapter())\ ; this.recovery.updateState(ownerID, RecoveryProcess.RUNNING); } catch (RemoteException e) { logger.error('Unable to connect with location server'); e.printStackTrace(); } return 0; }
The parameter ci
is a org.objectweb.proactive.core.body.ft.checkpointing.CheckpointInfo
. This object
contains all the informations linked to the checkpoint used for recovering the active object, and
is used to restore its state. The programmer might defines his own class implementing CheckpointInfo
, to add needed informations, depending on the protocol.
ProActive include a global server that provide fault detection, active object localization,
resource service and checkpoint storage. For developing a new fault-tolerance protocol, the
programmer might specify the behavior of the checkpoint storage by extending the class org.objectweb.proactive.core.body.ft.servers.storage.CheckpointServerImpl
. For
example, only for the PML protocol and not for the CIC protocol, the checkpoint server must be
able to log synchronously messages. The other parts of the server can be used directly.
To specify the recovery algorithm, the programmer must extends the org.objectweb.proactive.core.body.ft.servers.recovery.RecoveryProcessImpl
. In the
case of the CIC protocol, all the active object of the application must recover after one
failure, while only the faulty process must restart with the PML protocol; this specific behavior
is coded in the recovery process.
© 2001-2007 INRIA Sophia Antipolis All Rights Reserved