Adding a Fault-Tolerance Protocol

This documentation is a quick overview of how to add a new fault-tolerance protocol within ProActive. A more complete version should be released with the version 3.3. If you wish to get more informations, please feel free to send a mail to .

42.1. Overview

42.1.1. Active Object side

Fault-tolerance mechanism in ProActive is mainly based on the org.objectweb.proactive.core.body.ft.protocols.FTManager class. This class contains several hooks that are called before and after the main actions of an active object, e.g. sending or receiving a message, serving a request, etc.

For example, with the Pessimistic Message Logging protocol (PML), messages are logged just before the delivery of the message to the active object. Main methods for the FTManager of the PML protocol are then:

/**
     * Message must be synchronously logged before being delivered.
     * The LatestRcvdIndex table is updated
     * @see org.objectweb.proactive.core.body.ft.protocols.FTManager#onDeli\
verReply(org.objectweb.proactive.core.body.reply.Reply)
     */
    public int onDeliverReply(Reply reply) {
        // if the ao is recovering, message are not logged
        if (!this.isRecovering) {
            try {
                // log the message
                this.storage.storeReply(this.ownerID, reply);
                // update latestIndex table
                this.updateLatestRvdIndexTable(reply);
            } catch (RemoteException e) {
                e.printStackTrace();
            }
        }
        return 0;
    }
    /**
     * Message must be synchronously logged before being delivered.
     * The LatestRcvdIndex table is updated
     * @see org.objectweb.proactive.core.body.ft.protocols.FTManager#onRece\
iveRequest(org.objectweb.proactive.core.body.request.Request)
     */
    public int onDeliverRequest(Request request) {
        // if the ao is recovering, message are not logged
        if (!this.isRecovering) {
            try {
                // log the message
                this.storage.storeRequest(this.ownerID, request);
                // update latestIndex table
                this.updateLatestRvdIndexTable(request);
            } catch (RemoteException e) {
                e.printStackTrace();
            }
        }
        return 0;
    }

The local variable this.storage is remote reference to the checkpoint server. The FTManager class contains a reference to each fault-tolerance server: fault-detector, checkpoint storage and localization server. Those reference are initialized during the creation of the active object.

A FTManager must define also a beforeRestartAfterRecovery() method, which is called when an active object is recovered. This method usually restore the state of the active object so as to be consistent with the others active objects of the application.

For example, with the PML protocol, all the messages logged before the failure must be delivered to the active object. The method beforeRestartAfterRecovery() thus looks like:

/**
     * Message logs are contained in the checkpoint info structure.
     */
    public int beforeRestartAfterRecovery(CheckpointInfo ci, int inc) {
        // recovery mode: received message no longer logged
        this.isRecovering = true;
         //get messages
        List replies = ci.getReplyLog();
        List request = ci.getRequestLog();
         // add messages in the body context
        Iterator itRequest = request.iterator();
        BlockingRequestQueue queue = owner.getRequestQueue();
        // requests
        while (itRequest.hasNext()) {
            queue.add((Request) (itRequest.next()));
        }
        // replies
        Iterator itReplies = replies.iterator();
        FuturePool fp = owner.getFuturePool();
        try {
            while (itReplies.hasNext()) {
                Reply current = (Reply) (itReplies.next());
                fp.receiveFutureValue(current.getSequenceNumber(),
                    current.getSourceBodyID(), current.getResult(), current\
);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        // normal mode
        this.isRecovering = false;
        // enable communication
        this.owner.acceptCommunication();
        try {
            // update servers
            this.location.updateLocation(ownerID, owner.getRemoteAdapter())\
;
            this.recovery.updateState(ownerID, RecoveryProcess.RUNNING);
        } catch (RemoteException e) {
            logger.error('Unable to connect with location server');
            e.printStackTrace();
        }
        return 0;
    }

The parameter ci is a org.objectweb.proactive.core.body.ft.checkpointing.CheckpointInfo . This object contains all the informations linked to the checkpoint used for recovering the active object, and is used to restore its state. The programmer might defines his own class implementing CheckpointInfo , to add needed informations, depending on the protocol.

42.1.2. Server side

ProActive include a global server that provide fault detection, active object localization, resource service and checkpoint storage. For developing a new fault-tolerance protocol, the programmer might specify the behavior of the checkpoint storage by extending the class org.objectweb.proactive.core.body.ft.servers.storage.CheckpointServerImpl . For example, only for the PML protocol and not for the CIC protocol, the checkpoint server must be able to log synchronously messages. The other parts of the server can be used directly.

To specify the recovery algorithm, the programmer must extends the org.objectweb.proactive.core.body.ft.servers.recovery.RecoveryProcessImpl . In the case of the CIC protocol, all the active object of the application must recover after one failure, while only the faulty process must restart with the PML protocol; this specific behavior is coded in the recovery process.