
31.7 Removing Cyclic Dependencies

In Section 31.6.4, we mentioned that factoring the _names set and its mutex into a separate class instance does not really solve the cyclic dependency problem, at least not in general. To see why, suppose that we want to extend our factory with a new getDetails operation:
// ...

struct Details {
    PhoneEntry* proxy;
    string name;
    string phNum;
};

sequence<Details> DetailsSeq;

interface PhoneEntryFactory {
    // ...

    DetailsSeq getDetails();
};
This type of operation is common in collection managers: instead of returning a simple list of proxies, getDetails returns a sequence of structures, each of which contains not only the object’s proxy, but also some of the state of the corresponding object. The motivation for this is performance: with a plain list of proxies, the client, once it has obtained the list, is likely to immediately follow up with one or more remote calls for each object in the list in order to retrieve their state (for example, to display the list of objects to the user). Making all these additional remote procedure calls is inefficient, and an operation such as getDetails gets the job done with a single RPC instead.
To implement getDetails in the factory, we need to iterate over the set of entries and invoke the getNumber operation on each object. (These calls are collocated and therefore very efficient, so they do not suffer the performance problem that a client calling the same operations would suffer.) However, this is potentially dangerous because the following sequence of events is possible:
• Client A calls getDetails.
• The implementation of getDetails must lock _lcMutex to prevent concurrent modification of the _names set during iteration.
• Client B calls destroy on a phone entry.
• The implementation of destroy locks the entry’s mutex _m, sets the _destroyed flag, and then calls remove, which attempts to lock _lcMutex. However, _lcMutex is already locked by getDetails, so remove blocks until _lcMutex is unlocked again.
• getDetails, while iterating over its set of entries, happens to call getNumber on the entry that is currently being destroyed by client B. getNumber, in turn, tries to lock its mutex _m, which is already locked by destroy.
At this point, the server deadlocks: getDetails holds a lock on _lcMutex and waits for _m to become available, and destroy holds a lock on _m and waits for _lcMutex to become available, so neither thread can make progress.
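To make the hazard concrete, a deadlock-prone getDetails might look as follows. (This is a sketch only: the _names set, the proxy construction, and the getNumber operation follow the example of Section 31.6.4; the remaining details are our assumptions.)
DetailsSeq
PhoneEntryFactoryI::getDetails(const Current& c)
{
    Mutex::Lock lock(_lcMutex); // Held for the entire iteration.

    CommunicatorPtr comm = c.adapter->getCommunicator();

    DetailsSeq ds;
    set<string>::const_iterator i;
    for (i = _names.begin(); i != _names.end(); ++i) {
        ObjectPrx o = c.adapter->createProxy(
                            comm->stringToIdentity(*i));
        Details d;
        d.proxy = PhoneEntryPrx::uncheckedCast(o);
        d.name = *i;

        // Collocated call: getNumber locks the servant's _m. If a
        // concurrent destroy holds _m and is waiting for _lcMutex
        // inside remove, the server deadlocks here.
        d.phNum = d.proxy->getNumber();
        ds.push_back(d);
    }
    return ds;
}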
To get rid of the deadlock, we have two options:
• Rearrange the locking such that deadlock becomes impossible.
• Abandon the idea of calling back from the servants into the factory and use reaping instead.
We will explore both options in the next two sections.

31.7.1 Deadlock-Free Lock Acquisition

For our example, it is fairly easy to avoid the deadlock: instead of holding the lock for the duration of destroy, we set the _destroyed flag under protection of the lock and unlock _m again before calling remove on the factory:
void
PhoneEntryI::destroy(const Current& c)
{
    {
        Mutex::Lock lock(_m);

        if (_destroyed)
            throw ObjectNotExistException(__FILE__, __LINE__);

        _destroyed = true;

    } // _m is unlocked here.

    _factory->remove(_name, c.adapter);
}
Now deadlock is impossible because no function holds more than one lock, and no function calls another function while it holds a lock. However, rearranging locks in this fashion can be quite difficult for complex applications. In particular, if an application uses callbacks that do complex things involving several objects, it can be next to impossible to prove that the code is free of deadlocks. The same is true for applications that use condition variables and suspend threads until a condition becomes true.
The core of the problem is that concurrency can create circular locking dependencies: an operation on the factory (such as getDetails) can require the same locks as a concurrent call to destroy. This is one reason why threaded code is harder to write than sequential code: the interactions among operations require locks, but the dependencies among these locks are not obvious. In effect, locks set up an entirely separate and largely invisible set of dependencies. For example, it was easy to spot the mutual dependency between the factory and the servants due to the presence of remove; in contrast, it was much harder to spot the lurking deadlock in destroy. Worse, deadlocks may escape detection during testing and be discovered only after deployment, when it is much more expensive to rectify the problem.

31.7.2 Reaping

Instead of trying to arrange the code such that it is deadlock-free in the presence of callbacks, it is often easier to avoid the callbacks entirely and to use an approach known as reaping:
• destroy marks the servant as destroyed and removes the ASM entry as usual, but it does not call back into the factory to update the _names set.
• Whenever a collection manager operation, such as list, getDetails, or find, is called, the factory checks for destroyed servants and, if it finds any, removes them from the _names set.
Reaping can make for a much cleaner design because it avoids both the cyclic type dependency and the cyclic locking dependency.

A Simple Reaping Implementation

To implement reaping, we need to change our PhoneEntryI definition a little. It no longer has a static _factory smart pointer back to the factory (because it no longer calls remove). Instead, the servant now provides a member function _isZombie that the factory calls to check whether the servant was destroyed some time in the past:
class PhoneEntryI : public PhoneEntry {
public:
    // ...

    bool _isZombie() const;

private:
    const string _name;
    string _phNum;
    bool   _destroyed;
    Mutex  _m;
};
The implementation of _isZombie is trivial: it returns the _destroyed flag under protection of the lock:
bool
PhoneEntryI::_isZombie() const
{
    Mutex::Lock lock(_m);

    return _destroyed;
}
The destroy operation no longer calls back into the factory to update the _names set; instead, it simply sets the _destroyed flag and removes the ASM entry:
void
PhoneEntryI::destroy(const Current& c)
{
    Mutex::Lock lock(_m);

    if (_destroyed)
        throw ObjectNotExistException(__FILE__, __LINE__);

    _destroyed = true;
    c.adapter->remove(c.id);
}
The factory now, instead of storing just the names of existing servants, maintains a map that maps the name of each servant to its smart pointer:
class PhoneEntryFactoryI : public PhoneEntryFactory
{
public:
    // Constructor and Slice operations here...

private:
    typedef map<string, PhoneEntryIPtr> PMap;
    PMap _entries;
    Mutex _lcMutex;
};
During create (and other operations, such as list, getDetails, and find), we scan for zombie servants and remove them from the _entries map:
PhoneEntryPrx
PhoneEntryFactoryI::create(const string& name,
                          const string& phNum,
                          const Current& c)
{
    Mutex::Lock lock(_lcMutex);

    PhoneEntryPrx pe;
    PhoneEntryIPtr servant = new PhoneEntryI(name, phNum);

    // Try to create new object.
    //
    try {
        CommunicatorPtr comm = c.adapter->getCommunicator();
        pe = PhoneEntryPrx::uncheckedCast(c.adapter->add(
                                servant,
                                comm->stringToIdentity(name)));
    } catch (const Ice::AlreadyRegisteredException&) {
        throw PhoneEntryExists(name, phNum);
    }

    // Scan for zombies.
    //
    PMap::iterator i = _entries.begin();
    while (i != _entries.end())
    {
        if (i->second->_isZombie())
            _entries.erase(i++); // Post-increment keeps the iterator valid.
        else
            ++i;
    }
    _entries[name] = servant;

    return pe;
}
The implementations of list, getDetails, and find scan for zombies as well. Because they need to iterate over the existing entries anyway, reaping incurs essentially no extra cost:
PhoneEntries
PhoneEntryFactoryI::list(const Current& c)
{
    Mutex::Lock lock(_lcMutex);

    CommunicatorPtr comm = c.adapter->getCommunicator();

    PhoneEntries pe;
    PMap::iterator i = _entries.begin();
    while (i != _entries.end()) {
        if (i->second->_isZombie()) {
            _entries.erase(i++); // Post-increment keeps the iterator valid.
        } else {
            ObjectPrx o = c.adapter->createProxy(
                                comm->stringToIdentity(i->first));
            pe.push_back(PhoneEntryPrx::uncheckedCast(o));
            ++i;
        }
    }

    return pe;
}

// Similar for getDetails and find...
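For illustration, a getDetails implementation along the same lines might look as follows. (Again a sketch: we assume that getNumber returns the entry's phone number, as in the earlier example.) Note that the collocated call to getNumber is now harmless because destroy never waits for _lcMutex while holding a servant's _m:
DetailsSeq
PhoneEntryFactoryI::getDetails(const Current& c)
{
    Mutex::Lock lock(_lcMutex);

    CommunicatorPtr comm = c.adapter->getCommunicator();

    DetailsSeq ds;
    PMap::iterator i = _entries.begin();
    while (i != _entries.end()) {
        if (i->second->_isZombie()) {
            _entries.erase(i++);
        } else {
            Details d;
            ObjectPrx o = c.adapter->createProxy(
                                comm->stringToIdentity(i->first));
            d.proxy = PhoneEntryPrx::uncheckedCast(o);
            d.name = i->first;
            try {
                d.phNum = d.proxy->getNumber();
                ds.push_back(d);
            } catch (const ObjectNotExistException&) {
                // The entry was destroyed after the zombie check;
                // skip it. It will be reaped on the next pass.
            }
            ++i;
        }
    }
    return ds;
}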
This is a much cleaner design: there is no cyclic dependency between the factory and the servants, either explicit (in the type system) or implicit (as a locking dependency). Moreover, the implementation is easier to understand once you get used to the idea of reaping: there is no need to follow complex callbacks or to carefully analyze the order of lock acquisition. (Note that, depending on how state is maintained for servants, you may also need to reap during start-up and shutdown.)
In general, we recommend that you use a reaping approach in preference to callbacks for all but the most trivial applications: it simply is a better approach that is easier to maintain and understand.

Alternative Reaping Implementations

You may be concerned that reaping increases the cost of create from O(log n) to O(n) because create now iterates over all existing entries and locks and unlocks a mutex in each servant (whereas, previously, it simply added each new servant to the _names set). Often, this is not an issue because life cycle operations are called infrequently compared to normal operations. However, you will notice the additional cost if you have a large number of servants (in the thousands or more) and life cycle operations are called frequently.
If you find that create is a bottleneck (by profiling, not by guessing!), you can change to a more efficient implementation by adding zombie servants to a separate zombie list. Reaping then iterates over the zombie list and removes each servant in the zombie list from the _entries map before clearing the zombie list. This reduces the cost of reaping to be proportional to the number of zombie servants instead of the total number of servants. In addition, it allows us to remove the _isZombie member function and to lock and unlock _lcMutex only once instead of locking a mutex in each servant as part of _isZombie. We will see such an implementation in Section 31.10.
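As a rough sketch of this variant (the names addZombie, reap, _zombies, and _zombiesMutex are ours; Section 31.10 shows a complete implementation), destroy calls a hypothetical addZombie member function on the factory once it has marked itself destroyed:
class PhoneEntryFactoryI : public PhoneEntryFactory
{
public:
    // Constructor and Slice operations here...

    // Called by PhoneEntryI::destroy. _zombiesMutex is always the
    // innermost lock, so this call cannot create a locking cycle,
    // even if the servant still holds its own mutex _m.
    void addZombie(const string& name)
    {
        Mutex::Lock lock(_zombiesMutex);
        _zombies.push_back(name);
    }

private:
    // Called by create, list, getDetails, and find with _lcMutex
    // already locked. Cost is proportional to the number of zombies.
    void reap()
    {
        Mutex::Lock lock(_zombiesMutex);
        for (vector<string>::const_iterator i = _zombies.begin();
             i != _zombies.end(); ++i) {
            _entries.erase(*i);
        }
        _zombies.clear();
    }

    typedef map<string, PhoneEntryIPtr> PMap;
    PMap _entries;
    Mutex _lcMutex;

    vector<string> _zombies;
    Mutex _zombiesMutex;
};
Note that this variant reintroduces a call from the servant back into the factory, so the servant once again needs a smart pointer to the factory; the dedicated _zombiesMutex keeps the lock hierarchy free of cycles.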
You may also be concerned about the number of zombie servants that can accumulate in the server if create is not called for some time. For most applications, this is not a problem: the servants occupy memory but no other resources, because destroy can clean up scarce resources, such as file descriptors or network connections, before it turns the servant into a zombie. If you really need to prevent the accumulation of zombie servants, you can reap from a background thread that runs periodically, or you can count the number of zombies and trigger a reaping pass once that number exceeds some threshold.
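For example, a periodic reaper might look as follows. (A sketch only: ReapThread, the PhoneEntryFactoryIPtr smart pointer typedef (analogous to PhoneEntryIPtr), and the factory's reapZombies member function, which would lock _lcMutex and remove zombie entries, are all our assumptions.)
class ReapThread : public IceUtil::Thread
{
public:
    ReapThread(const PhoneEntryFactoryIPtr& factory)
        : _factory(factory), _done(false)
    {
    }

    virtual void run()
    {
        IceUtil::Monitor<IceUtil::Mutex>::Lock lock(_monitor);
        while (!_done) {
            // Wake up once a minute (or when terminated) and reap.
            _monitor.timedWait(IceUtil::Time::seconds(60));
            if (!_done)
                _factory->reapZombies(); // Hypothetical helper.
        }
    }

    // Called at shutdown, followed by a join on the thread.
    void terminate()
    {
        IceUtil::Monitor<IceUtil::Mutex>::Lock lock(_monitor);
        _done = true;
        _monitor.notify();
    }

private:
    const PhoneEntryFactoryIPtr _factory;
    bool _done;
    IceUtil::Monitor<IceUtil::Mutex> _monitor;
};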