If you have an existing high availability (HA) infrastructure based on a shared file system (SAN file system), it is relatively easy to set up a fault-tolerant cluster of brokers using this technology. Brokers automatically configure themselves to operate in master mode or slave mode, depending on whether or not they manage to grab an exclusive lock on the underlying data directory. Replication of data is managed by the underlying SAN file system.
A storage area network (SAN) is a storage system that enables you to attach remote storage devices (such as disk drives or tape drives) to a computer, making them appear as if they were local storage devices. A distinctive feature of the SAN architecture is that it combines data centres, physically separated by large distances, into a single storage network. With the addition of suitable software or hardware, a SAN system can be designed to provide data replication across multiple data centres, making it ideal for disaster recovery.
Because of the exceptional demands for high speed and reliability in a storage network (where gigabit bandwidths are required), SANs were originally built using dedicated fibre-optic or twisted copper-wire cables, and associated protocols, such as Fibre Channel (FC), were developed for this purpose. Alternative network protocols, such as FCoE and iSCSI, have also been developed to enable SANs to be built over high-speed Ethernet networks.
For more details about SANs, see the storage area network Wikipedia article.
The SAN itself defines only block-level access to storage devices. Another layer of software, the SAN file system or shared disk file system, is needed to provide the file-level access, implementing the file and directory abstractions. Because the SAN file system can be shared between multiple computers, it must also be capable of regulating concurrent access to files. File locking is, therefore, an important feature of a SAN file system.
A SAN file system must implement an efficient and reliable system of file locking to ensure that different computers cannot write to the same file at the same time. The shared file system master/slave failover pattern depends on a reliable file locking mechanism in order to function correctly.
Warning: OCFS2 is incompatible with this failover pattern, because it does not support mutex file locking from Java.
Warning: NFSv3 is incompatible with this failover pattern. If a master broker (an NFSv3 client) terminates abnormally, the NFSv3 server does not time out the lock held by that client. This renders the Fuse Message Broker data directory inaccessible, because the slave brokers cannot acquire the lock and therefore cannot start up. In this case, the only way to unblock the failover cluster is to reboot all broker instances. NFSv4, on the other hand, is compatible with this failover pattern, because its design includes lock timeouts: when an NFSv4 client holding a lock terminates abnormally, the lock is automatically released after 30 seconds, allowing another NFSv4 client to grab it.
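Given these constraints, it is worth verifying, before you deploy brokers, that a candidate shared file system supports exclusive file locking from Java at all. The following minimal probe is an illustration only (the path /sharedFileSystem/lock-probe is a hypothetical file on the shared mount), not part of the broker itself:

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical probe file on the shared mount.
        try (RandomAccessFile file =
                 new RandomAccessFile("/sharedFileSystem/lock-probe", "rw");
             FileChannel channel = file.getChannel()) {
            FileLock lock = channel.tryLock(); // non-blocking, exclusive
            if (lock != null) {
                System.out.println("Exclusive lock acquired: locking works here.");
                lock.release();
            } else {
                System.out.println("Lock already held by another process.");
            }
        }
    }
}

Running the probe on two machines at the same time should result in one acquiring the lock while the other sees it as held; on a file system that does not support Java file locking, tryLock() may instead throw an IOException.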
In the shared file system master/slave pattern, there is nothing special to distinguish a master broker from the slave brokers. There can be any number of brokers in a failover cluster. Membership of a particular failover cluster is defined by the fact that all of the brokers in the cluster use the same persistence layer and store their data in the same shared directory. The brokers in the cluster therefore compete to grab the exclusive lock on the data file. The first broker to grab the exclusive lock is the master and all of the other brokers in the cluster are the slaves. The master and the slaves now behave as follows:
The master retains the exclusive lock on the data file, preventing the other brokers from accessing the data. The master starts up its transport connectors and network connectors, enabling other messaging clients and message brokers to connect to it.
The slaves keep attempting to grab the lock on the data file, but they do not succeed as long as the master is running. The slaves do not start up any transport connectors or network connectors and are thus inaccessible to messaging clients and brokers.
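To make the election concrete, here is a short sketch of the kind of loop each broker effectively runs. This is illustrative only, not the broker's actual implementation, and the lock file name under the shared data directory is an assumption:

import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class MasterElectionSketch {
    public static void main(String[] args) throws Exception {
        // Assumed lock file inside the shared data directory.
        FileChannel channel = new RandomAccessFile(
                "/sharedFileSystem/sharedBrokerData/lock", "rw").getChannel();
        FileLock lock = null;
        while (lock == null) {
            lock = channel.tryLock(); // non-blocking attempt at the exclusive lock
            if (lock == null) {
                // Slave mode: the master holds the lock; wait and retry.
                Thread.sleep(1000);
            }
        }
        // Master mode: keep holding the lock and, in a real broker,
        // start the transport and network connectors here.
        System.out.println("Lock acquired: running as master.");
    }
}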
The only condition that brokers in a cluster must satisfy is that they must all use the same persistence layer, and the persistence layer must store its data in a directory on a shared file system. For example, assuming that /sharedFileSystem/sharedBrokerData is a directory in a shared file system, you could configure a KahaDB persistence layer as follows:
Example 4.5. Shared file cluster configuration
<persistenceAdapter>
  <kahaDB directory="/sharedFileSystem/sharedBrokerData"/>
</persistenceAdapter>
Alternatively, you can use the AMQ persistence layer for this failover scenario:
Example 4.6. Alternate shared file cluster configuration
<persistenceAdapter>
  <amqPersistenceAdapter directory="/sharedFileSystem/sharedBrokerData"/>
</persistenceAdapter>
Clients of the failover cluster must be configured with a failover URL that lists the URLs for all of the brokers in the cluster. For example, assuming that there are three brokers in the cluster, deployed on the hosts broker1, broker2, and broker3, all listening on IP port 61616, you could use the following failover URL for the clients:
failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)
In this case, it does not matter in which order the clients attempt to connect to the brokers, because the identity of the master broker is determined by chance: that is, by whichever broker is the first to grab the exclusive lock on the data file.
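As a sketch of what such a client looks like in code, the following JMS producer passes the failover URL straight to the connection factory, assuming the standard ActiveMQ client library; the transport then connects to whichever broker currently holds the master lock. The queue name TEST.QUEUE is a hypothetical example:

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FailoverClient {
    public static void main(String[] args) throws Exception {
        // The failover: URL lists every broker in the cluster.
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
            "failover:(tcp://broker1:61616,tcp://broker2:61616,tcp://broker3:61616)");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        // TEST.QUEUE is a hypothetical destination used for illustration.
        MessageProducer producer =
            session.createProducer(session.createQueue("TEST.QUEUE"));
        producer.send(session.createTextMessage("hello"));
        connection.close();
    }
}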
Figure 4.3 shows the initial state of a shared file system master/slave cluster. When all of the brokers in the cluster are started, one of them grabs the exclusive lock on the broker data file, thus becoming the master. All of the other brokers in the cluster remain slaves and pause while waiting for the exclusive lock to be freed up. Only the master starts its transport connectors, so all of the clients connect to it.
Figure 4.4 shows the state of the cluster after the original master has shut down or failed. As soon as the master gives up the lock (or after a suitable timeout, if the master crashes), the lock on the broker data file frees up and another broker in the cluster grabs the lock and gets promoted to master (broker2 in the figure).
After the clients lose their connection to the original master, they automatically try all of the other brokers listed in the failover URL. This enables them to find and connect to the new master.
You can restart the failed master at any time and it will rejoin the cluster. Initially, however, it will have the status of a slave broker, because one of the other brokers already owns the exclusive lock on the broker data file, as shown in Figure 4.5.