Chapter 7. Failover Services (FOS)
While configuring FOS is a relatively straightforward process, unless care is taken it is easy to define a non-functioning cluster, or one that "ping-pongs" a failover back and forth between the nodes. This is partly because all the services fail over as a group, so in order for FOS to operate, all services on at least one node in the cluster must come up correctly. If one service on node A fails to come up (due to a configuration error) and a failover occurs, but a different service on node B also fails to come up, the result is a cluster where neither node comes up correctly and the failover bounces back and forth.
Before starting Piranha using a newly created or modified configuration file, some basic tests should be performed. These tests should be carried out on both nodes in the cluster.
Ensure that both systems are using the same configuration by copying the Piranha configuration file to the other cluster node(s). You can use either rcp or scp, depending on whether you've configured rsh or ssh:
# rcp /etc/lvs.cf other.cluster.node:/etc/lvs.cf
other.cluster.node: Connection refused
Trying krb4 rcp...
other.cluster.node: Connection refused
trying normal rcp (/usr/bin/rcp)
#
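If you have configured ssh instead, a roughly equivalent scp invocation (the host name here is only a placeholder for your other cluster node) would be:

# scp /etc/lvs.cf other.cluster.node:/etc/lvs.cf   # copy the same lvs.cf over ssh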
Without Piranha running, bring up the IP services by manually typing the same command that is defined in the start_cmd line for each service's entry in the configuration file. Then use the ps command and make sure the services are running as expected.
Next, type the same command that is defined in the stop_cmd line for each service's entry in the configuration file. Then use the ps command to make sure the services have been shut down properly.
To continue testing, bring up the IP services again by manually typing the same command that is defined in the start_cmd line for each service's entry in the configuration file.
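For example, if one of the services were an httpd instance whose start_cmd and stop_cmd point at the standard init script (this is only an assumption for illustration; use whatever commands your configuration file actually defines), the sequence of checks might look like this:

# /etc/rc.d/init.d/httpd start   # hypothetical start_cmd; substitute the one in /etc/lvs.cf
# ps -ax | grep httpd            # the service's processes should be listed
# /etc/rc.d/init.d/httpd stop    # hypothetical stop_cmd
# ps -ax | grep httpd            # the processes should now be gone
# /etc/rc.d/init.d/httpd start   # bring the service back up before continuing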
This test ensures that the port used by each service is correct, and that the strings used in service monitoring reflect the actual behavior of each service. One of the most common configuration errors is that a service's port number (as defined in the cluster configuration file) does not match the port number actually used by the service.
With the services already running (see the section called Starting and Stopping Services Manually), use the telnet command to attempt to connect to that service's TCP/IP port number (as defined in the configuration file). For example, if your cluster is providing http service on port 4040, use the following command to confirm that port 4040 is, in fact, in use:
# telnet localhost 4040
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
Because we didn't get a Connection refused error message, we can be confident that a service is using port 4040[1]. At this point, we can disconnect from the port in this manner:
^]
telnet> quit
Connection closed.
#
Please Note: When telnet is used to connect to some services, you may find you cannot disconnect. In this case, you will have to use the kill command on telnet.
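If you do end up stuck in such a session, one way out (shown with an obviously made-up process ID) is to find and kill the telnet process from another shell:

# ps -ax | grep telnet   # note the PID of the stuck telnet
# kill 1234              # hypothetical PID; use the one reported by ps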
If you did get a Connection refused message, then the service you started is not using that port, and Piranha will fail to connect to it as well. On the other hand, if telnet did connect successfully, then FOS will also be able to connect.
Note, however, that not every service can be considered "telnet-friendly" — some services may drop the connection after a period of time, or there might be no response at all (try pressing Enter several times). This does not necessarily indicate that the service is not functioning. Some services simply cannot be accessed in an interactive telnet session.
As a final test, try using telnet to connect to the service from the other cluster node. If this fails, then FOS will also fail. Resolve any problems preventing telnet from connecting and it is likely that the nanny daemon will also be able to connect.
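For instance, using the same hypothetical port 4040 service, the test from the other cluster node (the host name is only a placeholder) would be:

# telnet active.cluster.node 4040   # run from the other cluster node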
If you have defined send strings for a service, you can try typing the contents of the send string and see if the service responds appropriately. However, in some cases you will not be able to type it fast enough to prevent the connection from timing out and failing to process what you've typed. This does not necessarily indicate that there is any problem; just that you cannot use telnet to test. The only way to be sure is to see if the nanny daemon can use the send and expect strings successfully.
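As an illustration, suppose a web service's entry defined a send string of "GET / HTTP/1.0" and an expect string of "HTTP" (these values are purely hypothetical; use whatever your configuration file defines). In the telnet session opened above, you would type the send string and press Enter twice:

GET / HTTP/1.0

If the connection has not already timed out, the first line of the response should contain the expect string. If it never does, the only reliable check remains whether the nanny daemon can use the strings successfully, as noted above.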
Start Piranha by starting the pulse daemon on one cluster node (have that node unplugged from the network if necessary). Depending on the node you are using, FOS will either attempt to start the IP services right away, or will start the nanny daemon(s), which will then time out, initiate a failover, and start the IP services. In either case, you should end up with a running environment where pulse is running, fos is running in --active mode, and the IP service(s) are up. You should be able to see this using the ps -axw command.
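Assuming pulse is started from its usual init script (adjust the path if your installation differs), the startup and the subsequent check might look like this:

# /etc/rc.d/init.d/pulse start        # start Piranha on this node
# ps -axw | egrep 'pulse|fos|nanny'   # pulse, fos --active, and/or nanny should be listed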
You should also examine the system log file (/var/log/messages) and read the Piranha-related entries. Make sure they do not indicate any problem (see the section called Error Messages for a discussion of possible error messages). If a problem is logged, make sure you resolve it before proceeding.
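A convenient way to watch these entries as they arrive (the grep pattern is just a rough filter, not an exhaustive one) is:

# tail -f /var/log/messages | egrep 'pulse|fos|nanny'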
Repeat this process for the other cluster node.
If you disconnect both cluster nodes from the network, and start Piranha on each, you should end up with each node believing that the other has failed and both nodes should be running with active services. If you then connect the nodes to the network while they are still running, the two systems will detect each other's heartbeat. Since both nodes are claiming to be active, the node that is defined as the backup node should become inactive by shutting down its IP services and starting the nanny daemon(s).
The easiest way to test failover is to unplug the active node from the network. The inactive node should detect the loss of a heartbeat message and become active. If you then shut down Piranha on the unplugged system, reconnect it to the network, and restart Piranha, the reconnected system should detect that FOS is already active on the other node and no failover should occur to interrupt the running service.
If you reconnect the disconnected node with Piranha already running, then Piranha will detect that two active systems are running; the one defined by the backup entry of the configuration file should become inactive (even if that results in another failover).
The easiest way to cause a failover due to the loss of a single service is to stop the service by manually executing the command defined by that service's stop entry in the Piranha configuration file. The command should be issued on the active system. This will stop the service, which should cause the nanny daemon on the inactive node to log a service failure and trigger a failover.
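Continuing the hypothetical httpd example from earlier (substitute the command actually defined in the service's stop entry), on the active node you would run:

# /etc/rc.d/init.d/httpd stop   # hypothetical stop command from the service's stop entry

Then watch /var/log/messages on the inactive node for the nanny failure messages and the resulting failover.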
[1] Of course, this does not mean that the service we expected is using the port. We'll get to that in a moment.