Troubleshooting FOS

This section describes the most common problems, along with their causes and solutions. Together with the earlier section describing how to test FOS, it should help you resolve most situations.

Common Problems / Questions

Please Note

Most of the components that make up Piranha can be run manually (provided you supply the same option switches that they require when run as daemons). Most components also support a -v and/or a --norun option to aid in the debugging process.
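
For example, assuming the daemon you are debugging supports the options just mentioned (and supplying whatever additional switches it normally requires), a manual debugging run might look like this:

    # Run fos in the foreground; -v and --norun are the debugging
    # options noted above -- their exact behavior varies by component
    fos -v --norun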

The nanny daemon keeps reporting that a service is not working

If you are using send/expect strings as part of service testing, try using FOS without them. Without these strings, FOS will test the service by connecting to the service's port only. If this resolves the problem, it is likely that the service is not passing your send/expect string testing.
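
For illustration, a send/expect pair in a hypothetical FOS service entry might look like the lines below (the key names follow this document; the surrounding /etc/lvs.cf stanza is omitted and the strings are invented). Commenting out or removing these two lines reduces the test to a plain port connection:

    send = "GET / HTTP/1.0\r\n\r\n"
    expect = "HTTP"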

FOS keeps performing failovers even though the services are running

Even though the services are running, it does not necessarily mean that FOS can connect to them. The best test for this is to bring up FOS on just one cluster node (which will cause all the services to be started), and use telnet from the other cluster node to attempt to connect to the TCP/IP port defined for each service. If telnet returns a Connection refused message, then it is likely that the nanny daemon will also fail to connect to the service. Resolve the situation preventing telnet from connecting and the nanny daemon should stop initiating failovers.
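
For example, if a service is defined on port 80 and the other node's address is 192.168.1.2 (both values hypothetical), the test would be:

    telnet 192.168.1.2 80
    # Success: "Connected to 192.168.1.2." followed by any service output
    # Failure: "telnet: Unable to connect to remote host: Connection refused"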

The most common causes of this failure are that the service has been configured to use a different TCP/IP port number than the one specified in the Piranha configuration file, or that the Piranha configuration files are not identical on the cluster nodes.

Next, shut down Piranha on the active node and bring it up on the inactive node, then perform the telnet tests in the opposite direction.

Also check the system log files (on both nodes). For any failover condition, Piranha will log the reason.

If Apache is one of the services, make sure that the LISTEN port number matches the port number in the Piranha configuration file.
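
For example, if the Piranha configuration file defines the http service on port 80 (a hypothetical value), the corresponding directive in httpd.conf must agree:

    # /etc/httpd/conf/httpd.conf
    Listen 80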

FOS keeps "ping-ponging" the failover back and forth

First see the previous section, "FOS keeps performing failovers even though the services are running", and perform the telnet testing described there.

If you still experience this problem, make sure that your keepalive, deadtime, and the service's timeout values are not too short. Try increasing the values to ensure that Piranha has sufficient time to correctly determine a failure. Also note that deadtime should have a value that is a multiple of the keepalive value.
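
As an illustrative sketch (the numbers are hypothetical, not recommendations), the following values satisfy that rule, with deadtime set to three times keepalive:

    keepalive = 6
    deadtime = 18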

Piranha did not shut down correctly; how do I kill it without causing it to restart?

If you must kill the Piranha daemons manually, then you must kill them "from the top down". In other words, you must first kill pulse, then fos, and then the nanny daemons. Killing fos first (for example) will cause pulse to think a failover should occur.

Use the command kill -s SIGTERM <pid> first, before using SIGKILL (also known as -9). In most cases, killing pulse or fos with SIGTERM will cause it to automatically kill all of its own children.
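
A minimal sketch of the sequence (the PIDs are placeholders; find the real ones with ps):

    # Identify the Piranha daemons
    ps ax | grep -E 'pulse|fos|nanny'

    # Kill from the top down with SIGTERM first; in most cases pulse
    # will then terminate its own children (fos and the nanny daemons)
    kill -s SIGTERM <pulse-pid>

    # Only if a daemon survives should SIGKILL be used, again top down
    kill -9 <remaining-pid>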

When I unplug the active node, a failover occurs, but when I plug it back in, another failover occurs back to the first node

This occurs because the system you unplugged was the primary node, which caused the backup node to become active. The unplugged node was still active (even if it had lost network connectivity). Plugging it back in created the situation where both nodes were declaring themselves active; in this case, the primary node always wins the stalemate. Therefore, even though the backup node was also the currently active node, it failed over and became inactive.

There are two ways to prevent this:

  • If possible, make sure the system you unplug is the backup system. If the backup system is also the active system (meaning that the primary system is inactive), then unplugging the backup system will cause a failover to the primary, but plugging the backup system back in will not cause a second failover (because it will lose the "both nodes active" stalemate).

  • The second method will always work no matter which node is unplugged: before the node is plugged back in, first shut down Piranha. Then plug the node back into the network and restart Piranha (see the example commands below). The node will detect that there is already an active system and will start up as the inactive node.
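
Assuming Piranha is started through its init script (the path below is typical, but may differ on your installation), the sequence on the unplugged node would be:

    # Before reconnecting the node to the network:
    /etc/rc.d/init.d/pulse stop

    # Reconnect the network cable, then:
    /etc/rc.d/init.d/pulse start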

Piranha is causing the private log files for each service to fill up with connect attempt messages

ftp/inetd and httpd have individual log files. Each connection attempt (by any software, for any reason) may cause a one- or two-line entry in these log files. Because the nanny daemons must connect to each service on a regular basis as part of Piranha's service monitoring, there is no way to prevent these entries short of disabling service monitoring.

The only workarounds are to set up a cron or backup entry to "roll over" the file(s) before they fill the disk, and/or to increase the timeout parameters in Piranha's configuration file so the services are tested less often and therefore log fewer entries. Other possibilities, such as using symbolic links to the null device, could result in a loss of important security information and are not recommended.
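
One possible sketch of the first workaround, assuming the logrotate utility is in use (the file names and rotation schedule are hypothetical):

    # /etc/logrotate.d/ha-services -- rotate the service logs weekly
    /var/log/xferlog /var/log/httpd/access_log {
        weekly
        rotate 4
        compress
    }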

Can I set up web services that are independent of FOS?

Apache can be configured to start multiple httpd daemons that listen on different TCP/IP ports. This means that you can configure one port for use with Piranha, and another that acts independently for other purposes.
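
A minimal sketch of the Apache side, using hypothetical port numbers:

    # Port used by the service defined in the Piranha configuration file
    Listen 80
    # Independent port, not monitored by FOS
    Listen 8080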

Because this is more of an Apache configuration issue than a Piranha configuration issue, information on configuring Apache in this manner is beyond the scope of this document. Please see the Apache Software Foundation's website (http://www.apache.org/) for additional information on Apache configuration.

How can I set Piranha up so that I can use the Piranha Web Interface on the inactive node?

The Piranha Web Interface is designed to be used on the active node, with the resulting configuration file copied to the inactive system. It is possible to use the Piranha Web Interface on the inactive node if you start a second httpd daemon that uses a different TCP/IP port from the http service defined in the Piranha configuration file. Note that this is a similar situation to the previous question.
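
For example, a second daemon could be started with its own configuration file specifying a different port (the file name and port are hypothetical; the alternate file would contain a directive such as Listen 8080):

    httpd -f /etc/httpd/conf/httpd-admin.conf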

Can I have the services already running on the standby system, instead of having Piranha start them?

Yes, you can replace the start_cmd and stop_cmd lines in the /etc/lvs.cf file with commands that do not affect the running services. However, this results in a cluster configuration that can survive only one failover; a second failure will not cause the services to fail over to the original node.
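
A hypothetical sketch of such an entry, substituting a no-op command so that Piranha never actually starts or stops the service itself:

    start_cmd = "/bin/true"
    stop_cmd = "/bin/true"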

Error Messages

Most of the error messages are self-explanatory. Here are descriptions for some of the less obvious or more critical ones.

"Service type is not 'fos'"

You are attempting to run the fos program manually, but the Piranha configuration file does not have service = fos set.

"gratuitous xxx arps finished"

Each time a VIP address is created or removed, Piranha sends out ARP broadcasts to notify the network of the change in MAC address for that IP address. This message indicates that those broadcasts have completed.
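
If you want to confirm these broadcasts on the wire, one way (assuming tcpdump is installed on a machine on the same network segment) is:

    # Show ARP traffic, including the gratuitous broadcasts sent
    # when a VIP address is created or removed
    tcpdump -n arp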

"Incompatible heartbeat received -- other system not using identical services"

An attempt is being made (either due to mismatched configuration files or the manual startup of a Piranha component) to run one cluster node with lvs services and the other node with fos services. All nodes in a cluster must use the same Piranha cluster service.

"Notifying partner WE are taking control!"

A situation has occurred where the backup system needs to become (or already is) the active cluster node, and it is telling the primary node (which is trying to become active, or already is) that it must switch to inactive mode.

"PARTNER HAS TOLD US TO GO INACTIVE!"

This message is the partner to the one listed above, as seen from the receiving side: the backup system needs to become (or already is) the active cluster node, and it has told this node (which is trying to become active, or already is) that it must switch to inactive mode.

"Undefined backup node marked as active? -- clearing that..."

The Piranha configuration file has backup_active = 1 set, but there is no backup node IP address defined in the file; this message appears when Piranha is started with that error present. As the message implies, Piranha treats the situation as if backup_active = 0 had been set in the configuration file instead.
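
A hypothetical illustration of the inconsistency (backup_active comes from this document; the key naming the backup node's IP address, and the comment syntax, are assumed):

    # Inconsistent: backup marked active, but no backup node defined
    backup_active = 1

    # Consistent: a backup node IP address is also defined
    backup = 192.168.1.2
    backup_active = 1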

"pulse: cannot create heartbeat socket -- running as root?"

The pulse daemon cannot start because it cannot create the TCP/IP socket needed for its heartbeat. The most common reasons for this are that pulse is being started by someone with a UID other than 0 (a non-root account), or that pulse (or another Piranha daemon) is already running.
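
Two quick checks, assuming standard tools:

    # Verify that you are root (this should print 0)
    id -u

    # Check whether pulse or another Piranha daemon is already running
    ps ax | grep -E 'pulse|fos|nanny'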

"no service active & available..."

The most common cause of this message is that no services in the Piranha configuration file are set to active = 1. If all services are inactive, FOS has nothing to control or monitor.
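
For illustration, each FOS service definition in the Piranha configuration file carries this flag, and at least one service must have it enabled:

    active = 1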

"fos: no failover services defined"

Piranha has service = fos enabled, but there are no FOS services defined in the configuration file.