Troubleshooting and Tips

Nodes don’t form a cluster

See the JGroups troubleshooting guide for typical problems. When nodes don’t form a cluster, the cause is usually either that the protocol needs additional configuration or that there is a problem in the network configuration of the operating system or the network equipment (e.g., nodes cannot “see” one another via TCP).

To simplify troubleshooting of network issues, we have published a troubleshooter program. This program runs the same lower-level stack as Jenkins HA, and thus exercises the network in exactly the same fashion. When you type a line of text on stdin and press Enter, you should see the text echoed on all nodes of the cluster (including the node on which you typed it).

A good first step to diagnose the network problem is to run two instances of the troubleshooter program on the same host and see if they can communicate with each other. Then do the same on all the hosts. In this way, you can further isolate the problem.
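Independently of the troubleshooter program, a basic TCP reachability check between hosts can help rule out firewall or routing problems. The following is a minimal sketch, not part of Jenkins HA: the function name `can_connect` is made up for illustration, and the host and port you pass in are whatever your own cluster uses.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run it from each node against the other node, with the port your cluster transport listens on. A result of `False` in either direction suggests that the problem lies in the operating system or network equipment rather than in the protocol configuration.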

Using UDP/Multicast

Earlier versions of Jenkins Enterprise used UDP and IP multicast to communicate between nodes in the cluster. While this mode is fine when it works, some users (especially of CentOS/RedHat Enterprise Linux) reported problems with it. If you know UDP/multicast works well in your network environment and prefer to use it instead of the default TCP-based protocol, simply create a file $JENKINS_HOME/jgroups.xml with a JGroups configuration based on UDP, such as the following:

<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups http://jgroups.org/schema/JGroups-3.2.xsd">
  <UDP mcast_port="${jgroups.udp.mcast_port:45588}" tos="8"
       ucast_recv_buf_size="20M" ucast_send_buf_size="640K"
       mcast_recv_buf_size="25M" mcast_send_buf_size="640K"
       loopback="true" discard_incompatible_packets="true"
       max_bundle_size="64K" max_bundle_timeout="30"
       ip_ttl="${jgroups.udp.ip_ttl:8}"
       enable_bundling="true" enable_diagnostics="true"
       thread_naming_pattern="cl"
       timer_type="new" timer.min_threads="4" timer.max_threads="10"
       timer.keep_alive_time="3000" timer.queue_max_size="500"
       thread_pool.enabled="true"
       thread_pool.min_threads="2" thread_pool.max_threads="8"
       thread_pool.keep_alive_time="5000" thread_pool.queue_enabled="true"
       thread_pool.queue_max_size="10000" thread_pool.rejection_policy="discard"
       oob_thread_pool.enabled="true"
       oob_thread_pool.min_threads="1" oob_thread_pool.max_threads="8"
       oob_thread_pool.keep_alive_time="5000"
       oob_thread_pool.queue_enabled="false" oob_thread_pool.queue_max_size="100"
       oob_thread_pool.rejection_policy="Run"/>
  <CENTRAL_LOCK/>
  <PING timeout="2000" num_initial_members="3"/>
  <MERGE2 max_interval="30000" min_interval="10000"/>
  <FD_SOCK/>
  <FD_ALL/>
  <VERIFY_SUSPECT timeout="1500"/>
  <BARRIER/>
  <pbcast.NAKACK exponential_backoff="300" xmit_stagger_timeout="200"
                 use_mcast_xmit="false" discard_delivered_msgs="true"/>
  <UNICAST/>
  <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="4M"/>
  <pbcast.GMS print_local_addr="true" join_timeout="3000" view_bundling="true"/>
  <UFC max_credits="2M" min_threshold="0.4"/>
  <MFC max_credits="2M" min_threshold="0.4"/>
  <FRAG2 frag_size="60K"/>
  <pbcast.STATE_TRANSFER/>
</config>

(In this case the old $JENKINS_HOME/cluster-identity.txt will be used to distinguish multiple Jenkins installations that might be running on the same LAN.)
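Before restarting Jenkins with a custom jgroups.xml, it can be worth confirming that the file is well-formed and that its transport is the one you intended. The sketch below assumes only that the transport is the first protocol element under `<config>`, as in the example above; the helper name `transport_of` is hypothetical and this check is not something Jenkins Enterprise provides.

```python
import xml.etree.ElementTree as ET

JGROUPS_NS = "urn:org:jgroups"


def transport_of(xml_text: str) -> str:
    """Return the local name of the first protocol element in a JGroups config.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed XML,
    and ValueError if the root element or protocol list is not as expected.
    """
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{JGROUPS_NS}}}config":
        raise ValueError(f"unexpected root element: {root.tag}")
    protocols = list(root)
    if not protocols:
        raise ValueError("no protocols configured")
    # Tags look like "{urn:org:jgroups}UDP"; strip the namespace part.
    return protocols[0].tag.split("}", 1)[-1]
```

Read the contents of $JENKINS_HOME/jgroups.xml and pass them to `transport_of`; for the configuration shown above the result should be `UDP`.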

Using a newer HA version on top of Jenkins Enterprise

As of version 3.6 of the HA components, it is possible to run newer HA components with an older Jenkins Enterprise (such as 12.05). Just use the HA proxy WAR as usual, together with a matching version of the HA monitor if desired, and use the HA status plugin 3.6 or newer.

Using a stand-by node as a slave

Because the stand-by node does no real work while it is standing by, a good way to use this otherwise idle resource is as a slave. To do so, edit /etc/hosts on each node so that a common host name points to the other node in the cluster. For example, if alpha and bravo form a cluster, bravo lists alpha under the host name "other", and alpha lists bravo under the host name "other". You can then configure Jenkins with a new SSH slave that connects to "other".
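To make the /etc/hosts arrangement concrete, the sketch below computes the line each node of a two-node cluster would carry. The function name `other_entries` and the IP addresses in the usage note are placeholders chosen for illustration; substitute the real addresses of your nodes.

```python
from typing import Dict


def other_entries(nodes: Dict[str, str], alias: str = "other") -> Dict[str, str]:
    """Map each node name to the /etc/hosts line that node should contain.

    `nodes` maps node name -> IP address and must hold exactly two nodes.
    Each node's line points the shared alias at the *other* node's address.
    """
    if len(nodes) != 2:
        raise ValueError("expected exactly two nodes")
    (name_a, ip_a), (name_b, ip_b) = nodes.items()
    return {
        name_a: f"{ip_b}\t{alias}",
        name_b: f"{ip_a}\t{alias}",
    }
```

For instance, `other_entries({"alpha": "192.0.2.10", "bravo": "192.0.2.11"})` gives alpha a line pointing "other" at bravo's address, and bravo a line pointing "other" at alpha's address. Because both nodes share the same $JENKINS_HOME configuration, a single SSH slave definition for "other" then resolves to whichever node is currently standing by.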

Copying a Jenkins installation

From time to time you may want to create a copy of a Jenkins Enterprise installation. For example, you may wish to run a “staging” server where you test configuration changes such as new plugins, before deploying these changes to a “production” server.

To do so, you normally just make a copy of $JENKINS_HOME and run another Jenkins process with this home directory. But there are a few files specific to Jenkins Enterprise which you should not copy over into your alternate installation:

  • identity.key should not be copied, as it identifies the installation. The new installation needs its own license. You can obtain an evaluation license yourself for temporary use; contact sales for a permanent test license (discounted, but not free).
  • The jgroups/ subdirectory should not be copied. Otherwise the new installation may attempt to join an HA cluster with the old installation.
  • cluster-identity.txt, if present (only created if you have a custom jgroups.xml), should not be copied.
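The exclusions listed above can be applied automatically while copying. The following is a minimal sketch using Python's standard library; the function name `copy_jenkins_home` is made up for illustration, and the script is not an official tool. It skips the three entries only at the top level of the home directory, so a job or folder that happens to be named "jgroups" deeper in the tree is still copied.

```python
import shutil
from pathlib import Path

# Entries directly under $JENKINS_HOME that must not be carried over.
EXCLUDED_TOP_LEVEL = {"identity.key", "jgroups", "cluster-identity.txt"}


def copy_jenkins_home(src: str, dst: str) -> None:
    """Copy a Jenkins home directory to `dst`, leaving out installation-specific files.

    `dst` must not already exist; shutil.copytree creates it.
    """
    src_root = Path(src).resolve()

    def ignore(directory, names):
        # Only filter the top-level directory; copy everything below it unchanged.
        if Path(directory).resolve() == src_root:
            return [name for name in names if name in EXCLUDED_TOP_LEVEL]
        return []

    shutil.copytree(src_root, dst, ignore=ignore)
```

Call it with the path of the existing home directory and the path where the staging copy should be created, then start the second Jenkins process with that new directory as its home.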