### Eclipse Workspace Patch 1.0 #P docs-cluster_guide Index: en-US/Clustering_Guide_JGroups.xml =================================================================== --- en-US/Clustering_Guide_JGroups.xml (revision 93485) +++ en-US/Clustering_Guide_JGroups.xml (working copy) @@ -4,7 +4,21 @@ JGroups Services JGroups provides the underlying group communication support for - JBoss AS clusters. JBoss AS ships with a reasonable set of default JGroups + JBoss AS clusters. The way the AS's clustered services interact with + JGroups was covered previously in . + The focus of this chapter is on the details, particularly configuration + details and troubleshooting tips. This chapter is not intended to + be a complete set of JGroups documentation; we recommend that users + interested in in-depth JGroups knowledge also consult: + + + The JGroups project documentation at http://jgroups.org/ug.html + The JGroups wiki pages at jboss.org, rooted at https://www.jboss.org/community/wiki/JGroups + + + The first section of this chapter covers the many JGroups + configuration options in considerable detail. Readers should + understand that JBoss AS ships with a reasonable set of default JGroups configurations. Most applications just work out of the box with the default configurations. You only need to tweak them when you are deploying an application that has special network or performance requirements. @@ -12,8 +26,15 @@
Configuring a JGroups Channel's Protocol Stack The JGroups framework provides services to enable peer-to-peer communications between nodes in a - cluster. It is built on top a stack of network communication protocols that provide transport, - discovery, reliability and failure detection, and cluster membership management services. shows the protocol stack in JGroups. + cluster. Communication occurs over a communication channel. The channel is built up from + a stack of network communication "protocols", each of which is responsible + for adding a particular capability to the overall behavior of the channel. + Key capabilities provided by various protocols include, among others, transport, + cluster discovery, message ordering, loss-less message delivery, detection + of failed peers, and cluster membership management services. + + shows a conceptual cluster with + each member's channel composed of a stack of JGroups protocols.
Protocol stack in JGroups @@ -22,207 +43,352 @@
- JGroups configurations often appear as a nested attribute in cluster related MBean services, such as - the PartitionConfig attribute in the ClusterPartition MBean or the - ClusterConfig attribute in the TreeCache MBean. You can - configure the behavior and properties of each protocol in JGroups via those MBean attributes. Below is - an example JGroups configuration in the ClusterPartition MBean. - - - ... ... - - - - - - - - - - - - - - - - - - - - - - ]]> + + In this section of the chapter, we look into some of the most commonly + used protocols, with the protocols organized by the type of behavior they + add to the overall channel. For each protocol, we discuss a few key + configuration attributes exposed by the protocol, but generally speaking + changing configuration attributes is a matter for experts only. More important + for most readers will be to get a general understanding of the purpose of the + various protocols. + + The JGroups configurations used in the AS appear as nested elements in + the $JBOSS_HOME/server/all/cluster/jgroups-channelfactory.sar/META-INF/jgroups-channelfactory-stacks.xml + file. This file is parsed by the ChannelFactory service, which + uses the contents to provide appropriately configured channels to the AS + clustered services that need them. See + for more on the ChannelFactory service. + + Following is an example protocol stack configuration from + jgroups-channelfactory-stacks.xml: + + + + + + + + + + + + + + + + + + + + + ]]> - All the JGroups configuration data is contained in the <Config> element under the JGroups config MBean attribute. This information is used to configure a JGroups Channel; the Channel is conceptually similar to a socket, and manages communication between peers in a cluster. Each element inside the <Config> element defines a particular JGroups Protocol; each Protocol performs one function, and the combination of those functions is what defines the characteristics of the overall Channel. 
In the next several sections, we will dig into the commonly used protocols and their options and explain exactly what they mean. + All the JGroups configuration data is contained in the <config> element. This information is used to configure a JGroups Channel; the Channel is conceptually similar to a socket, and manages communication between peers in a cluster. Each element inside the <config> element defines a particular JGroups Protocol; each Protocol performs one function, and the combination of those functions is what defines the characteristics of the overall Channel. In the next several sections, we will dig into the commonly used protocols and their options and explain exactly what they mean.
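To make the layering concrete before diving into individual protocols, here is a skeletal stack shown for illustration only (all attributes elided; the real stacks in jgroups-channelfactory-stacks.xml carry many more attributes), with the transport protocol at the bottom of the <config> element and group membership near the top:

```xml
<!-- Skeletal stack for illustration only; not a usable configuration. -->
<config>
    <UDP singleton_name="udp"/>  <!-- transport: always the bottom protocol -->
    <PING/>                      <!-- discovery -->
    <MERGE2/>                    <!-- detection/merging of cluster splits -->
    <FD/>                        <!-- failure detection -->
    <VERIFY_SUSPECT/>            <!-- double-checking of suspected nodes -->
    <pbcast.NAKACK/>             <!-- lossless, ordered multicast delivery -->
    <UNICAST/>                   <!-- lossless, ordered unicast delivery -->
    <pbcast.GMS/>                <!-- group membership: near the top -->
</config>
```

Each of the protocols named here is discussed in the sections that follow.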
Common Configuration Properties - The following common properties are exposed by all of the JGroups protocols discussed below: + The following configuration property is exposed by all of the JGroups protocols discussed below: - - down_thread whether the protocol should create an internal queue and a queue processing thread (aka the down_thread) for messages passed down from higher layers. The higher layer could be another protocol higher in the stack, or the application itself, if the protocol is the top one on the stack. If true (the default), when a message is passed down from a higher layer, the calling thread places the message in the protocol's queue, and then returns immediately. The protocol's down_thread is responsible for reading messages off the queue, doing whatever protocol-specific processing is required, and passing the message on to the next protocol in the stack. - - - - up_thread is conceptually similar to down_thread, but here the queue and thread are for messages received from lower layers in the protocol stack. - - + stats whether the protocol should gather runtime statistics on its operations that can be exposed via tools like the AS's administration console or the JGroups Probe utility. What, if any, statistics are gathered depends on the protocol. Default is true. -Generally speaking, up_thread and down_thread should be set to false. - + + All of the protocols in the versions of JGroups used in JBoss AS 3.x and 4.x exposed + down_thread and up_thread attributes. + The JGroups version used in AS 5 and later no longer uses those attributes, + and a WARN message will be written to the server log if they are configured + for any protocol. +
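As a hypothetical fragment (not taken from the shipped stacks file), the stats attribute can be set independently on each protocol element in a stack:

```xml
<!-- Hypothetical fragment: gather statistics on the transport protocol
     but disable gathering on the discovery protocol. -->
<config>
    <UDP singleton_name="udp" stats="true"/>
    <PING stats="false"/>
</config>
```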
Transport Protocols - The transport protocols send messages from one cluster node to another (unicast) or from cluster - node to all other nodes in the cluster (mcast). JGroups supports UDP, TCP, and TUNNEL as transport - protocols. + The transport protocols are responsible for actually sending messages on the network and receiving them from the network. + They also manage the pools of threads that are + used to deliver incoming messages up the protocol stack. + JGroups supports UDP, TCP, and TUNNEL as transport protocols. - The UDP, TCP, and TUNNEL elements are + The UDP, TCP, and TUNNEL protocols are mutually exclusive. You can only have one transport protocol in each JGroups - Config element + config element.
UDP configuration - UDP is the preferred protocol for JGroups. UDP uses multicast or multiple unicasts to send and + UDP is the preferred protocol for JGroups. UDP uses multicast (or, + in an unusual configuration, multiple unicasts) to send and receive messages. If you choose UDP as the transport protocol for your cluster service, you need to configure it in the UDP sub-element in the JGroups - Config element. Here is an example. -]]> + config element. Here is an example. +]]> - The available attributes in the above JGroups configuration are discussed below. + First, we discuss the attributes that are particular to the UDP transport protocol. + Then we will cover those attributes shown above that are also + used by the TCP and TUNNEL transport protocols. + + The attributes particular to UDP are: + ip_mcast specifies whether or not to use IP - multicasting. The default is true. If set to false, it will send n unicast packets rather than 1 multicast packet. Either way, packets are UDP datagrams. - + multicasting. The default is true. If set to false, + for messages to the entire group UDP will send n unicast packets + rather than 1 multicast packet. Either way, packets are UDP datagrams. + - mcast_addr specifies the multicast address (class D) for joining a group (i.e., the cluster). If omitted, the default is 228.8.8.8 - . - + mcast_addr specifies the + multicast address (class D) for communicating with the group (i.e., the cluster). + The standard protocol stack configurations in JBoss AS use the + value of system property jboss.partition.udpGroup, + if set, as the value for this attribute. Using the -u + command line switch when starting JBoss AS sets that value. + See for how to use this configuration attribute + to ensure JGroups channels are properly isolated from one another. + If this attribute is omitted, the default is 228.8.8.8.
+ - mcast_port specifies the multicast port number. If omitted, the - default is 45566. + mcast_port specifies the port + to use for multicast communication with the group. + See for how to use this configuration attribute + to ensure JGroups channels are properly isolated from one another. + If this attribute is omitted, the default is 45566. + + + mcast_send_buf_size, mcast_recv_buf_size, ucast_send_buf_size, + ucast_recv_buf_size define socket receive and send buffer sizes that + JGroups will request from the operating system. It is good to + have a large receive buffer size, so packets are less likely to get dropped due to + buffer overflow. Note, however, that the size of socket buffers is + limited by OS limits, so actually obtaining the desired + buffer may require OS-level configuration. See + for further + details. + + + bind_port specifies the port to + which the unicast receive socket should be bound. The default is + 0; i.e. use an ephemeral port. - bind_addr specifies the interface on which to receive and send multicasts (uses the -Djgroups.bind_address system property, if present). If you have a multihomed machine, set the bind_addr attribute or system property to the appropriate NIC IP address. By default, system property setting takes priority over XML attribute unless -Djgroups.ignore.bind_addr system property is set. + port_range specifies the number of + ports to try if the port identified by bind_port + is not available. The default is 1, meaning + only try to bind to bind_port. - receive_on_all_interfaces specifies whether this node - should listen on all interfaces for multicasts. The default is false. - It overrides the bind_addr property for receiving multicasts. - However, bind_addr (if set) is still used to send multicasts. + ip_ttl specifies time-to-live for IP Multicast packets. 
TTL is the commonly used term in multicast networking, but is actually something of a misnomer, since the value here refers to how many network hops a packet will be allowed to travel before networking equipment will drop it. + - send_on_all_interfaces specifies whether this node send UDP packets via all the NICs if you have a multi NIC machine. This means that the same multicast message is sent N times, so use with care. - - - - - receive_interfaces specifies a list of of interfaces to receive multicasts on. The multicast receive socket will listen on all of these interfaces. This is a comma-separated list of IP addresses or interface names. E.g. "192.168.5.1,eth1,127.0.0.1". - - - - - ip_ttl specifies time-to-live for IP Multicast packets. TTL is the commonly used term in multicast networking, but is actually something of a misnomer, since the value here refers to how many network hops a packet will be allowed to travel before networking equipment will drop it. - + tos specifies the traffic class for sending unicast and multicast datagrams. + + + The attributes that are common to all transport protocols, + and thus have the same meanings when used with TCP or TUNNEL, are: + + - use_incoming_packet_handler specifies whether to use a separate thread to process incoming messages. Sometimes receivers are overloaded (they have to handle de-serialization etc). Packet handler is a separate thread taking care of de-serialization, receiver thread(s) simply put packet in queue and return immediately. Setting this to true adds one more thread. The default is true. + singleton_name provides + a unique name for this transport protocol configuration. Used by the AS ChannelFactory + to support sharing of a transport protocol instance by different channels + that use the same transport protocol configuration. See + . - use_outgoing_packet_handler specifies whether to use a separate thread to process outgoing messages. The default is false. 
+ bind_addr specifies the interface + on which to receive and send messages. + By default JGroups uses the value of system property jgroups.bind_addr, which + in turn can be easily set via the -b command line switch. + See for more on binding JGroups + sockets. - receive_on_all_interfaces specifies whether this node - should listen on all interfaces for multicasts. The default is false. - It overrides the bind_addr property for receiving multicasts. - However, bind_addr (if set) is still used to send multicasts. + receive_on_all_interfaces specifies whether this node + should listen on all interfaces for multicasts. The default is false. + It overrides the bind_addr property for receiving multicasts. + However, bind_addr (if set) is still used to send multicasts. + send_on_all_interfaces specifies whether this node sends UDP packets via all the NICs if you have a multi NIC machine. This means that the same multicast message is sent N times, so use with care. + + + + + receive_interfaces specifies a list of interfaces on which to receive multicasts. The multicast receive socket will listen on all of these interfaces. This is a comma-separated list of IP addresses or interface names. E.g. "192.168.5.1,eth1,127.0.0.1". + + + + + send_interfaces specifies a + list of interfaces via which to send multicasts. The multicast + sender socket will send on all of these interfaces. This is a + comma-separated list of IP addresses or interface names. E.g. + "192.168.5.1,eth1,127.0.0.1". This means that the + same multicast message is sent N times, so use with care. + + - enable_bundling specifies whether to enable message bundling.
+ If true, the transport protocol would queue outgoing messages until + max_bundle_size bytes have accumulated, or + max_bundle_time milliseconds have elapsed, whichever occurs + first. Then the transport protocol would bundle queued messages into one + large message and send it. The messages are + unbundled at the receiver. The default is false. + Message bundling can have significant performance benefits for channels + that are used for high volume sending of messages where the sender does + not block waiting for a response from recipients (e.g. a JBoss Cache + instance configured for REPL_ASYNC). It can add considerable latency + to applications where senders need to block waiting for responses, so + it is not recommended for usages like JBoss Cache REPL_SYNC. - loopback specifies whether to loop outgoing message - back up the stack. In unicast mode, the messages are sent to self. In - mcast mode, a copy of the mcast message is sent. The default is false + loopback specifies whether the thread sending a message + to the group should itself carry the message back up the stack for delivery. (Messages sent to + the group are always delivered to the sending node as well.) If + false the sending thread does not carry the message; + rather the transport protocol waits to read the message off the network + and uses one of the message delivery pool threads to deliver it. + The default is false, however the current + recommendation is to always set this to true in order + to ensure the channel receives its own messages in case the network + interface goes down.
+ discard_incompatible_packets specifies whether to + discard packets sent by peers using a different JGroups version. Each message in the cluster is tagged + with a JGroups version. When a message from a different version of JGroups is received, + it will be silently discarded if this is set to true, otherwise a warning will be logged. In no case + will the message be delivered. The default is false - - tos specifies traffic class for sending unicast and multicast datagrams. - - + + enable_diagnostics specifies that the transport should open a + multicast socket on address diagnostics_addr and port + diagnostics_port to listen for diagnostic requests + sent by JGroups' Probe utility. + + + + + The various thread_pool + attributes configure the behavior of the pool of threads JGroups uses + to carry ordinary incoming messages up the stack. The various attributes + end up providing the constructor arguments for an instance of + java.util.concurrent.ThreadPoolExecutor. + In the example above, the pool will have a core (i.e. minimum) size + of 8 threads, and a maximum size of 200 threads. If more than + 8 pool threads have been created, a thread returning from carrying + a message will wait for up to 5000 ms to be assigned a new message to + carry, after which it will terminate. If no threads are available to + carry a message, the (separate) thread reading messages off the socket + will place messages in a queue; the queue will hold up to 1000 messages. + If the queue is full, the thread reading messages off the socket will + discard the message. + + + + The various oob_thread_pool attributes + are similar to the thread_pool attributes in that + they configure a java.util.concurrent.ThreadPoolExecutor + used to carry incoming messages up the protocol stack. In this case, + the pool is used to carry a special type of message known as an "Out-Of-Band" + message, OOB for short.
OOB messages are exempt from the ordered-delivery + requirements of protocols like NAKACK and UNICAST and thus can be delivered + up the stack even if NAKACK or UNICAST are queueing up messages from + a particular sender. OOB messages are often used internally by JGroups + protocols and can be used by applications as well. JBoss Cache in REPL_SYNC + mode, for example, uses OOB messages for the second phase of its + two-phase-commit protocol. + - - On Windows 2000 machines, because of the media sense feature being broken with multicast - (even after disabling media sense), you need to set the UDP protocol's - loopback attribute to true. -
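The thread pool discussion above corresponds to attributes on the transport protocol element along these lines (a sketch assuming the attribute names used by the AS stacks file; the values mirror the worked example in the text, and the authoritative values are those shipped in jgroups-channelfactory-stacks.xml):

```xml
<!-- Sketch of the incoming-message thread pool attributes described above:
     core size 8, maximum 200, 5000 ms keep-alive, a 1000-message queue,
     and discard when the queue is full. -->
<UDP singleton_name="udp"
     thread_pool.enabled="true"
     thread_pool.min_threads="8"
     thread_pool.max_threads="200"
     thread_pool.keep_alive_time="5000"
     thread_pool.queue_enabled="true"
     thread_pool.queue_max_size="1000"
     thread_pool.rejection_policy="discard"
     oob_thread_pool.enabled="true"
     oob_thread_pool.min_threads="1"
     oob_thread_pool.max_threads="8"/>
```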
@@ -232,31 +398,23 @@ TCP generates more network traffic when the cluster size increases. TCP is fundamentally a unicast protocol. To send multicast messages, JGroups uses multiple TCP unicasts. To use TCP as a transport protocol, you should define a TCP element - in the JGroups Config element. Here is an example of the + in the JGroups config element. Here is an example of the TCP element. -<TCP start_port="7800" - bind_addr="192.168.5.1" - loopback="true" - down_thread="false" up_thread="false"/> +<TCP singleton_name="tcp" + start_port="7800" end_port="7800"/> - Below are the attributes available in the TCP element. + Below are the attributes that are specific to the TCP protocol. - bind_addr specifies the binding address. It can also - be set with the -Djgroups.bind_address command line option at server - startup. - - start_port, end_port define the range of TCP ports - the server should bind to. The server socket is bound to the first available port from - start_port. If no available port is found (e.g., because of a - firewall) before the end_port, the server throws an exception. If no end_port is provided or end_port < start_port then there is no upper limit on the port range. If start_port == end_port, then we force JGroups to use the given port (start fails if port is not available). The default is 7800. If set to 0, then the operating system will pick a port. Please, bear in mind that setting it to 0 will work only if we use MPING or TCPGOSSIP as discovery protocol because TCCPING requires listing the nodes and their corresponding ports. + the server should bind to. The server socket is bound to the first available port beginning with + start_port. If no available port is found (e.g., because of other + sockets already using the ports) before the end_port, the server throws an exception. If no end_port is provided or end_port < start_port then there is no upper limit on the port range. 
If start_port == end_port, then we force JGroups to use the given port (start fails if port is not available). The default is 7800. If set to 0, then the operating system will pick a port. Please bear in mind that setting it to 0 will work only if we use MPING or TCPGOSSIP as the discovery protocol, because TCPPING requires listing the nodes and their corresponding ports. - loopback specifies whether to loop outgoing message - back up the stack. In unicast mode, the messages are sent to self. In - mcast mode, a copy of the mcast message is sent. The default is false. + bind_port in TCP is just an alias for start_port; if + configured, it internally sets start_port. recv_buf_size, send_buf_size define receive and send buffer sizes. It is good to have a large receive buffer size, so packets are less likely to get dropped due to buffer overflow. @@ -289,19 +447,24 @@ + + + All of the attributes common to all protocols discussed in + the UDP protocol section also apply to TCP. +
TUNNEL configuration - The TUNNEL protocol uses an external router to send messages. The external router is known as - a GossipRouter. Each node has to register with the router. All messages are sent to the router and forwarded on to their destinations. The TUNNEL approach can be used to setup communication with nodes behind firewalls. A node can establish a TCP connection to the GossipRouter through the firewall (you can use port 80). The same connection is used by the router to send messages to nodes behind the firewall as most firewalls do not permit outside hosts to initiate a TCP connection to a host inside the firewall. The TUNNEL configuration is defined in the TUNNEL element in the JGroups Config element. Here is an example.. + The TUNNEL protocol uses an external router process to send messages. The external router is a Java process running the + org.jgroups.stack.GossipRouter main class. Each node has to register with the router. All messages are sent to the router and forwarded on to their destinations. The TUNNEL approach can be used to set up communication with nodes behind firewalls. A node can establish a TCP connection to the GossipRouter through the firewall (you can use port 80). The same connection is used by the router to send messages to nodes behind the firewall as most firewalls do not permit outside hosts to initiate a TCP connection to a host inside the firewall. The TUNNEL configuration is defined in the TUNNEL element in the JGroups config element. Here is an example. -<TUNNEL router_port="12001" - router_host="192.168.5.1" - down_thread="false" up_thread="false/> +<TUNNEL singleton_name="tunnel" + router_port="12001" + router_host="192.168.5.1"/> @@ -316,10 +479,16 @@ GossipRouter is listening. - loopback specifies whether to loop messages back up - the stack. The default is true. + reconnect_interval specifies the interval in + ms at which TUNNEL will attempt to connect to the GossipRouter if the + connection is not established.
Default is 5000. + + + All of the attributes common to all protocols discussed in + the UDP protocol section also apply to TUNNEL. +
@@ -329,10 +498,17 @@
Discovery Protocols - The cluster needs to maintain a list of current member nodes at all times so that the load balancer and client interceptor know how to route their requests. Discovery protocols are used to discover active nodes in the cluster and detect the oldest member of the cluster, which is the coordinator. All initial nodes are discovered when the cluster starts up. - When a new node joins the cluster later, it is only discovered after the group membership protocol - (GMS, see ) admits it into the group. - Since the discovery protocols sit on top of the transport protocol, you can choose to use different discovery protocols based on your transport protocol. These are also configured as sub-elements in the JGroups MBean Config element. + When a channel on one node connects, it needs to discover what other nodes + have compatible channels running and which of those nodes is currently + serving as the "coordinator" responsible for allowing new nodes to + join the group. Discovery protocols are used to discover active nodes in + the cluster and determine which is the coordinator. This information is + then provided to the group membership protocol (GMS, see ) + which communicates with the coordinator node's GMS to bring the newly + connecting node into the group. + Discovery protocols also help merge protocols (see ) + detect cluster-split situations. + Since the discovery protocols sit on top of the transport protocol, you can choose to use different discovery protocols based on your transport protocol. These are also configured as sub-elements in the JGroups config element. @@ -346,19 +522,15 @@ - -<PING timeout="2000" - num_initial_members="2" - down_thread="false" up_thread="false"/> + <PING timeout="2000" + num_initial_members="3"/> Here is another example PING configuration for contacting a Gossip Router. -]]> + timeout="2000" + num_initial_members="3"/>]]>
- initial_hosts is a comma-seperated list of addresses - (e.g., host1[12345],host2[23456]), which are pinged for - discovery. + initial_hosts is a comma-separated list of addresses/ports + (e.g., host1[12345],host2[23456]) which are pinged for + discovery. Default is null, meaning multicast + discovery should be used. If initial_hosts + is specified, all possible cluster + members must be listed, not just a few "well known hosts"; + otherwise discovery of cluster + splits by MERGE2 will not work reliably. If both gossip_host and gossip_port are defined, the @@ -409,11 +586,9 @@ The TCPGOSSIP protocol only works with a GossipRouter. It works essentially the same way as the PING protocol configuration with valid gossip_host and gossip_port attributes. It works on top of both UDP and TCP transport protocols. Here is an example. -]]> +]]> @@ -428,8 +603,8 @@ responses to wait for unless timeout has expired. The default is 2. - initial_hosts is a comma-seperated list of addresses - (e.g., host1[12345],host2[23456]) for GossipRouters to register + initial_hosts is a comma-separated list of addresses/ports + (e.g., host1[12345],host2[23456]) of GossipRouters to register with. @@ -439,16 +614,14 @@
TCPPING - The TCPPING protocol takes a set of known members and ping them for discovery. This is + The TCPPING protocol takes a set of known members and pings them for discovery. This is essentially a static configuration. It works on top of TCP. Here is an example of the - TCPPING configuration element in the JGroups Config + TCPPING configuration element in the JGroups config element. - -<TCPPING timeout="2000" - initial_hosts="hosta[2300],hostb[3400],hostc[4500]" - port_range="3" - num_initial_members="3" - down_thread="false" up_thread="false"/> + <TCPPING timeout="2000" + num_initial_members="3" + initial_hosts="hosta[2300],hostb[3400],hostc[4500]" + port_range="3"/> @@ -463,12 +636,17 @@ responses to wait for unless timeout has expired. The default is 2. - initial_hosts is a comma-seperated list of addresses - (e.g., host1[12345],host2[23456]) for pinging. + initial_hosts is a comma-separated list of addresses/ports + (e.g., host1[12345],host2[23456]) which are pinged for + discovery. All possible cluster + members must be listed, not just a few "well known hosts"; otherwise discovery of cluster + splits by MERGE2 will not work reliably. - port_range specifies the number of consecutive ports to be probed when getting the initial membership, starting with the port specified in the initial_hosts parameter. + port_range specifies the number of consecutive ports to be probed when getting the initial membership, starting with the port specified in the initial_hosts parameter.
Given the current values of port_range and initial_hosts above, the TCPPING layer will try to connect to hosta[2300], hosta[2301], hosta[2302], hostb[3400], hostb[3401], hostb[3402], hostc[4500], hostc[4501], hostc[4502]. This configuration option allows for multiple possible ports on the same host to be pinged without having to spell out all of the combinations. + If in your TCP protocol configuration your end_port is greater than your start_port, using a TCPPING port_range equal to the difference is advised in order to ensure + a node is pinged no matter which port in the allowed range it ended up bound to. @@ -480,17 +658,24 @@
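One plausible pairing of the TCP and TCPPING settings, shown as an illustrative sketch (hostnames hypothetical; exact port_range semantics may vary by JGroups version, so verify against the behavior described above):

```xml
<!-- Illustrative pairing: TCP may bind to any port in 7800-7802, so
     TCPPING probes three consecutive ports per listed host
     (7800, 7801, 7802), covering whichever port a node bound to. -->
<config>
    <TCP singleton_name="tcp" start_port="7800" end_port="7802"/>
    <TCPPING timeout="2000"
             num_initial_members="3"
             initial_hosts="hosta[7800],hostb[7800],hostc[7800]"
             port_range="3"/>
</config>
```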
MPING - MPING uses IP multicast to discover the initial membership. It can be used with all transports, but usually this is used in combination with TCP. TCP usually requires TCPPING, which has to list all group members explicitly, but MPING doesn't have this requirement. The typical use case for this is when we want TCP as transport, but multicasting for discovery so we don't have to define a static list of initial hosts in TCPPING or require external Gossip Router. + MPING uses IP multicast to discover the initial membership. Unlike the + other discovery protocols, which delegate the sending and receiving of + discovery messages on the network to the transport protocol, MPING + opens its own sockets to send and receive multicast discovery messages. + As a result it can be used with all transports. But it is usually used + in combination with TCP. TCP usually requires TCPPING, which has to list + all possible group members explicitly, while MPING doesn't have this + requirement. The typical use case for MPING is when we want TCP for + regular message transport, but UDP multicasting is allowed for discovery. <MPING timeout="2000" + num_initial_members="3" bind_to_all_interfaces="true" mcast_addr="228.8.8.8" mcast_port="7500" - ip_ttl="8" - num_initial_members="3" - down_thread="false" up_thread="false"/> + ip_ttl="8"/> @@ -506,7 +691,11 @@ bind_addr specifies the interface on which to send - and receive multicast packets. + and receive multicast packets. + By default JGroups uses the value of system property jgroups.bind_addr, which + in turn can be easily set via the -b command line switch. + See for more on binding JGroups + sockets. bind_to_all_interfaces overrides the @@ -524,22 +713,21 @@
Failure Detection Protocols - The failure detection protocols are used to detect failed nodes. Once a failed node is detected, a suspect verification phase can occur after which, if the node is still considered dead, the cluster updates its view so that the load balancer and client interceptors know to avoid the dead node. The failure detection protocols are configured as sub-elements in the JGroups MBean - Config element. + The failure detection protocols are used to detect failed nodes. Once a failed node is detected, a suspect verification phase can occur after which, if the node is still considered dead, the cluster updates its membership view so that further messages are not sent to the failed node and the service using JGroups is aware the node is no longer part of the cluster. The failure detection protocols are configured as sub-elements in the JGroups + config element.
FD - FD is a failure detection protocol based on heartbeat messages. This protocol requires each node to periodically send are-you-alive messages to its neighbour. If the neighbour fails to respond, the calling node sends a SUSPECT message to the cluster. The current group coordinator can optionally double check whether the suspected node is indeed dead after which, if the node is still considered dead, updates the cluster's view. Here is an example FD configuration. + FD is a failure detection protocol based on heartbeat messages. This protocol requires each node to periodically send an are-you-alive message to its neighbor. If the neighbor fails to respond, the calling node sends a SUSPECT message to the cluster. The current group coordinator can optionally double check whether the suspected node is indeed dead (see VERIFY_SUSPECT below) after which, if the node is still considered dead, it updates the cluster's membership view. Here is an example FD configuration. -<FD timeout="2000" - max_tries="3" - shun="true" - down_thread="false" up_thread="false"/> +<FD timeout="6000" + max_tries="5" + shun="true"/> @@ -554,14 +742,15 @@ are-you-alive messages from a node before the node is suspected. The default is 2. - shun specifies whether a failed node will be shunned. - Once shunned, the node will be expelled from the cluster even if it comes back later. - The shunned node would have to re-join the cluster through the discovery process. JGroups allows to configure itself such that shunning leads to automatic rejoins and state transfer, which is the default behaivour within JBoss Application Server. + shun specifies whether a failed node will be shunned, i.e. not allowed + to send messages to the group without formally rejoining. A shunned node would have to re-join the cluster + through the discovery process. JGroups allows applications to configure a channel such that shunning leads + to automatic rejoins and state transfer. 
This is the default behavior within JBoss Application Server. - Regular traffic from a node counts as if it is a live. So, the are-you-alive messages are - only sent when there is no regular traffic to the node for sometime. + Regular traffic from a node counts as if it is a heartbeat response. So, the are-you-alive messages are + only sent when there is no regular traffic to the node for some time.
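To get a feel for how the timeout and max_tries attributes interact, here is a back-of-the-envelope sketch in plain Python (illustrative only, not JGroups code); it assumes the worst case in which every are-you-alive message waits out the full timeout:

```python
def time_to_suspicion(timeout_ms, max_tries):
    """Worst-case time (ms) before FD suspects an unresponsive neighbor:
    max_tries consecutive are-you-alive messages, each waiting timeout_ms
    for a response before the next attempt."""
    return timeout_ms * max_tries
```

With the example configuration above (timeout=6000, max_tries=5), a hung-but-not-crashed neighbor is suspected after roughly 30 seconds.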
@@ -571,17 +760,21 @@
FD_SOCK -FD_SOCK is a failure detection protocol based on a ring of TCP sockets created between group members. Each member in a group connects to its neighbor (last member connects to first) thus forming a ring. Member B is suspected when its neighbor A detects abnormally closed TCP socket (presumably due to a node B crash). However, if a member B is about to leave gracefully, it lets its neighbor A know, so that it does not become suspected. The simplest FD_SOCK configuration does not take any attribute. You can just declare an empty FD_SOCK element in JGroups's Config element. +FD_SOCK is a failure detection protocol based on a ring of TCP sockets created between group members. Each member in a group connects to its neighbor (last member connects to first) thus forming a ring. Member B is suspected when its neighbor A detects an abnormally closed TCP socket (presumably due to a node B crash). However, if a member B is about to leave gracefully, it lets its neighbor A know, so that it does not become suspected. The simplest FD_SOCK configuration does not take any attribute. You can just declare an empty FD_SOCK element in JGroups's config element. -<FD_SOCK_down_thread="false" up_thread="false"/> +<FD_SOCK/> The available attributes in the FD_SOCK element are listed below. - bind_addr specifies the interface to which the server socket should bind to. If -Djgroups.bind_address system property is defined, XML value will be ignore. This behaivour can be reversed setting -Djgroups.ignore.bind_addr=true system property. + bind_addr specifies the interface to which the server socket should be bound. + By default JGroups uses the value of system property jgroups.bind_addr, which + in turn can be easily set via the -b command line switch. + See for more on binding JGroups + sockets. @@ -595,17 +788,16 @@ ]]> +]]> - The available attributes in the FD_SOCK element are listed below. + The available attributes in the VERIFY_SUSPECT element are listed below. 
- timeout specifies how long to wait for a response from the suspected member before considering it dead. + timeout specifies how long to wait for a response from the suspected member before considering it dead. @@ -621,8 +813,6 @@ FD - - @@ -645,11 +835,9 @@ + - FD_SOCK: - - @@ -683,9 +871,11 @@ + + - The aim of a failure detection layer is to report real failures and therefore avoid false suspicions. There are two solutions: + The aim of a failure detection layer is to promptly report real failures and yet avoid false suspicions. There are two solutions: @@ -699,17 +889,16 @@ - - ]]> + + +]]> - This suspects a member when the socket to the neighbor has been closed abonormally (e.g. process crash, because the OS closes all sockets). However, f a host or switch crashes, then the sockets won't be closed, therefore, as a seond line of defense, FD will suspect the neighbor after 50 seconds. Note that with this example, if you have your system stopped in a breakpoint in the debugger, the node you're debugging will be suspected after ca 50 seconds. + This suspects a member when the socket to the neighbor has been closed abnormally (e.g. a process crash, because the OS closes all sockets). However, if a host or switch crashes, then the sockets won't be closed, so, as a second line of defense, FD will suspect the neighbor after 30 seconds. Note that with this example, if you have your system stopped in a breakpoint in the debugger, the node you're debugging will be suspected after roughly 30 seconds. - A combination of FD and FD_SOCK provides a solid failure detection layer and for this reason, such technique is used accross JGroups configurations included within JBoss Application Server. + A combination of FD and FD_SOCK provides a solid failure detection layer and for this reason, such a technique is used across JGroups configurations included within JBoss Application Server.
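The ring arrangement FD_SOCK uses can be sketched as follows (plain Python, purely illustrative; the member names are invented):

```python
def ring_neighbors(members):
    """Each member monitors the next member in the list via a TCP socket;
    the last member wraps around to the first, forming the ring
    described above."""
    return {m: members[(i + 1) % len(members)] for i, m in enumerate(members)}
```

For members A, B, C this yields A watching B, B watching C, and C wrapping around to watch A; a crash of B abnormally closes the A-B socket, so A raises the suspicion.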
@@ -719,7 +908,7 @@
Reliable Delivery Protocols - Reliable delivery protocols within the JGroups stack ensure that data pockets are actually delivered in the right order (FIFO) to the destination node. The basis for reliable message delivery is positive and negative delivery acknowledgments (ACK and NAK). In the ACK mode, the sender resends the message until the acknowledgment is received from the receiver. In the NAK mode, the receiver requests retransmission when it discovers a gap. + Reliable delivery protocols within the JGroups stack ensure that messages are actually delivered and delivered in the right order (FIFO) to the destination node. The basis for reliable message delivery is positive and negative delivery acknowledgments (ACK and NAK). In the ACK mode, the sender resends the message until the acknowledgment is received from the receiver. In the NAK mode, the receiver requests retransmission when it discovers a gap. @@ -727,11 +916,10 @@
UNICAST - The UNICAST protocol is used for unicast messages. It uses ACK. It is configured as a sub-element under the JGroups Config element. If the JGroups stack is configured with TCP transport protocol, UNICAST is not necessary because TCP itself guarantees FIFO delivery of unicast messages. Here is an example configuration for the UNICAST protocol. + The UNICAST protocol is used for unicast messages. It uses positive acknowledgements (ACK). It is configured as a sub-element under the JGroups config element. If the JGroups stack is configured with the TCP transport protocol, UNICAST is not necessary because TCP itself guarantees FIFO delivery of unicast messages. Here is an example configuration for the UNICAST protocol: -<UNICAST timeout="100,200,400,800" -down_thread="false" up_thread="false"/> +<UNICAST timeout="300,600,1200,2400,3600"/> There is only one configurable attribute in the UNICAST element. @@ -740,7 +928,14 @@ timeout specifies the retransmission timeout (in milliseconds). For instance, if the timeout is "100,200,400,800", the sender resends the message if it hasn't received an ACK after 100 ms the first time, and the second time it - waits for 200 ms before resending, and so on. + waits for 200 ms before resending, and so on. A low value for the first timeout + allows for prompt retransmission of dropped messages, but at the potential + cost of unnecessary retransmissions if messages aren't actually lost, + but rather ACKs just aren't received before the timeout. High values + (e.g. "1000,2000,3000") can improve performance if the network has + been tuned such that UDP datagram losses are infrequent. High values + on lossy networks will hurt performance since later messages will not be + delivered until lost messages are retransmitted.
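The effect of a timeout series like "100,200,400,800" can be sketched as cumulative retransmission times (plain Python, illustrative only; that the last interval keeps being reused once the series is exhausted is an assumption of this sketch, not something the text above states):

```python
def retransmit_schedule(timeouts, attempts):
    """Cumulative times (ms) at which a sender would retransmit an
    unacknowledged message, given a UNICAST-style timeout series.
    After the series is exhausted, the last interval is reused."""
    schedule, elapsed = [], 0
    for i in range(attempts):
        interval = timeouts[min(i, len(timeouts) - 1)]
        elapsed += interval
        schedule.append(elapsed)
    return schedule
```

For "100,200,400,800" the first four retransmissions land at 100, 300, 700 and 1500 ms after the original send, which is the back-off behavior the attribute description above walks through.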
@@ -748,48 +943,49 @@
NAKACK - The NAKACK protocol is used for multicast messages. It uses NAK. Under this protocol, each - message is tagged with a sequence number. The receiver keeps track of the sequence numbers and - deliver the messages in order. When a gap in the sequence number is detected, the receiver asks - the sender to retransmit the missing message. The NAKACK protocol is configured as the - pbcast.NAKACK sub-element under the JGroups Config + The NAKACK protocol is used for multicast messages. It uses negative acknowledgements (NAK). Under this protocol, each + message is tagged with a sequence number. The receiver keeps track of the received sequence numbers and + delivers the messages in order. When a gap in the series of received sequence numbers is detected, the receiver + schedules a task to periodically ask the sender to retransmit the missing message. The task + is cancelled if the missing message is received. The NAKACK protocol is configured as the + pbcast.NAKACK sub-element under the JGroups config element. Here is an example configuration. -<pbcast.NAKACK max_xmit_size="60000" use_mcast_xmit="false" - +<pbcast.NAKACK max_xmit_size="60000" use_mcast_xmit="false" retransmit_timeout="300,600,1200,2400,4800" gc_lag="0" - discard_delivered_msgs="true" - down_thread="false" up_thread="false"/> + discard_delivered_msgs="true"/> The configurable attributes in the pbcast.NAKACK element are as follows. - retransmit_timeout specifies the retransmission - timeout (in milliseconds). It is the same as the timeout attribute in - the UNICAST protocol. + retransmit_timeout specifies the series of timeouts (in milliseconds) after which retransmission + is requested if a missing message has not yet been received. use_mcast_xmit determines whether the sender should - send the retransmission to the entire cluster rather than just the node requesting it. - This is useful when the sender drops the pocket -- so we do not need to retransmit for - each node. 
+ send the retransmission to the entire cluster rather than just to the node requesting it. + This is useful when the sender's network layer tends to drop packets, + avoiding the need to individually retransmit to each node. - max_xmit_size specifies maximum size for a bundled - retransmission, if multiple packets are reported missing. + max_xmit_size specifies the maximum size (in bytes) for a bundled + retransmission, if multiple messages are reported missing. discard_delivered_msgs specifies whether to discard - delivery messages on the receiver nodes. By default, we save all delivered messages. - However, if we only ask the sender to resend their messages, we can enable this option + delivered messages on the receiver nodes. By default, nodes save delivered messages so + any node can retransmit a lost message in case the original sender has crashed + or left the group. However, if we only ask the sender to resend their messages, we can enable this option and discard delivered messages. - gc_lag specifies the number of messages garbage collection lags behind. + gc_lag specifies the number of messages to keep in memory for retransmission + even after the periodic cleanup protocol (see ) indicates all peers have received the message. + Default is 20. @@ -800,14 +996,13 @@ The group membership service (GMS) protocol in the JGroups stack maintains a list of active nodes. It handles the requests to join and leave the cluster. It also handles the SUSPECT messages sent by failure - detection protocols. All nodes in the cluster, as well as the load balancer and client side - interceptors, are notified if the group membership changes. The group membership service is + detection protocols. All nodes in the cluster, as well as any interested + services like JBoss Cache or HAPartition, are notified if the group membership changes. The group membership service is configured in the pbcast.GMS sub-element under the JGroups - Config element. Here is an example configuration. 
+ config element. Here is an example configuration. <pbcast.GMS print_local_addr="true" join_timeout="3000" - down_thread="false" up_thread="false" join_retry_timeout="2000" shun="true" view_bundling="true"/> @@ -822,23 +1017,26 @@ milliseconds to wait for a new node JOIN request to succeed. Retry afterwards. - join_retry_timeout specifies the maximum number of - milliseconds to wait after a failed JOIN to re-submit it. + join_retry_timeout specifies the number of + milliseconds to wait after a failed JOIN before trying again. - print_local_addr specifies whether to dump the node's - own address to the output when started. + print_local_addr specifies whether to dump the node's + own address to the standard output when starting. - shun specifies whether a node should shun itself if - it receives a cluster view that it is not a member node. + shun specifies whether a node should shun + (i.e. disconnect) itself if it receives a cluster view in which it is not a member node. disable_initial_coord specifies whether to prevent - this node as the cluster coordinator. + this node from becoming the cluster coordinator + during initial connection of the channel. + This flag does not prevent a node from becoming coordinator later, if + the current coordinator leaves the group. - view_bundling specifies whether multiple JOIN or LEAVE request arriving at the same time are bundled and handled together at the same time, only sending out 1 new view / bundle. This is is more efficient than handling each request separately. + view_bundling specifies whether multiple JOIN or LEAVE requests arriving at the same time are bundled and handled together at the same time, resulting in only 1 new view incorporating all changes. This is more efficient than handling each request separately. @@ -850,20 +1048,20 @@ Flow Control (FC) The flow control (FC) protocol tries to adapt the data sending rate to the data receipt rate among nodes. 
If a sender node is too fast, it - might overwhelm the receiver node and result in dropped packets that - have to be retransmitted. In JGroups, the flow control is implemented via a + might overwhelm the receiver node and result in out-of-memory conditions + or dropped packets that have to be retransmitted. In JGroups, flow control is implemented via a credit-based system. The sender and receiver nodes have the same number of credits (bytes) to start with. The sender subtracts credits by the number of bytes in messages it sends. The receiver accumulates credits for the bytes in the messages it receives. When the sender's credit - drops to a threshold, the receivers sends some credit to the sender. If the sender's credit is + drops to a threshold, the receivers send some credit to the sender. If the sender's credit is used up, the sender blocks until it receives credits from the receiver. The flow control protocol is configured in the FC sub-element under the JGroups - Config element. Here is an example configuration. + config element. Here is an example configuration. -<FC max_credits="1000000" -down_thread="false" up_thread="false" - min_threshold="0.10"/> +<FC max_credits="2000000" + min_threshold="0.10" + ignore_synchronous_response="true"/> @@ -874,45 +1072,47 @@ (in bytes). This value should be smaller than the JVM heap size. - min_credits specifies the threshold credit on the - sender, below which the receiver should send in more credits. + min_credits specifies the number of bytes the + receipt of which should trigger the receiver to send more credits to the sender. + + + min_threshold specifies the percentage of the + max_credits that should be used to calculate min_credits. + Setting this overrides the min_credits attribute. - min_threshold specifies percentage value of the - threshold. It overrides the min_credits attribute. 
+ ignore_synchronous_response specifies whether threads that have + carried messages up to the application should be allowed to carry outgoing messages back down + through FC without blocking for credits. The term "synchronous response" + refers to the fact that such an outgoing message is typically a response + to an incoming RPC-type message. Preventing the threads JGroups uses + to carry messages up from blocking in FC helps avoid certain + deadlock scenarios, so a value of true is recommended. - -Note - - Applications that use synchronous group RPC calls primarily do not require FC protocol in their JGroups protocol stack because synchronous communication, where the hread that makes the call blocks waiting for responses from all the members of the group, already slows overall rate of calls. Even though TCP provides flow control by itself, FC is still required in TCP based JGroups stacks because of group communication, where we essentially have to send group messages at the highest speed the slowest receiver can keep up with. TCP flow control only takes into account individual node communications and has not a notion of who's the slowest in the group, which is why FC is required. - - -
+ Why is FC needed on top of TCP? TCP has its own flow control! - The reason is group communication, where we essentially have to send group messages at the highest speed the slowest receiver can keep up with. Let's say we have a cluster {A,B,C,D}. D is slow (maybe overloaded), the rest is fast. When A sends a group message, it establishes TCP connections A-A (conceptually), A-B, A-C and A-D (if they don't yet exist). So let's say A sends 100 million messages to the cluster. Because TCP's flow control only applies to A-B, A-C and A-D, but not to A-{B,C,D}, where {B,C,D} is the group, it is possible that A, B and C receive the 100M, but D only received 1M messages. (BTW: this is also the reason why we need NAKACK, although TCP does its own retransmission). + The reason is group communication, where we essentially have to send group messages at the highest speed the slowest receiver can keep up with. Let's say we have a cluster {A,B,C,D}. D is slow (maybe overloaded), the rest are fast. When A sends a group message, it uses TCP connections A-A (conceptually), A-B, A-C and A-D. So let's say A sends 100 million messages to the cluster. Because TCP's flow control only applies to A-B, A-C and A-D, but not to A-{B,C,D}, where {B,C,D} is the group, it is possible that A, B and C receive the 100M, but D only received 1M messages. (By the way, this is also the reason why we need NAKACK, even though TCP does its own retransmission). - Now JGroups has to buffer all messages in memory for the case when the original sender S dies and a node asks for retransmission of a message of S. Because all members buffer all messages they received, they need to purge stable messages (= messages seen by everyone) every now and then. This is done by the STABLE protocol, which can be configured to run the stability protocol round time based (e.g. every 50s) or size based (whenever 400K data has been received). 
- - + Now JGroups has to buffer all messages in memory for the case when the original sender S dies and a node asks for retransmission of a message sent by S. Because all members buffer all messages they received, they need to purge stable messages (i.e. messages seen by everyone) every now and then. (This purging process is managed by the STABLE protocol; see ). In the above case, the slow node D will prevent the group from purging messages above 1M, so every member will buffer 99M messages! This in most cases leads to OOM exceptions. Note that - although the sliding window protocol in TCP will cause writes to block if the window is full - we assume in the above case that this is still much faster for A-B and A-C than for A-D. - So, in summary, we need to send messages at a rate the slowest receiver (D) can handle. + So, in summary, even with TCP we need FC to ensure we send messages at a rate the slowest receiver (D) can handle.
+ -
+ So do I always need FC? This depends on how the application uses the JGroups channel. Referring to the example above, if there was something about the application that would naturally cause A to slow down its rate of sending because D wasn't keeping up, then FC would not be needed. - A good example of such an application is one that makes synchronous group RPC calls (typically using a JGroups RpcDispatcher.) By synchronous, we mean the thread that makes the call blocks waiting for responses from all the members of the group. In that kind of application, the threads on A that are making calls would block waiting for responses from D, thus naturally slowing the overall rate of calls. + A good example of such an application is one that uses JGroups to make synchronous group RPC calls. By synchronous, we mean the thread that makes the call blocks waiting for responses from all the members of the group. In that kind of application, the threads on A that are making calls would block waiting for responses from D, thus naturally slowing the overall rate of calls. A JBoss Cache cluster configured for REPL_SYNC is a good example of an application that makes synchronous group RPC calls. If a channel is only used for a cache configured for REPL_SYNC, we recommend you remove FC from its protocol stack. @@ -923,7 +1123,7 @@ Another case where FC may not be needed is for a channel used by a JBoss Cache configured for buddy replication and a single buddy. Such a channel will in many respects act like a two node cluster, where messages are only exchanged with one other node, the buddy. (There may be other messages related to data gravitation that go to all members, but in a properly engineered buddy replication use case these should be infrequent. But if you remove FC be sure to load test your application.) -
+ @@ -932,10 +1132,10 @@
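The credit-based throttling this section describes can be modeled in a few lines of plain Python (a toy model, not the FC implementation; real FC also handles thresholds, multiple receivers, and replenishment messages over the network):

```python
class CreditSender:
    """Toy model of FC's credit-based throttling: the sender starts with
    max_credits bytes of credit, spends credit on every send, and must
    block once credit runs out until receivers replenish it."""
    def __init__(self, max_credits):
        self.credits = max_credits

    def can_send(self, nbytes):
        return self.credits >= nbytes

    def send(self, nbytes):
        if not self.can_send(nbytes):
            # In real FC the sending thread blocks here until credits arrive.
            raise BlockingIOError("out of credits; sender would block")
        self.credits -= nbytes

    def replenish(self, nbytes):
        # Models a credit message arriving from a receiver.
        self.credits += nbytes
```

The key point the sketch captures is that a slow receiver that withholds replenishment eventually stalls the sender, which is exactly the rate-matching behavior the FAQ above argues TCP alone cannot provide for a group.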
Fragmentation (FRAG2) - This protocol fragments messages larger than certain size. Unfragments at the receiver's side. It works for both unicast and multicast messages. It is configured in the FRAG2 sub-element under the JGroups Config element. Here is an example configuration. + This protocol fragments messages larger than a certain size and reassembles them at the receiver's side. It works for both unicast and multicast messages. It is configured in the FRAG2 sub-element under the JGroups config element. Here is an example configuration. ]]> + ]]> @@ -943,12 +1143,14 @@ - frag_size specifies the max frag size in bytes. Messages larger than that are fragmented. + frag_size specifies the max frag size in bytes. Messages larger than that are fragmented. For stacks + using the UDP transport, this needs to be a value less than 64KB, the maximum UDP datagram size. For TCP-based stacks + it needs to be less than the value of max_credits in the FC protocol. Note - TCP protocol already provides fragmentation but a fragmentation JGroups protocol is still needed if FC is used. The reason for this is that if you send a message larger than FC.max_bytes, FC protocol would block. So, frag_size within FRAG2 needs to be set to always be less than FC.max_bytes. + TCP protocol already provides fragmentation but a JGroups fragmentation protocol is still needed if FC is used. The reason for this is that if you send a message larger than FC.max_credits, the FC protocol will block forever. So, frag_size within FRAG2 needs to be set to always be less than FC.max_credits. @@ -957,42 +1159,45 @@ 
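Conceptually FRAG2's job is simple chunking, as the sketch below shows (plain Python, illustrative only): the sender splits a message into frag_size pieces and the receiver reassembles them by concatenation.

```python
def fragment(payload, frag_size):
    """Split a message into frag_size-byte chunks, as FRAG2 does on the
    sending side; the receiver joins the chunks back in order."""
    return [payload[i:i + frag_size] for i in range(0, len(payload), frag_size)]
```

For a 150000-byte message and the frag_size of 60000 used in the example above, this produces two full 60000-byte fragments plus a 30000-byte remainder, and joining them restores the original payload.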
State Transfer - The state transfer service transfers the state from an existing node (i.e., the cluster - coordinator) to a newly joining node. It is configured in the - pbcast.STATE_TRANSFER sub-element under the JGroups Config + The state transfer service requests the application state (serialized as a byte array) from an existing node (i.e., the cluster + coordinator) and transfers it to a newly joining node. It tracks the sequence of messages that + went into creating the application state, providing a valid starting point for message tracking by + reliable delivery protocols like NAKACK and UNICAST. It is configured in the + pbcast.STATE_TRANSFER sub-element under the JGroups config element. It does not have any configurable attribute. Here is an example configuration. -<pbcast.STATE_TRANSFER down_thread="false" up_thread="false"/> +<pbcast.STATE_TRANSFER/>
Distributed Garbage Collection (STABLE) - In a JGroups cluster, all nodes have to store all messages received for potential retransmission in case of a failure. However, if we store all messages forever, we will run out of memory. So, the distributed garbage collection service in JGroups periodically purges messages that have seen by all nodes from the memory in each node. The distributed garbage collection service is configured in the pbcast.STABLE sub-element under the JGroups Config element. Here is an example configuration. + In a JGroups cluster, all nodes have to store all messages received for potential retransmission in case of a failure. However, if we store all messages forever, we will run out of memory. So, the distributed garbage collection service in JGroups periodically purges messages that have been seen by all nodes from the memory in each node. The distributed garbage collection service is configured in the pbcast.STABLE sub-element under the JGroups config element. Here is an example configuration. <pbcast.STABLE stability_delay="1000" desired_avg_gossip="5000" - down_thread="false" up_thread="false" - max_bytes="400000"/> + max_bytes="400000"/> The configurable attributes in the pbcast.STABLE element are as follows. desired_avg_gossip specifies intervals (in - milliseconds) of garbage collection runs. Value 0 disables this - service. + milliseconds) of garbage collection runs. Value 0 disables + interval-based execution of garbage collection. max_bytes specifies the maximum number of bytes received before the cluster triggers a garbage collection run. Value - 0 disables this service. + 0 disables execution of garbage collection + based on bytes received. - stability_delay specifies delay before we send STABILITY msg (give others a change to send first). If used together with max_bytes, this attribute should be set to a small number. 
+ stability_delay specifies the maximum amount (in milliseconds) of a random delay introduced before a node sends its STABILITY msg + at the end of a garbage collection run. The delay gives other nodes concurrently running a STABLE task a chance to send first. If used together with max_bytes, this attribute should be set to a small number. @@ -1003,13 +1208,12 @@ 
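The two triggers described above (desired_avg_gossip for time, max_bytes for volume, with 0 disabling a trigger) combine as sketched below (plain Python, illustrative only; the function name is invented):

```python
def stable_run_due(bytes_since_run, ms_since_run, max_bytes, desired_avg_gossip):
    """A STABLE round is due when either trigger fires: enough bytes
    received (max_bytes) or enough time elapsed (desired_avg_gossip).
    A value of 0 disables the corresponding trigger."""
    by_bytes = max_bytes > 0 and bytes_since_run >= max_bytes
    by_time = desired_avg_gossip > 0 and ms_since_run >= desired_avg_gossip
    return by_bytes or by_time
```

With the example configuration above (desired_avg_gossip=5000, max_bytes=400000), a round runs roughly every 5 seconds or sooner if 400KB of traffic arrives first.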
Merging (MERGE2) - When a network error occurs, the cluster might be partitioned into several different partitions. JGroups has a MERGE service that allows the coordinators in partitions to communicate with each other and form a single cluster back again. The flow control service is configured in the MERGE2 sub-element under the JGroups Config element. Here is an example configuration. + When a network error occurs (e.g. a crashed switch), the cluster might be partitioned into several different sub-groups. JGroups has "merge" protocols that allow the coordinators in the sub-groups to communicate with each other (once the network heals) and merge back into a single group again. This service is configured in the MERGE2 sub-element under the JGroups config element. Here is an example configuration. -<MERGE2 max_interval="10000" - min_interval="2000" - down_thread="false" up_thread="false"/> +<MERGE2 max_interval="100000" + min_interval="20000"/> @@ -1017,19 +1221,21 @@ max_interval specifies the maximum number of - milliseconds to send out a MERGE message. + milliseconds to wait before sending out a MERGE message. min_interval specifies the minimum number of - milliseconds to send out a MERGE message. + milliseconds to wait before sending out a MERGE message. JGroups chooses a random value between min_interval and - max_interval to send out the MERGE message. + max_interval to periodically send out the MERGE message. - The cluster states are not merged in a merger. This has to be done by the application. If MERGE2 is used in conjunction with TCPPING, the initial_hosts attribute must contain all the nodes that could potentially be merged back, in order for the merge process to work properly. Otherwise, the merge process would not merge all the nodes even though shunning is disabled. Alternatively use MPING, which is commonly used with TCP to provide multicast member discovery capabilities, instead of TCPPING to avoid having to specify all the nodes. 
- + The application state maintained by the application using a channel is not merged by JGroups during a merge. This has to be done by the application. + + + If MERGE2 is used in conjunction with TCPPING, the initial_hosts attribute must contain all the nodes that could potentially be merged back, in order for the merge process to work properly. Otherwise, the merge process may not detect all sub-groups, missing those comprised solely of unlisted members.
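The random scheduling of MERGE messages between min_interval and max_interval can be sketched as follows (plain Python, illustrative only; the function name is invented):

```python
import random

def next_merge_delay(min_interval, max_interval, rng=random):
    """MERGE2 waits a random number of milliseconds, chosen between
    min_interval and max_interval, before each periodic MERGE message."""
    return rng.randint(min_interval, max_interval)
```

With the example configuration above (min_interval=20000, max_interval=100000), each MERGE message is sent after a randomly chosen delay of 20 to 100 seconds; the randomization keeps coordinators of different sub-groups from sending in lockstep.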
@@ -1037,17 +1243,17 @@
- Other Configuration Issues + Key JGroups Configuration Tips -
Binding JGroups Channels to a particular interface +
Binding JGroups Channels to a Particular Interface In the Transport Protocols section above, we briefly touched on how the interface to which JGroups will bind sockets is configured. Let's get into this topic in more depth: - First, it's important to understand that the value set in any bind_addr element in an XML configuration file will be ignored by JGroups if it finds that system property jgroups.bind_addr (or a deprecated earlier name for the same thing, bind.address) has been set. The system property trumps XML. If JBoss AS is started with the -b (a.k.a. --host) switch, the AS will set jgroups.bind_addr to the specified value. + First, it's important to understand that the value set in any bind_addr element in an XML configuration file will be ignored by JGroups if it finds that system property jgroups.bind_addr (or a deprecated earlier name for the same thing, bind.address) has been set. The system property trumps XML. If JBoss AS is started with the -b (a.k.a. --host) switch, the AS will set jgroups.bind_addr to the specified value. - Beginning with AS 4.2.0, for security reasons the AS will bind most services to localhost if -b is not set. The effect of this is that in most cases users are going to be setting -b and thus jgroups.bind_addr is going to be set and any XML setting will be ignored. + Beginning with AS 4.2.0, for security reasons the AS will bind most services to localhost if -b is not set. The effect of this is that in most cases users are going to be setting -b and thus jgroups.bind_addr is going to be set and any XML setting will be ignored. So, what are best practices for managing how JGroups binds to interfaces? @@ -1055,7 +1261,7 @@ - Binding JGroups to the same interface as other services. Simple, just use -b: + Binding JGroups to the same interface as other services. 
Simple, just use -b: ./run.sh -b 192.168.1.100 -c all @@ -1064,7 +1270,7 @@ Binding services (e.g., JBoss Web) to one interface, but use a different one for JGroups: ./run.sh -b 10.0.0.100 -Djgroups.bind_addr=192.168.1.100 -c all - Specifically setting the system property overrides the -b value. This is a common usage pattern; put client traffic on one network, with intra-cluster traffic on another. + Specifically setting the system property overrides the -b value. This is a common usage pattern; put client traffic on one network, with intra-cluster traffic on another. @@ -1080,7 +1286,7 @@ Binding services (e.g., JBoss Web) to all interfaces, but specify the JGroups interface: ./run.sh -b 0.0.0.0 -Djgroups.bind_addr=192.168.1.100 -c all - Again, specifically setting the system property overrides the -b value. + Again, specifically setting the system property overrides the -b value. @@ -1092,77 +1298,219 @@ -This setting tells JGroups to ignore the jgroups.bind_addr system property, and instead use whatever is specfied in XML. You would need to edit the various XML configuration files to set the bind_addr to the desired interfaces. +This setting tells JGroups to ignore the jgroups.bind_addr system property, and instead use whatever is specified in XML. You would need to edit the various XML configuration files to set the various bind_addr attributes to the desired interfaces.
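The precedence rule described above can be sketched as a small shell helper. This is illustration only (the helper name is ours; the real resolution happens inside JGroups): the jgroups.bind_addr system property, when set, wins over any bind_addr value from the XML configuration.

```shell
# effective_bind_addr PROPERTY XML_VALUE
# Mirror of the rule above: the jgroups.bind_addr system property
# (set by the AS when -b is given) trumps the XML bind_addr setting.
effective_bind_addr() {
  if [ -n "$1" ]; then echo "$1"; else echo "$2"; fi
}
effective_bind_addr "192.168.1.100" "127.0.0.1"   # property set: 192.168.1.100
effective_bind_addr ""              "127.0.0.1"   # property unset: 127.0.0.1
```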
Isolating JGroups Channels - Within JBoss AS, there are a number of services that independently create JGroups channels -- 3 different JBoss Cache services (used for HttpSession replication, EJB3 SFSB replication and EJB3 entity replication) along with the general purpose clustering service called HAPartition that underlies most other JBossHA services. + Within JBoss AS, there are a number of services that independently create JGroups channels -- possibly multiple different JBoss Cache services (used for HttpSession replication, EJB3 SFSB replication and EJB3 entity replication), two JBoss Messaging channels, and the general purpose clustering service called HAPartition that underlies most other JBossHA services. It is critical that these channels only communicate with their intended peers; not with the channels used by other services and not with channels for the same service opened on machines not meant to be part of the group. Nodes improperly communicating with each other is one of the most common issues users have with JBoss AS clustering. - Whom a JGroups channel will communicate with is defined by its group name, multicast address, and multicast port, so isolating JGroups channels comes down to ensuring different channels use different values for the group name, multicast address and multicast port. - - - To isolate JGroups channels for different services on the same set of AS instances from each other, you MUST change the group name and the multicast port. In other words, each channel must have its own set of values. + Whom a JGroups channel will communicate with is defined by its group name and, for UDP-based channels, its multicast address and port. So isolating JGroups channels comes down to ensuring different channels use different values for the group name, the multicast address and, in some cases, the multicast port. 
- - For example, say we have a production cluster of 3 machines, each of which has an HAPartition deployed along with a JBoss Cache used for web session clustering. The HAPartition channels should not communicate with the JBoss Cache channels. They should use a different group name and multicast port. They can use the same multicast address, although they don't need to. - - - To isolate JGroups channels for the same service from other instances of the service on the network, you MUST change ALL three values. Each channel must have its own group name, multicast address, and multicast port. + +
+ Isolating Sets of AS Instances from Each Other + + The issue being addressed here is the case where, in the + same environment, you have multiple independent clusters running. For example, + a production cluster, a staging cluster and a QA cluster. Or multiple + clusters in a QA test lab or in a dev team environment. Or a large set of + production machines divided into multiple clusters. + + + To isolate JGroups clusters from other clusters on the network, you need to: + + + + Make sure the channels in the various clusters use different group names. This is easy to control from the command line arguments used to start JBoss; see . + Make sure the channels in the various clusters use different multicast addresses. This is also easy to control from the command line arguments used to start JBoss; see . + + If you are not running on Linux, Windows, Solaris or HP-UX, you may + also need to ensure that the channels in each cluster use different + multicast ports. This is quite a bit more troublesome than using different + group names, although it can still be controlled from the command line. + See . Note + that using different ports should not be necessary if your servers are + running on Linux, Windows, Solaris or HP-UX. + + +
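Putting the steps above together, a startup line for an isolated QA cluster might look like the following sketch. The helper function and the example partition name, multicast address and bind address are ours, not part of the AS; pick values unused elsewhere on your network.

```shell
# start_cmd PARTITION MCAST_ADDR BIND_ADDR
# Compose the JBoss startup line: -g gives the cluster a unique group name
# and -u a unique multicast address, isolating it from other clusters.
start_cmd() {
  echo "./run.sh -g $1 -u $2 -b $3 -c all"
}
start_cmd QAPartition 239.255.100.100 192.168.1.100
```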
+ +
+ Isolating Channels for Different Services on the Same Set of AS Instances + + The issue being addressed here is the normal case where + we have a cluster of 3 machines, each of which + has, for example, an HAPartition deployed along with a JBoss Cache used for web session + clustering. The HAPartition channels should not communicate with the + JBoss Cache channels. Ensuring proper isolation of these channels is straightforward, + and generally speaking the AS handles it for you without any special effort + on your part. So most readers can skip this section. + + + To isolate JGroups channels for different services on the same set of AS instances from each other, + each channel must have its own group name. The configurations that ship + with JBoss AS of course ensure that this is the case. If you create a custom service + that directly uses JGroups, just make sure you use a unique group name. + If you create a custom JBoss Cache configuration, make sure you provide + a unique value in the clusterName configuration property. + - For example, say we have a production cluster of 3 machines, each of which has an HAPartition deployed. On the same network there is also a QA cluster of 3 machines, which also has an HAPartition deployed. The HAPartition group name, multicast address, and multicast port for the production machines must be different from those used on the QA machines. + In releases prior to AS 5, different channels running in the same AS also had to use unique multicast ports. + With the JGroups shared transport introduced in AS 5 (see + ), it is + now common for multiple channels to use the same transport protocol and its sockets. + This makes configuration easier, which is one of the main benefits of the shared + transport. However, if you decide to create your own custom JGroups protocol + stack configuration, be sure to configure its transport protocols with a multicast port + that is different from the ports used in other protocol stacks. - +
-
Changing the Group Name +
Changing the Group Name - The group name for a JGroups channel is configured via the service that starts the channel. Unfortunately, different services use different attribute names for configuring this. For HAPartition and related services configured in the deploy/cluster-service.xml file, this is configured via a PartitionName attribute. For JBoss Cache services, the name of the attribute is ClusterName. - - Starting with JBoss AS 4.0.4, for the HAPartition and all the standard JBoss Cache services, we make it easy for you to create unique groups names simply by using the -g (a.k.a. –partition) switch when starting JBoss: + The group name for a JGroups channel is configured via the service that + starts the channel. For all the standard clustered services, we make it easy + for you to create unique group names simply by using the -g (a.k.a. --partition) + switch when starting JBoss: ./run.sh -g QAPartition -b 192.168.1.100 -c all - This switch sets the jboss.partition.name system property, which is used as a component in the configuration of the group name in all the standard clustering configuration files. For example, -Tomcat-${jboss.partition.name:Cluster}]]> + This switch sets the jboss.partition.name system property, + which is used as a component in the configuration of the group name in + all the standard clustering configuration files. For example, +${jboss.partition.name:DefaultPartition}-SFSBCache]]>
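The ${jboss.partition.name:DefaultPartition} substitution shown above resolves to the -g value when one is given, falling back to the default after the colon otherwise. A shell sketch of that resolution (the helper is ours; the real substitution happens inside the AS):

```shell
# group_name PARTITION_PROPERTY
# Mimic the ${jboss.partition.name:DefaultPartition}-SFSBCache expansion:
# use the property value if non-empty, else the default "DefaultPartition".
group_name() {
  echo "${1:-DefaultPartition}-SFSBCache"
}
group_name QAPartition   # started with -g QAPartition: QAPartition-SFSBCache
group_name ""            # no -g switch: DefaultPartition-SFSBCache
```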
-
Changing the multicast address and port +
Changing the Multicast Address - The -u (a.k.a. --udp) command line switch may be used to control the multicast address used by the JGroups channels opened by all standard AS services. + The -u (a.k.a. --udp) command line switch may be used to control the multicast address used by the JGroups channels opened by all standard AS services. - This switch sets the jboss.partition.udpGroup system property, which you can see referenced in all of the standard protocol stack configs in JBoss AS: + This switch sets the jboss.partition.udpGroup system property, + which you can see referenced in all of the standard protocol stack configurations in JBoss AS: - - - - - Unfortunately, setting the multicast ports is not so simple. As described above, by default there are four separate JGroups channels in the standard JBoss AS all configuration, and each should be given a unique port. There are no command line switches to set these, but the standard configuration files do use system properties to set them. So, they can be configured from the command line by using -D. For example, - - - /run.sh -u 230.1.2.3 -g QAPartition -Djboss.hapartition.mcast_port=12345 -Djboss.webpartition.mcast_port=23456 -Djboss.ejb3entitypartition.mcast_port=34567 -Djboss.ejb3sfsbpartition.mcast_port=45678 -b 192.168.1.100 -c all - - -Why isn't it sufficient to change the group name? + + + + Why isn't it sufficient to change the group name? If channels with different group names share the same multicast address and port, the lower level JGroups protocols in each channel will see, process and eventually discard messages intended for the other group. This will at a minimum hurt performance and can lead to anomalous behavior. - - Why do I need to change the multicast port if I change the address? + +
+ +
+ Changing the Multicast Port + + On some operating systems (Mac OS X for example), using different + -g and -u values isn't sufficient + to isolate clusters; the channels running in the different clusters + need to use different multicast ports. Unfortunately, setting the + multicast ports is not quite as simple as -g and + -u. By default, a JBoss AS instance + running the all configuration will use up to two different instances of + the JGroups UDP transport protocol, and will thus open two + multicast sockets. You can control the ports those sockets use + by using system properties on the command line. For example, + + +/run.sh -u 230.1.2.3 -g QAPartition -b 192.168.1.100 -c all \\ + -Djboss.jgroups.udp.mcast_port=12345 -Djboss.messaging.datachanneludpport=23456 + + + The jboss.messaging.datachanneludpport property controls + the multicast port used by the MPING protocol in JBoss Messaging's DATA channel. + The jboss.jgroups.udp.mcast_port property controls the + multicast port used by the UDP transport protocol shared by all other clustered services. + + The set of JGroups protocol stack configurations included in the + $JBOSS_HOME/server/all/cluster/jgroups-channelfactory.sar/META-INF/jgroups-channelfactory-stacks.xml + file includes a number of other example protocol stack configurations that + the standard AS distribution doesn't actually use. Those configurations also + use system properties to set any multicast ports. So, if you reconfigure some + AS service to use one of those protocol stack configurations, just use the + appropriate system property to control the port from the command line. + + Why do I need to change the multicast port if I change the address? - It should be sufficient to just change the address, but there is a problem on several operating systems whereby packets addressed to a particular multicast port are delivered to all listeners on that port, regardless of the multicast address they are listening on. 
So the recommendation is to change both the address and the port. + It should be sufficient to just change the address, but unfortunately the + handling of multicast sockets is one area where the JVM fails to hide + OS behavior differences from the application. The java.net.MulticastSocket + class provides different overloaded constructors. On some operating + systems, if you use one constructor variant, there is a problem whereby + packets addressed to a particular multicast port are delivered to all + listeners on that port, regardless of the multicast address on which they are + listening. We refer to this as the "promiscuous traffic" problem. + On most operating systems that exhibit the promiscuous traffic problem + (i.e. Linux, Solaris and HP-UX) JGroups can use a different constructor + variant that avoids the problem. However, on some OSs with the + promiscuous traffic problem (e.g. Mac OS X), multicast does not work + properly if the other constructor variant is used. So, on these + operating systems the recommendation is to configure different + multicast ports for different clusters. +
+ +
+ Improving UDP Performance by Configuring OS UDP Buffer Limits + By default, the JGroups channels in JBoss AS use the UDP transport + protocol in order to take advantage of IP multicast. However, one disadvantage + of UDP is that it does not come with the reliable delivery guarantees + provided by TCP. The protocols discussed in + allow JGroups to guarantee delivery of + UDP messages, but those protocols are implemented in Java, not at the + OS network layer. To get peak performance from a UDP-based JGroups + channel it is important to limit the need for JGroups to retransmit messages + by limiting UDP datagram loss. + + One of the most common causes of lost UDP datagrams is an undersized receive + buffer on the socket. The UDP protocol's mcast_recv_buf_size + and ucast_recv_buf_size configuration attributes + are used to specify the amount of receive buffer JGroups requests + from the OS, but the actual size of the buffer the OS will provide + is limited by OS-level maximums. These maximums are often very low: + + + Default Max UDP Buffer Sizes + + Operating System | Default Max UDP Buffer (in bytes) + + Linux | 131071 + Windows | No known limit + Solaris | 262144 + FreeBSD, Darwin | 262144 + AIX | 1048576 + +
+ + The command used to increase the above limits is OS-specific. The table + below shows the command required to increase the maximum buffer to 25MB. + In all cases root privileges are required: + + + Commands to Change Max UDP Buffer Sizes + + Operating System | Command + + Linux | sysctl -w net.core.rmem_max=26214400 + Solaris | ndd -set /dev/udp udp_max_buf 26214400 + FreeBSD, Darwin | sysctl -w kern.ipc.maxsockbuf=26214400 + AIX | no -o sb_max=8388608 (AIX will only allow 1MB, 4MB or 8MB). + +
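For scripted provisioning, the commands above can be wrapped in a small helper keyed on the operating system name. The function is ours, not part of the AS; the printed commands still need to be run with root privileges.

```shell
# buffer_cmd OS
# Print the command (from the table above) that raises the OS maximum
# UDP buffer size to 25MB (8MB on AIX, the largest value it allows).
buffer_cmd() {
  case "$1" in
    Linux)           echo "sysctl -w net.core.rmem_max=26214400" ;;
    Solaris|SunOS)   echo "ndd -set /dev/udp udp_max_buf 26214400" ;;
    FreeBSD|Darwin)  echo "sysctl -w kern.ipc.maxsockbuf=26214400" ;;
    AIX)             echo "no -o sb_max=8388608" ;;
    *)               echo "unknown OS: $1" >&2; return 1 ;;
  esac
}
buffer_cmd Linux
```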
+
JGroups Troubleshooting