AMQ Broker / ENTMQBR-7515

Cluster Does Not Reliably Recover after DNS Outage

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • None
    • AMQ 7.10.0.GA
    • broker-core, clustering
    • None
    • False
    • None
    • False
    • Steps to Reproduce:

      1. Configure a 3-node replicated HA cluster using hostnames for connector / acceptor configuration
      2. Configure a DNS server with the hostnames of the 3 nodes
      3. Edit /etc/resolv.conf on each of the broker nodes to use only the DNS server in (2)
      4. Make sure /etc/hosts contains only the default entries and not the hostnames of the broker hosts
      5. Start the cluster and inspect the topology to ensure that it correctly reports 6 nodes / 3 members
      6. Interrupt the connectivity to the DNS server for some time. I used a script like this one on the DNS server to accomplish this:

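      #!/bin/sh
      # Simulate a DNS outage by taking the DNS server's interface down.

      echo "Sleeping..."
      sleep 10                # grace period before the outage starts
      echo "Killing network..."
      ip link set eth0 down   # DNS server becomes unreachable
      sleep 150               # length of the outage in seconds
      ip link set eth0 up     # restore connectivity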
      

      7. While the DNS connectivity is down, restart a live and backup node in the cluster
      8. Wait for DNS connectivity to be restored
      9. Monitor the cluster to see if it is fully restored following the outage

      It may take a few tries, but eventually the cluster ends up missing members, even after DNS connectivity has been restored and some time has passed.
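
      Waiting for DNS connectivity to return can be scripted instead of watched by hand. The following is a minimal sketch, not part of the original report: wait_until is a hypothetical helper that retries any probe command, and the hostname in the usage comment is a placeholder. It assumes getent is available on the broker hosts.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # The probe's output is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Skip the sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example (hypothetical hostname): probe every 5 seconds, up to 60 times,
# until the broker hostname resolves again.
# wait_until 60 5 getent hosts broker1.example.com && echo "DNS is back"
```

      Passing the probe as a command keeps the helper independent of how resolution is tested, so the same loop could wrap any other check that reports success through its exit status.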

    • Important

      If a live or backup broker is restarted during a DNS outage, it does not reliably rejoin the cluster, even after DNS service is restored and even with cluster reconnect-attempts set to -1 (unlimited). Instead, the topology remains broken until the cluster is restarted after DNS service is restored.
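
      A broken topology of this kind can be flagged by comparing the members that should be present against the members actually reported. The sketch below is illustrative only: missing_members is a hypothetical helper, the node names are placeholders, and how the observed list is collected (for example, from whatever tool is used to inspect the topology) is left as an assumption.

```shell
#!/bin/sh
# Print every expected member that is absent from the observed list.
# Usage: missing_members "EXPECTED NAMES" "OBSERVED NAMES"
# Both arguments are whitespace-separated lists of node names.
missing_members() {
    expected=$1
    observed=$2
    for name in $expected; do
        found=no
        for seen in $observed; do
            if [ "$name" = "$seen" ]; then
                found=yes
                break
            fi
        done
        # Report only the names that were never matched.
        if [ "$found" = no ]; then
            echo "$name"
        fi
    done
}

# Example with hypothetical node names; the observed list omits broker2.
# missing_members "broker1 broker2 broker3" "broker1 broker3"
```

      Empty output means every expected member was observed, so the function can be run repeatedly after DNS service returns to see whether the cluster ever converges.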

            Assignee: Unassigned
            Reporter: Duane Hawkins (rhn-support-dhawkins)
            Votes: 0
            Watchers: 1
