  1. AMQ Broker
  2. ENTMQBR-3752

Backup broker cannot reestablish connection with its master


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Done
    • Affects Version/s: AMQ 7.4.3.GA
    • Fix Version/s: AMQ 7.8.0.CR1
    • Component/s: clustering
    • Labels:
      None
    • Target Release:
    • Steps to Reproduce:

      The crux of the issue seems to be a topology that fails to rebuild after an interruption of network services. It is relatively easy to reproduce, though I am not sure how common the issue would be in practice.

      General reproducer instructions:

      1. Set up a 2-node broker cluster (the reproducer uses replication with the network pinger) using wildcard addresses for the acceptors, fully-qualified hostnames for the cluster connectors, and fully-qualified hostnames for the network ping targets.

      • configure live broker host with 1 source of DNS information (e.g. DNS server)
      • configure backup broker host with 2 sources of DNS information (2 DNS servers or DNS + hosts)
      • configure network pinger with 2 hosts - I used the 2 DNS servers

      2. Start the brokers and wait for them to settle and replicate initially
      3. Stop the source of DNS information for the master broker (I stopped the DNS service, leaving the host up)
      4. Wait for master broker to start logging UnknownHostException errors
      5. Interrupt the connection between live and backup broker. I did this by temporarily disabling the relevant interface on the live node with the attached network-interrupt.sh script.
      6. Observe that both brokers are live
      7. Restart DNS service
      8. Both brokers remain live
      9. Restart backup broker
      10. Backup fails to reconnect, and topologies for both brokers show zero nodes and zero members.

      The issue seems to be that the master loses its topology information and fails to rebuild it, so the acceptor used for CORE fails to transmit the topology to the slave, and the slave cannot reconnect.
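
While walking through the steps above, it can help to confirm from each host when name resolution has actually failed (step 4) and whether the peer's acceptor is reachable over TCP (steps 5 to 10). The following is a minimal sketch of such a check and is not part of the attached reproducer. The hostnames and the port 61616 are assumptions for illustration; substitute the values from your own broker.xml files.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an address."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Resolution failed; the Java broker would log UnknownHostException here.
        return False


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


def check_hosts(hosts, port):
    """Return a list of (host, resolves, port_open) tuples for the given hosts."""
    return [(h, resolves(h), port_open(h, port)) for h in hosts]
```

For example, `check_hosts(["live.example.com", "backup.example.com"], 61616)` (hypothetical hostnames, assumed acceptor port) run on the live host after step 3 should report that neither name resolves. Note that this only checks DNS and TCP reachability; it says nothing about the cluster topology the brokers hold.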

    • Release Notes Text:
      In the event of a network outage, it is possible for both brokers in a live-backup group to become live at the same time (a situation known as "network isolation" or "split brain"). Previously, if this situation occurred, any connected AMQ Core Protocol JMS clients received incorrect broker topology information. As a result, when the network and split brain issues were solved, the client could not reconnect to the right brokers. To work around this issue, you needed to restart the clients. This issue is now resolved.
    • Release Notes Docs Status:
      Documented as Resolved Issue
    • QE Test Coverage:
      -
    • Upstream Jira:
      ARTEMIS-2587 ARTEMIS-2858 ARTEMIS-2867 ARTEMIS-2868
    • Verified:
      Verified in a release

      Description

      Due to a temporary network issue, the master and backup brokers lost their cluster connection. The master broker continued to work, while the backup broker could not reestablish its connection to the master despite repeated restarts.

      There were no errors in either the master or the backup broker log file.

        Attachments

        1. backup-broker.xml
          9 kB
        2. live-broker.xml
          9 kB
        3. network-interrupt.sh
          0.1 kB


              People

              Assignee:
              clebert.suconic Clebert Suconic
              Reporter:
              raggz Tom Ross
              Tester:
              Mikhail Krutov Mikhail Krutov
              Votes:
              0
              Watchers:
              4
