Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: AMQ 7.8.0.CR1
Affects Version/s: AMQ 7.4.3.GA
Component/s: clustering
Labels:
None

GSS Priority:
QE Test Coverage:
-
Release Note Text:

In the event of a network outage, it is possible for both brokers in a live-backup group to become live at the same time (a situation known as "network isolation" or "split brain"). Previously, if this situation occurred, any connected AMQ Core Protocol JMS clients received incorrect broker topology information. As a result, when the network and split brain issues were solved, the client could not reconnect to the right brokers. To work around this issue, you needed to restart the clients. This issue is now resolved.

Release Note Status:
Documented as Resolved Issue
Target Release:

AMQ 7.8.0.GA
Upstream Jira:
ARTEMIS-2587 ARTEMIS-2858 ARTEMIS-2867 ARTEMIS-2868
Verified:
Verified in a release
Steps to Reproduce:

The crux of the issue seems to be a topology that fails to rebuild after an interruption of network services. It is relatively easy to reproduce, though I am not sure how common the issue would be in practice.

General reproducer instructions:

1. Set up a 2-node broker cluster (the reproducer uses replication with the network pinger) using wildcard addresses for the acceptors, fully-qualified hostnames for the cluster connectors, and fully-qualified hostnames for the network ping targets.
   - Configure the live broker host with 1 source of DNS information (e.g. a DNS server).
   - Configure the backup broker host with 2 sources of DNS information (2 DNS servers, or DNS + hosts).
   - Configure the network pinger with 2 hosts (I used the 2 DNS servers).

2. Start the brokers and wait for them to settle and replicate initially
3. Stop the source of DNS information for the master broker (I stopped the DNS service, leaving the host up)
4. Wait for master broker to start logging UnknownHostException errors
5. Interrupt the connection between live and backup broker. I did this by temporarily disabling the relevant interface on the live node with the attached network-interrupt.sh script.
6. Observe that both brokers are live
7. Restart DNS service
8. Both brokers remain live
9. Restart backup broker
10. Backup fails to reconnect, and topologies for both brokers show zero nodes and zero members.

The issue seems to be that the master loses the topology information and fails to rebuild it, so the acceptor used for CORE fails to transmit the topology to the slave, and the slave cannot reconnect.
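
Steps 3, 4 and 7 above depend on knowing when the hostnames used by the cluster connectors and ping targets stop and start resolving. A minimal sketch of such a check follows, run from the live broker host. The hostnames and the `check_resolution` / `unresolved` helpers are hypothetical and are not part of the attached reproducer. A failed lookup here is the Python counterpart of the UnknownHostException that the broker logs.

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def check_resolution(
    hostnames: Iterable[str],
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each hostname to its resolved address, or None if the lookup fails."""
    results: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror is a subclass of OSError, so DNS failures land here.
            results[name] = None
    return results


def unresolved(results: Dict[str, Optional[str]]) -> List[str]:
    """Return the hostnames that did not resolve, sorted for stable output."""
    return sorted(name for name, addr in results.items() if addr is None)


if __name__ == "__main__":
    # Hypothetical fully-qualified hostnames; substitute the ones used in
    # the cluster connectors and network ping targets.
    hosts = ["live.example.test", "backup.example.test"]
    failed = unresolved(check_resolution(hosts))
    print("unresolved:", failed if failed else "none")
```

Running the check before step 3, after step 4 and again after step 7 would confirm that DNS was really down while the connection was interrupted and was back before the backup was restarted.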

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Due to a temporary network issue, the master and backup brokers lost their cluster connection. The master broker continued to work, while the backup broker could not reestablish its connection to the master despite repeated restarts.

There were no errors in either the master or the backup broker log file.
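
Because the logs show no errors, the symptom is easier to spot in the topology itself, which the reproducer reports as having zero nodes and zero members. The sketch below summarizes a topology listing so that an empty one stands out. It assumes the listing is available as a JSON array of objects with `nodeID`, `live` and `backup` keys. That shape and the `describe_topology` helper are assumptions for illustration, not something this report confirms.

```python
import json
from typing import List


def topology_members(topology_json: str) -> List[dict]:
    """Parse a topology listing (assumed shape: JSON array of node entries)."""
    entries = json.loads(topology_json)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array of topology entries")
    return entries


def describe_topology(topology_json: str) -> str:
    """Summarize the topology, flagging the empty case seen in this issue."""
    entries = topology_members(topology_json)
    if not entries:
        return "empty topology: broker knows no nodes"
    # Count entries that name a live and/or a backup member.
    lives = [e["live"] for e in entries if e.get("live")]
    backups = [e["backup"] for e in entries if e.get("backup")]
    return f"{len(entries)} node(s), {len(lives)} live, {len(backups)} backup"


if __name__ == "__main__":
    # Hypothetical listings: a healthy live-backup pair and the broken state.
    healthy = json.dumps(
        [{"nodeID": "n1",
          "live": "live.example.test:61616",
          "backup": "backup.example.test:61616"}]
    )
    print(describe_topology(healthy))
    print(describe_topology("[]"))
```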

Attachments:
backup-broker.xml (9 kB, 2020/08/01 4:15 PM, Duane Hawkins)
live-broker.xml (9 kB, 2020/08/01 4:15 PM, Duane Hawkins)
network-interrupt.sh (0.1 kB, 2020/08/01 4:15 PM, Duane Hawkins)

is cloned by: ENTMQBR-3803 [LTS] Backup broker cannot reestablish connection with its master (Closed)

Assignee: Clebert Suconic
Reporter: Tom Ross
Tester: Mikhail Krutov
Votes: 0
Watchers: 4
Created: 2020/07/20 11:13 AM
Updated: 2023/10/07 5:02 AM
Resolved: 2020/11/27 6:19 AM
