Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: AMQ 7.10.3.GA
Affects Version/s: AMQ 7.10.0.GA
Component/s: None
Labels:
- CR1
- upstream-test-coverage

Blocked:
False
Blocked Reason: None
Ready:
False
GSS Priority:
Target Release: AMQ 7.10.3.GA
Upstream Jira:
https://issues.apache.org/jira/browse/ARTEMIS-4143
Steps to Reproduce:
Start the master and backup servers using NFS, as usual.

Disconnect the network between the master and the NFS server using tc commands.

Wait for the backup to go live (about 90 seconds).

Restore the network between the master and the NFS server using tc commands.
=> Both the master and the backup are live (split brain).
=> Messages can be sent to both the master and the backup.
=> After that, "kernel: NFS: __nfs4_reclaim_open_state: Lock reclaim failed!" is logged in /var/log/messages on the master, but the master does not stop being live.

tc command: I used tcconfig, a tc command wrapper, on the master server to emulate the network disconnect.
~~~
# tcset eth0 --loss 100% --network 10.0.1.172 --overwrite --tc-command
/usr/sbin/tc qdisc del dev eth0 root
/usr/sbin/tc qdisc del dev eth0 ingress
/usr/sbin/tc qdisc del dev ifb6682 root
/usr/sbin/ip link set dev ifb6682 down
/usr/sbin/ip link delete ifb6682 type ifb
/usr/sbin/tc qdisc add dev eth0 root handle 1a1a: htb default 1
/usr/sbin/tc class add dev eth0 parent 1a1a: classid 1a1a:1 htb rate 32000000.0kbit
/usr/sbin/tc class add dev eth0 parent 1a1a: classid 1a1a:96 htb rate 32000000.0Kbit ceil 32000000.0Kbit
/usr/sbin/tc qdisc add dev eth0 parent 1a1a:96 handle 2e17: netem loss 100.000000%
/usr/sbin/tc filter add dev eth0 protocol ip parent 1a1a: prio 5 u32 match ip dst 10.0.1.172/32 match ip src 0.0.0.0/0 flowid 1a1a:96
~~~
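The disconnect / wait / restore sequence in the steps above could be scripted roughly as follows. This is only a sketch under assumptions: the `tcset` options are the ones quoted in this report, while the use of `tcdel --all` to restore the network, the fixed 90-second wait, and the helper names `run` and `reproduce` are illustrative choices that are not taken from the report. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction timeline, to be run on the master server.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them
# and requires root plus an installed tcconfig.
DRY_RUN="${DRY_RUN:-1}"
IFACE="${IFACE:-eth0}"                  # interface used in the report
NFS_SERVER="${NFS_SERVER:-10.0.1.172}"  # NFS server address used in the report
WAIT_SECS="${WAIT_SECS:-90}"            # assumption: backup goes live in about 90 seconds

# Print the command in dry-run mode, otherwise execute it (hypothetical helper).
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

reproduce() {
  # 1. Drop all traffic from the master to the NFS server.
  run tcset "$IFACE" --loss 100% --network "$NFS_SERVER" --overwrite
  # 2. Give the backup time to take over (check its log before restoring).
  run sleep "$WAIT_SECS"
  # 3. Remove the traffic-control rules again (assumed restore step).
  run tcdel "$IFACE" --all
}
```

Calling `reproduce` with the defaults prints the three commands in order, which makes it easy to review them before re-running with `DRY_RUN=0`.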

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

[Problem]

Split brain occurs when the network between the master and the NFS server is temporarily disconnected.
- When the network between the master and the NFS server is disconnected using the tc command, the backup goes live after a few minutes.
- If the network is then restored immediately after confirming that the backup has gone live, both the master and the backup are live at the same time.
- Split brain can cause data inconsistencies, including data loss in the worst case.

[Environment]

Red Hat AMQ Broker 7.10

One master/backup server pair
Shared store HA using NFS
Master and backup, each
- on Red Hat Enterprise Linux release 8.6 (Ootpa) on AWS EC2 t2.large(vCPU:2, Memory:8GB)

The NFS mount is shown below. It uses the options recommended in our documentation: https://access.redhat.com/documentation/en-us/red_hat_amq_broker/7.10/html-single/configuring_amq_broker/index#con_br-configuring-nfs-shared-store_configuring

           $ mount -t nfs4 -o nfsvers=4,rsize=1048576,wsize=1048576,sync,intr,soft,noac,lookupcache=none,timeo=600,retrans=2,noresvport 10.0.1.172:/share data
           ...
           $ nfsstat -m
                /home/ec2-user/amq-broker-7.10.0/broker0/data from 10.0.1.172:/share
                Flags:rw,sync,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.9.39,lookupcache=none,local_lock=none,addr=10.0.1.172

NFS Server 4.2 (nfs-utils-2.3.3-51.el8.src.rpm)
- on Red Hat Enterprise Linux release 8.6 (Ootpa) on AWS EC2 t2.large(vCPU:2, Memory:8GB)

[For your reference]

If the network between the master and the NFS server is left disconnected and never restored,
- "WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: Input/output error" is logged on the master after about 3 minutes,
- then the master shuts down with "ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=IO Error while calculating disk usage: java.nio.file.FileSystemException: /home/ec2-user/amq-broker-7.10.0/broker0/data/paging: Input/output error" after about 8 minutes.
In other words, split brain seems to occur if the backup goes live and the connection is restored within about 8 minutes, that is, before the master shuts itself down.
According to Gary, it is also related to this ticket: https://issues.apache.org/jira/browse/AMQ-4705
The NFS "intr" option appears to be effectively unavailable because it has been deprecated: https://access.redhat.com/solutions/157873.

[As far as I can tell from reading the source code with a debugger]

I suspect the cause is that FileLock.isValid() is unreliable on NFS.
The problem may be insufficient double-checking of the live file lock by AMQ Broker (or possibly a bug in FileChannel/FileLock).
- From reading the FileLockNodeManager source code[2] and using the debugger: FileLockNodeManager relies on FileChannel and FileLock to lock the live file (data/journal/serverlock.1), and periodically checks whether the live file lock is still held using FileLock.isValid().
  ~~However, FileLock.isValid() seems unreliable as described in the AMQ Broker source code comments[L510-L511].~~
- In the master server's debug log, "Server still has the lock, double check status is live" was logged repeatedly after the split brain[1], even after "kernel: NFS: __nfs4_reclaim_open_state: Lock reclaim failed!" was logged in /var/log/messages on the master server. The loss of the lock is reported at the NFS client layer, but FileLock.isValid() still returns true at the JVM layer.
- I also checked the backup server with the debugger, and FileLock.isValid()==true there as well for the live file lock (data/journal/serverlock.1).
  During the split brain, the master and the backup each saw FileLock.isValid()==true.
- I don't know whether getState()[L517] in FileLockNodeManager is intended to double-check the live lock, but it does not seem to help detect split brain, at least for this problem.

[1] master log

2022-09-08 11:54:07,440 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Server still has the lock, double check status is live
2022-09-08 11:54:07,440 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] getting state...
2022-09-08 11:54:07,440 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying to lock position: 0
2022-09-08 11:54:07,441 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] locked position: 0
2022-09-08 11:54:07,442 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] state: 76
2022-09-08 11:54:07,442 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Status is set to live
...
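
To confirm from the logs that the master kept reporting itself as live after the kernel reported the lost lock, the log excerpt above can be counted from a given point in time. The following is a minimal sketch, assuming the broker log uses the `YYYY-MM-DD HH:MM:SS,mmm` prefix shown in [1]. The function name `count_live_checks_since` is hypothetical, and the time of the kernel message has to be converted by hand from the /var/log/messages format into that same prefix format.

```shell
# Count "Server still has the lock" entries logged at or after a timestamp.
#   $1: path to the broker debug log
#   $2: timestamp in the broker log format, e.g. "2022-09-08 11:54:00"
# The comparison is a plain string comparison, which works because the
# timestamp prefix sorts lexicographically.
count_live_checks_since() {
  awk -v since="$2" '
    index($0, "Server still has the lock") && ($1 " " $2) >= since { n++ }
    END { print n + 0 }
  ' "$1"
}
```

For example, `count_live_checks_since artemis.log "2022-09-08 11:54:00"` (with a hypothetical log path, and with the timestamp set to the time of the "Lock reclaim failed!" kernel message) prints how many times the master re-confirmed its live status from that moment on. A non-zero count is consistent with the behavior described in this report.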

[2] FileLockNodeManager source code

https://github.com/apache/activemq-artemis/blob/2.21.0/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java

clones

ENTMQBR-7130 Split brain occurs when to disconnect network between master and NFS server temporary

Closed

relates to

ENTMQBR-7855 [QE] Create a test for ENTMQBR-7130

Refinement

links to

AMQ-4705

ARTEMIS-2421 Implement periodic journal lock evaluation

Assignee: Justin Bertram

Reporter: Tomonari Yamashita

Tester: Samuel Gajdos

Votes: 0

Watchers: 3

Created: 2023/03/31 3:27 PM

Updated: 2023/05/24 7:55 AM

Resolved: 2023/04/27 8:16 PM
