-
Bug
-
Resolution: Done
-
Critical
-
AMQ 7.10.0.GA
[Problem]
- Split brain occurs when the network between the master and the NFS server is temporarily disconnected.
- When the network between the master and the NFS server is disconnected using the tc command, the backup goes live after a few minutes.
- If the network is then restored immediately after confirming that the backup has gone live, both the master and the backup end up live.
- Split brain can cause data inconsistencies, including data loss in the worst case.
[Environment]
- Red Hat AMQ Broker 7.10
- One master/backup pair of servers
- Shared store HA using NFS
- Master and backup each run on Red Hat Enterprise Linux release 8.6 (Ootpa) on AWS EC2 t2.large (vCPU: 2, Memory: 8 GB)
- The NFS mount is shown below. It uses the options recommended in our documentation: https://access.redhat.com/documentation/en-us/red_hat_amq_broker/7.10/html-single/configuring_amq_broker/index#con_br-configuring-nfs-shared-store_configuring
$ mount -t nfs4 -o nfsvers=4,rsize=1048576,wsize=1048576,sync,intr,soft,noac,lookupcache=none,timeo=600,retrans=2,noresvport 10.0.1.172:/share data
...
$ nfsstat -m
/home/ec2-user/amq-broker-7.10.0/broker0/data from 10.0.1.172:/share
Flags: rw,sync,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,noresvport,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.0.9.39,lookupcache=none,local_lock=none,addr=10.0.1.172
- NFS Server 4.2 (nfs-utils-2.3.3-51.el8.src.rpm) on Red Hat Enterprise Linux release 8.6 (Ootpa) on AWS EC2 t2.large (vCPU: 2, Memory: 8 GB)
[For your reference]
- If the network between the master and the NFS server is left disconnected and never restored:
- "WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: Input/output error" is logged on the master after about 3 minutes,
- then the master is shut down by "ERROR [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=IO Error while calculating disk usage: java.nio.file.FileSystemException: /home/ec2-user/amq-broker-7.10.0/broker0/data/paging: Input/output error" after about 8 minutes.
- In other words, split brain appears to occur if the backup goes live and the connection is restored before the master shuts down, which happens after about 8 minutes.
- According to Gary, it is also related to this ticket: https://issues.apache.org/jira/browse/AMQ-4705
- The NFS "intr" option appears to have no effect because it has been deprecated: https://access.redhat.com/solutions/157873.
[As far as I can tell from reading the source code with a debugger]
- I suspect the cause is that FileLock.isValid() is unreliable on NFS. The problem may be that AMQ Broker does not double-check the live file lock sufficiently (or it may be a bug in FileChannel/FileLock).
- From reading the FileLockNodeManager source code[2] and using the debugger, FileLockNodeManager relies on FileChannel and FileLock to lock the live file (data/journal/serverlock.1), and periodically checks whether the live file lock is still held using FileLock.isValid(). However, FileLock.isValid() seems unreliable, as described in the AMQ Broker source code comments[L510-L511].
- In the master server's debug log, "Server still has the lock, double check status is live" was logged repeatedly after the split brain[1], even after "kernel: NFS: __nfs4_reclaim_open_state: Lock reclaim failed!" was logged in /var/log/messages on the master server. The loss of the lock is reported at the NFS client layer, but FileLock.isValid() still returns true at the JVM layer.
- I also checked the backup server with the debugger, and FileLock.isValid()==true there as well for the live file lock (data/journal/serverlock.1). So the master and the backup each saw FileLock.isValid()==true during the split brain.
- I don't know whether getState()[L517] in FileLockNodeManager is intended to double-check the live lock, but it does not seem to help detect the split brain, at least for this problem.
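The behaviour described above fits the documented contract of java.nio.channels.FileLock: a lock object stays valid until it is released or its channel is closed, so isValid() reports JVM-local state only. The following is a minimal sketch of that contract on a local temporary file, not a reproduction of the NFS failure and not the broker's code. The class name LockValidityDemo and the ".demo" file are hypothetical and used only for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class: shows that FileLock.isValid() changes only on
// local events (release() or closing the channel).
class LockValidityDemo {

    /** Returns isValid() as observed {after acquiring, after release()}. */
    static boolean[] observe(Path lockFile) throws IOException {
        try (FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // Lock the first byte exclusively, similar in spirit to locking
            // a position in a server lock file.
            FileLock lock = channel.tryLock(0, 1, false);
            if (lock == null) {
                throw new IOException("lock is held by another process");
            }
            boolean afterAcquire = lock.isValid();
            lock.release();
            boolean afterRelease = lock.isValid();
            return new boolean[] {afterAcquire, afterRelease};
        }
    }

    public static void main(String[] args) throws IOException {
        Path lockFile = Files.createTempFile("serverlock", ".demo");
        try {
            boolean[] seen = observe(lockFile);
            System.out.println("valid after acquire: " + seen[0]);
            System.out.println("valid after release: " + seen[1]);
        } finally {
            Files.deleteIfExists(lockFile);
        }
    }
}
```

Nothing in this sequence consults the NFS server, so a lock that the server has already dropped can still look valid to the JVM until the process itself releases it or closes the channel.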
[1] master log
2022-09-08 11:54:07,440 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Server still has the lock, double check status is live
2022-09-08 11:54:07,440 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] getting state...
2022-09-08 11:54:07,440 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying to lock position: 0
2022-09-08 11:54:07,441 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] locked position: 0
2022-09-08 11:54:07,442 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] state: 76
2022-09-08 11:54:07,442 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Status is set to live
...
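For reading the log above: "state: 76" is a raw byte value, and 76 is the ASCII code for 'L', which matches the "Status is set to live" line that follows. The short sketch below only performs that conversion. That the state is stored in the lock file as an ASCII letter is an inference from this log, and the class name StateByteDemo is hypothetical.

```java
// Hypothetical helper for reading the "state: NN" debug lines.
class StateByteDemo {

    /** Interprets a logged state byte as an ASCII character. */
    static char asChar(byte state) {
        return (char) (state & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(asChar((byte) 76)); // prints L
    }
}
```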
[2] FileLockNodeManager source code
- is cloned by
-
ENTMQBR-7652 [EAP] Split brain occurs when to disconnect network between master and NFS server temporary
- Closed
-
ENTMQBR-7716 [LTS] Split brain occurs when to disconnect network between master and NFS server temporary
- Closed
-
ENTMQBR-7881 [LTS] Split brain occurs when to disconnect network between master and NFS server temporary
- Closed
- relates to
-
ENTMQBR-8078 Unhandled NullPointerException in JournalTransaction::forget
- Closed
-
ENTMQBR-7855 [QE] Create a test for ENTMQBR-7130
- New
- links to