In the multiple failover/failback test, where the live server is killed, the backup server fails to activate on NFS 4.1.
I've managed to monitor the server.lock file during this scenario.
While live (pid 9001) is running and backup (pid 12127) is waiting for live to fail, lsof shows:
[hudson@messaging-12 journal]$ lsof server.lock
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 9001 hudson 470u REG 0,41 19 138 server.lock
java 12127 hudson 463u REG 0,41 19 138 server.lock
After live is killed, only backup still has an open FD on this file:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 12127 hudson 463u REG 0,41 19 138 server.lock
So everything looks good from the file descriptor side, but backup fails to detect that live has failed and does not become live.
We also have a thread dump from this run.
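An open file descriptor does not by itself say whether a byte-range lock on server.lock is still held, so a small probe might help narrow this down. The following is only a sketch: `LockProbe` and `canLock` are hypothetical names and not part of Artemis, and the byte position to probe is an assumption that has to be supplied by whoever runs it. It uses the standard `java.nio` `FileChannel.tryLock` call. Note that when the byte is free, the probe briefly takes the lock itself before releasing it, so it should be used with care against a running live/backup pair.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical diagnostic helper, not part of Artemis.
public class LockProbe {

    /**
     * Tries to take an exclusive one-byte lock at the given position.
     * Returns true if the lock could be acquired (it is released again
     * immediately), false if something else currently holds it.
     */
    public static boolean canLock(Path file, long position) throws IOException {
        try (FileChannel channel = FileChannel.open(
                file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock(position, 1, false);
            if (lock == null) {
                return false; // held by another process
            }
            lock.release();
            return true;
        } catch (OverlappingFileLockException e) {
            return false; // held by another channel in this same JVM
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: LockProbe <lock-file> <byte-position>");
            return;
        }
        Path file = Paths.get(args[0]);
        long position = Long.parseLong(args[1]); // assumption: caller knows which byte to probe
        System.out.println("byte " + position + " of " + file + " lockable: "
                + canLock(file, position));
    }
}
```

Running this from the backup host after live has been killed would indicate whether the NFS server still considers the probed byte locked, which is the state the lsof output cannot confirm.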
I tried to reproduce the same issue on NFS 4.0 and everything seems to be working fine.
Customer impact: the availability of an HA topology will be questionable, as the failover mechanism is not reliable on NFS 4.1.
This is a regression against EAP 7.0, where we are not able to reproduce this issue.
We didn't encounter this before because this scenario previously failed on https://issues.jboss.org/browse/JBEAP-10704
Is related to: JBEAP-12872 Artemis is not be able to guarantee HA on NFSv4 on RHEL 7.4 (Closed)