Bug
Resolution: Done
Blocker
3.20.0.GA
Customer Reported
Failover in ClusteredRoutePolicyFactory using the FileLockClusterService is unreliable when running multiple JVMs. During disaster-recovery testing, the standby Camel instance does not consistently take over leadership after the active node loses access to the lock file. We believe this is caused by comparing System.nanoTime() values across JVM boundaries, which the JDK documentation states is not meaningful.
We use Camel Quarkus with ClusteredRoutePolicyFactory and FileLockClusterService across two containerized JVMs. During disaster recovery testing, we simulate a failure by blocking access from the active Camel instance to the shared lock file. The active instance correctly shuts down, but the standby instance does not always assume leadership. The behavior is nondeterministic and varies across test runs.
After investigation, we found that leader staleness detection uses:
final long elapsed = currentNanoTime - previousObservedHeartbeat;
Where:
currentNanoTime is obtained from System.nanoTime() in the standby JVM
previousObservedHeartbeat is the heartbeat timestamp written using System.nanoTime() from the former leader JVM
However, according to the System.nanoTime() Javadoc:
“The same origin is used by all invocations of this method in an instance of a Java virtual machine; other virtual machine instances are likely to use a different origin. This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.”
This means System.nanoTime() values cannot be compared between JVMs, making the calculation of elapsed invalid and causing nondeterministic failover behavior.
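To illustrate the effect, here is a minimal sketch that models two JVMs with different nanoTime origins. The class, the helper nanoTimeInJvm, and all numeric values are hypothetical and chosen only for illustration. This is not the Camel code, only the same subtraction applied to made-up timestamps.

```java
public class NanoTimeOriginDemo {

    // Hypothetical model: each JVM reports nanoseconds since its own arbitrary origin.
    static long nanoTimeInJvm(long trueNanos, long jvmOrigin) {
        return trueNanos - jvmOrigin;
    }

    // The same subtraction as in the staleness check quoted above.
    static long elapsed(long currentNanoTime, long previousObservedHeartbeat) {
        return currentNanoTime - previousObservedHeartbeat;
    }

    public static void main(String[] args) {
        long leaderOrigin = 5_000_000_000L;         // assumed origin of the leader JVM
        long standbyOrigin = 90_000_000_000L;       // assumed origin of the standby JVM
        long heartbeatWrittenAt = 100_000_000_000L; // "true" time of the last heartbeat
        long checkedAt = 130_000_000_000L;          // "true" time of the check, 30 s later

        // Value the leader would have written into the lock file.
        long heartbeat = nanoTimeInJvm(heartbeatWrittenAt, leaderOrigin);

        // Same JVM: the origins cancel out and the result is the real 30 s.
        long sameJvm = elapsed(nanoTimeInJvm(checkedAt, leaderOrigin), heartbeat);

        // Different JVM: the result is shifted by the difference between the origins.
        long crossJvm = elapsed(nanoTimeInJvm(checkedAt, standbyOrigin), heartbeat);

        System.out.println("same JVM elapsed ns:  " + sameJvm);
        System.out.println("cross JVM elapsed ns: " + crossJvm);
    }
}
```

With these assumed origins the cross-JVM result is negative, so a heartbeat that is 30 seconds old would never look stale. With other origins the result could just as easily be far too large, which is consistent with the varying behavior across test runs.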
Environment:
Red Hat Camel Quarkus
Camel ClusteredRoutePolicyFactory
FileLockClusterService
Two containerized JVMs (Podman) sharing a lock file
Disaster recovery test: block access to the lock file on one node via firewall
Root Cause Analysis:
Heartbeat timestamps written into the lock file use System.nanoTime() from the leader JVM.
Staleness detection compares that value with System.nanoTime() from a different JVM.
JVMs have different nanoTime origins, so the subtraction yields meaningless and unpredictable values.
This contradicts the documented contract of System.nanoTime(), which only supports measuring elapsed time within a single JVM, and explains the nondeterministic failover behavior.
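One possible way to avoid the cross-JVM comparison, sketched here as an assumption and not as the fix that was actually applied, is to treat the heartbeat in the lock file as an opaque value and measure staleness only with the observer's own clock. The class HeartbeatStalenessTracker and its method names are hypothetical.

```java
public class HeartbeatStalenessTracker {

    private final long timeoutNanos;
    private long lastSeenHeartbeat;
    private long lastChangeLocalNanos;
    private boolean initialized;

    public HeartbeatStalenessTracker(long timeoutNanos) {
        this.timeoutNanos = timeoutNanos;
    }

    /**
     * @param heartbeatValue value read from the lock file, compared only for equality
     * @param localNow       System.nanoTime() taken in the observing JVM
     * @return true if the heartbeat has not changed for at least timeoutNanos
     */
    public boolean isLeaderStale(long heartbeatValue, long localNow) {
        if (!initialized || heartbeatValue != lastSeenHeartbeat) {
            // First observation, or the leader wrote a new heartbeat: restart the local timer.
            initialized = true;
            lastSeenHeartbeat = heartbeatValue;
            lastChangeLocalNanos = localNow;
            return false;
        }
        // Both operands come from the same JVM, so the subtraction is valid.
        return localNow - lastChangeLocalNanos >= timeoutNanos;
    }
}
```

A standby node would call isLeaderStale(valueFromLockFile, System.nanoTime()) on each polling cycle. The trade-off is that staleness is measured from the moment the standby first sees an unchanged value, so detection can take up to one polling interval longer than the configured timeout.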
- links to
RHEA-2026:157920
Red Hat Build of Apache Camel 4.10 for Quarkus 3.20 update is now available (RHBQ 3.20.4.SP2)