Red Hat Build of Apache Camel for Quarkus
CEQ-12238

Failover in ClusteredRoutePolicyFactory using the FileLockClusterService is unreliable when running multiple JVMs


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version: QP-3.20.4.SP2
    • Affects Version: 3.20.0.GA
    • Component: Camel
    • Steps to Reproduce:

      1. Deploy two Camel Quarkus instances in separate containers (each in its own JVM) using ClusteredRoutePolicyFactory with the file lock cluster service (see the configuration sketch after these steps).

      2. Ensure both instances access the same lock file on the host.

      3. Allow one node to become leader.

      4. Block filesystem access to the lock file for the leader using a firewall rule; the leader shuts down as expected.

      5. Observe whether the standby instance becomes leader.
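
      A minimal configuration sketch of the kind of setup described in step 1 is shown below. It assumes programmatic registration from a RouteBuilder; the node id, the /mnt/camel-cluster lock directory, the 1-second lock timings, the "my-ns" namespace, and the timer route are illustrative placeholders rather than values from the affected deployment, and the exact package of ClusteredRoutePolicyFactory may differ between Camel versions.

      import java.util.concurrent.TimeUnit;

      import org.apache.camel.builder.RouteBuilder;
      import org.apache.camel.component.file.cluster.FileLockClusterService;
      import org.apache.camel.support.cluster.ClusteredRoutePolicyFactory;

      public class ClusteredRoutes extends RouteBuilder {
          @Override
          public void configure() throws Exception {
              // Both containers must mount the same host directory at this path.
              FileLockClusterService service = new FileLockClusterService();
              service.setId("node-1");                        // unique id per instance (placeholder)
              service.setRoot("/mnt/camel-cluster");          // shared lock directory (placeholder)
              service.setAcquireLockDelay(1, TimeUnit.SECONDS);
              service.setAcquireLockInterval(1, TimeUnit.SECONDS);
              getContext().addService(service);

              // Routes in this namespace run only on the current cluster leader.
              getContext().addRoutePolicyFactory(ClusteredRoutePolicyFactory.forNamespace("my-ns"));

              from("timer:clustered?period=5000")
                  .routeId("clustered-route")
                  .log("This instance currently holds the leader lock");
          }
      }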

      Actual Results:
      Failover is inconsistent:
      Sometimes the standby JVM becomes leader.
      Sometimes it does not.
      Behavior varies between test runs without configuration changes.

      Expected Results:
      Failover should be deterministic and reliable.
      Standby instance should always become the leader when the active leader stops updating the lock file.

    • Customer Reported
    • Description:

      Failover in ClusteredRoutePolicyFactory using the FileLockClusterService is unreliable when running multiple JVMs. During disaster-recovery testing, the standby Camel instance does not consistently take over leadership after the active node loses access to the lock file. We believe this is caused by comparing System.nanoTime() values across JVM boundaries, which the JDK documentation states is not valid.

      We use Camel Quarkus with ClusteredRoutePolicyFactory and FileLockClusterService across two containerized JVMs. During disaster recovery testing, we simulate a failure by blocking access from the active Camel instance to the shared lock file. The active instance correctly shuts down, but the standby instance does not always assume leadership. The behavior is nondeterministic and varies across test runs.

      After investigation, we found that leader staleness detection uses:

      final long elapsed = currentNanoTime - previousObservedHeartbeat;

      Where:

      currentNanoTime is obtained from System.nanoTime() in the standby JVM

      previousObservedHeartbeat is the heartbeat timestamp written using System.nanoTime() from the former leader JVM

      However, according to the System.nanoTime() Javadoc:

      The same origin is used by all invocations of this method in an instance of a Java virtual machine; other virtual machine instances are likely to use a different origin. This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.

      This means System.nanoTime() values cannot be compared between JVMs, making the calculation of elapsed invalid and causing nondeterministic failover behavior.
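
      To make this concrete, the following standalone sketch (a hypothetical demo class, not part of Camel or of our application) can be run once in each container. The nanoTime values printed by the two JVMs have unrelated origins and cannot be meaningfully subtracted from one another, while the currentTimeMillis values share the Unix epoch as their origin.

      // NanoTimeOriginDemo.java - run this in two separate JVMs and compare the output.
      public class NanoTimeOriginDemo {
          public static void main(String[] args) {
              long nano = System.nanoTime();               // origin is arbitrary and differs per JVM
              long wallClock = System.currentTimeMillis(); // origin is the Unix epoch, shared by all JVMs
              System.out.printf("nanoTime=%d currentTimeMillis=%d%n", nano, wallClock);
              // Subtracting the nanoTime values printed by two different JVMs yields a
              // meaningless number; subtracting the currentTimeMillis values yields the
              // real wall-clock difference (subject to clock synchronization between hosts).
          }
      }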

      Environment:
      Red Hat Camel Quarkus
      Camel ClusteredRoutePolicyFactory
      FileLockClusterService
      Two containerized JVMs (Podman) sharing a lock file

      Disaster recovery test: block access to the lock file on one node via firewall 

      Root Cause Analysis:
      Heartbeat timestamps written into the lock file use System.nanoTime() from the leader JVM.

      Staleness detection compares that value with System.nanoTime() from a different JVM.

      JVMs have different nanoTime origins, so the subtraction yields meaningless and unpredictable values.

      This directly violates the documented contract of System.nanoTime() and explains the nondeterministic failover behavior.
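
      One possible remediation, sketched below purely for illustration (it is not necessarily the fix that was shipped for this issue), is to base staleness detection on a timestamp both JVMs observe from the shared filesystem, such as the lock file's last-modified time, rather than on System.nanoTime() values from different JVMs. The class name and the 30-second threshold are hypothetical.

      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.time.Duration;
      import java.time.Instant;

      // Illustrative staleness check: the heartbeat is the lock file's last-modified
      // time, which both JVMs read from the same filesystem, so no per-JVM
      // System.nanoTime() origin is involved.
      public class LockFileStalenessCheck {
          private static final Duration STALE_AFTER = Duration.ofSeconds(30); // hypothetical threshold

          public static boolean isLeaderStale(Path lockFile) throws IOException {
              Instant lastHeartbeat = Files.getLastModifiedTime(lockFile).toInstant();
              Duration elapsed = Duration.between(lastHeartbeat, Instant.now());
              return elapsed.compareTo(STALE_AFTER) > 0;
          }
      }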

              James Netherton (jnethert@redhat.com)
              Michael Millson (rhn-support-mmillson)
              Andrej Vano
              Votes: 0
              Watchers: 13
