Type: Bug
Resolution: Done
Priority: Blocker
Fix Version/s: QP-3.20.4.SP2
Affects Version/s: 3.20.0.GA
Component/s: Camel
Labels:
- ts/tnb

Blocked: False
Blocked Reason: None
Ready: False
GSS Priority:
Steps to Reproduce:

Deploy two Camel Quarkus instances in separate containers (each its own JVM) using ClusteredRoutePolicyFactory with the file lock cluster service.

Ensure both access the same lock file on the host.

Allow one node to become leader.

Block filesystem access to the lock file for the leader using a firewall rule.

Confirm that the leader shuts down as expected.

Observe whether the standby instance becomes leader.

Actual Results:
Failover is inconsistent:
Sometimes the standby JVM becomes leader.
Sometimes it does not.
Behavior varies between test runs without configuration changes.

Expected Results:
Failover should be deterministic and reliable.
Standby instance should always become the leader when the active leader stops updating the lock file.

Intelligence Requested:
Market:
Risk Score: 0
Customer Impact: Customer Reported
Test Coverage: +
SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Failover in ClusteredRoutePolicyFactory using the FileLockClusterService is unreliable when running multiple JVMs. During disaster-recovery testing, the standby Camel instance does not consistently take over leadership after the active node loses access to the lock file. We believe the cause is that System.nanoTime() values are compared across JVM boundaries, which the System.nanoTime() Javadoc indicates is not meaningful.

We use Camel Quarkus with ClusteredRoutePolicyFactory and FileLockClusterService across two containerized JVMs. During disaster recovery testing, we simulate a failure by blocking access from the active Camel instance to the shared lock file. The active instance correctly shuts down, but the standby instance does not always assume leadership. The behavior is nondeterministic and varies across test runs.

After investigation, we found that leader staleness detection uses:

final long elapsed = currentNanoTime - previousObservedHeartbeat;

Where:

currentNanoTime is obtained from System.nanoTime() in the standby JVM

previousObservedHeartbeat is the heartbeat timestamp written using System.nanoTime() from the former leader JVM

However, according to the System.nanoTime() Javadoc:

“The same origin is used by all invocations of this method in an instance of a Java virtual machine; other virtual machine instances are likely to use a different origin. This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.”

This means System.nanoTime() values cannot be compared between JVMs, making the calculation of elapsed invalid and causing nondeterministic failover behavior.
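To illustrate the effect described above, here is a minimal, self-contained sketch. It is not Camel code: the class, the method names, and all the numbers are hypothetical, and each JVM's clock is modelled simply as "real time minus a per-JVM origin". It applies the same subtraction as the staleness check quoted above and shows that the outcome depends only on the unrelated origins of the two JVMs.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical model, not Camel code: shows why subtracting nanoTime values
// taken in two different JVMs gives an answer that depends on their origins.
class NanoTimeOriginDemo {

    // Models System.nanoTime() inside a JVM whose arbitrary origin is originNanos.
    static long nanoTimeIn(long originNanos, long realNanos) {
        return realNanos - originNanos;
    }

    // Same shape as the check quoted in this issue.
    static boolean looksStale(long currentNanoTime, long previousObservedHeartbeat, long timeoutNanos) {
        final long elapsed = currentNanoTime - previousObservedHeartbeat;
        return elapsed > timeoutNanos;
    }

    public static void main(String[] args) {
        long timeout = TimeUnit.SECONDS.toNanos(5);

        // The leader wrote its last heartbeat at "real" time 1000 s and then died.
        long heartbeatRealTime = TimeUnit.SECONDS.toNanos(1_000);
        long leaderOrigin = TimeUnit.SECONDS.toNanos(100);
        long heartbeatInFile = nanoTimeIn(leaderOrigin, heartbeatRealTime);

        // The standby checks 60 s later, well past the 5 s timeout.
        long checkRealTime = heartbeatRealTime + TimeUnit.SECONDS.toNanos(60);

        // Two possible standby JVMs that differ only in their nanoTime origin.
        long earlyOrigin = 0L;
        long lateOrigin = TimeUnit.SECONDS.toNanos(900);

        boolean staleSeenByEarly = looksStale(nanoTimeIn(earlyOrigin, checkRealTime), heartbeatInFile, timeout);
        boolean staleSeenByLate = looksStale(nanoTimeIn(lateOrigin, checkRealTime), heartbeatInFile, timeout);

        System.out.println("standby with early origin sees stale leader: " + staleSeenByEarly);
        System.out.println("standby with late origin sees stale leader:  " + staleSeenByLate);
    }
}
```

In this model the leader has been silent for 60 seconds in both cases, yet only one of the two standbys concludes that it is stale. The other computes a negative elapsed value and keeps waiting, which matches the kind of run-to-run variation reported here.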

Environment:
Red Hat Camel Quarkus
Camel ClusteredRoutePolicyFactory
FileLockClusterService
Two containerized JVMs (Podman) sharing a lock file

Disaster recovery test: block access to the lock file on one node via firewall

Root Cause Analysis:
Heartbeat timestamps written into the lock file use System.nanoTime() from the leader JVM.

Staleness detection compares that value with System.nanoTime() from a different JVM.

JVMs have different nanoTime origins, so the subtraction yields meaningless and unpredictable values.

This directly violates the JDK’s documented use of System.nanoTime() and explains the nondeterministic failover behavior.
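One way to avoid the cross-JVM comparison is sketched below. This is an illustration only, under stated assumptions, and is not necessarily the change that was shipped for this issue: LocalHeartbeatWatcher and isLeaderStale are hypothetical names, and the value read from the lock file is treated as an opaque token. The standby never subtracts the leader's value from its own clock. It only notes, on its own System.nanoTime() timeline, when the token last changed.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: staleness is measured purely on the observer's own clock.
class LocalHeartbeatWatcher {
    private final long timeoutNanos;
    private long lastHeartbeatValue;   // last token read from the lock file
    private long lastChangeLocalNanos; // local nanoTime when that token first appeared
    private boolean seenAny;

    LocalHeartbeatWatcher(long timeout, TimeUnit unit) {
        this.timeoutNanos = unit.toNanos(timeout);
    }

    // heartbeatFromFile: whatever value the leader last wrote (only compared for equality).
    // localNowNanos: System.nanoTime() taken in THIS JVM.
    boolean isLeaderStale(long heartbeatFromFile, long localNowNanos) {
        if (!seenAny || heartbeatFromFile != lastHeartbeatValue) {
            // First observation, or the leader made progress: restart the local timer.
            seenAny = true;
            lastHeartbeatValue = heartbeatFromFile;
            lastChangeLocalNanos = localNowNanos;
            return false;
        }
        // Both operands come from the same JVM, so the difference is a valid elapsed time.
        return localNowNanos - lastChangeLocalNanos > timeoutNanos;
    }
}
```

A standby would call isLeaderStale(valueReadFromLockFile, System.nanoTime()) on each polling cycle. The trade-off in this sketch is that detection takes at least one full timeout measured from the standby's first observation, because it cannot infer how old the heartbeat already was when it first read it.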

Attachments:
- 3_20_ceq-12238_23_12.zip (12.36 MB, 2025/12/23 2:33 PM)
- camel-file-4.10.3-SNAPSHOT.zip (909 kB, 2025/12/17 3:35 PM)

links to

RHEA-2026:157920 Red Hat Build of Apache Camel 4.10 for Quarkus 3.20 update is now available (RHBQ 3.20.4.SP2)

mentioned on

Merge request - CEQ-12238: Upgrade Camel Quarkus to 3.20.0.redhat-00011 and Quarkus CXF to 3.20.2.redhat-00012

Merge request - Quarkus failover

Assignee: James Netherton
Reporter: Michael Millson
QA Contact: Andrej Vano
Votes: 0
Watchers: 13
Created: 2025/12/03 1:29 PM
Updated: 2026/01/21 11:11 AM
Resolved: 2026/01/08 11:32 AM
