-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
None
-
No
-
Moderate
-
Customer Escalated
-
rhel-ha
-
3
-
False
-
False
-
Description:
When the Pacemaker db2 resource agent (RA) initiates a promote operation on a standby node, it uses the timeout parameter to control how long it should wait for the promotion to succeed.
However, during Db2 HADR failover scenarios, especially when the standby reports HADR state STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED, the takeover can appear to hang because Db2 is still performing log replay, which can take several minutes depending on the log gap and workload.
Current Behavior:
- The promote action is terminated after the configured timeout.
- This results in unnecessary failover aborts, fencing, or retries, even though Db2 is still actively progressing through log replay during takeover.
Requested Enhancement:
Amend the promote action logic in the resource agent to:
- Check the HADR state (using db2pd -hadr or equivalent internal method).
- If the state is STANDBY/REMOTE_CATCHUP_PENDING and log replay is still progressing (as inferred from the replay log LSN or a known active state), suppress or extend the timeout so that the Db2 takeover can complete gracefully.
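The two checks requested above could be built on small parsing helpers. The following is a minimal sketch, not existing RA code: the function names hadr_field and replay_pos are hypothetical, and the sketch assumes that db2pd -hadr prints "KEY = VALUE" lines, including HADR_STATE and STANDBY_REPLAY_LOG_FILE,PAGE,POS, which should be verified against the Db2 version in use.

~~~shell
#!/bin/sh
# Sketch only: parse "db2pd -hadr" style output read from stdin.
# hadr_field and replay_pos are hypothetical helpers, not part of the db2 RA.

# hadr_field KEY: print the value of the first line whose key equals KEY.
hadr_field() {
    awk -F' = ' -v key="$1" '
        { k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k) }
        k == key { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); print v; exit }
    '
}

# replay_pos: print the numeric position (third item) of the
# STANDBY_REPLAY_LOG_FILE,PAGE,POS field (assumed "file, page, pos" layout).
replay_pos() {
    hadr_field "STANDBY_REPLAY_LOG_FILE,PAGE,POS" | awk -F', *' '{ print $3 }'
}
~~~

In the agent this might be fed as `db2pd -hadr -db "$db" | hadr_field HADR_STATE`. Two replay_pos readings taken a few seconds apart and found to differ would be one way to infer that log replay is still progressing.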
Code Context:
Relevant RA: resource-agents/heartbeat/db2
Target function:
Lines ~557–560, inside the promote operation logic:
{{# promote action}}
Add a conditional wrapper like:
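~~~
# Fragment meant to sit inside the promote polling loop.
# hadr_state, log_replay_active and poll_interval are placeholder names.
if [ "$hadr_state" = "REMOTE_CATCHUP_PENDING" ] && log_replay_active; then
    # Replay is still advancing: sleep, re-check progress, keep waiting.
    sleep "$poll_interval"
    continue
else
    # No progress, or a different state: proceed/exit on timeout as today.
    break
fi
~~~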
This ensures that ongoing, valid log replay is not treated as a failure.
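One way to tell valid replay from a real hang is to give up only when the replay position stops moving. The sketch below illustrates that idea under stated assumptions: wait_for_takeover is a hypothetical function, and its first two arguments are caller-supplied commands that print the current HADR state and the current replay position. Neither exists in the RA today.

~~~shell
#!/bin/sh
# Sketch only: wait while replay advances, give up when it stalls.
# Usage: wait_for_takeover GET_STATE GET_POS INTERVAL MAX_STALLS
#   GET_STATE / GET_POS: hypothetical commands printing the HADR state
#   and the replay position. INTERVAL is in seconds.
# Returns 0 once the state leaves REMOTE_CATCHUP_PENDING,
# 1 after MAX_STALLS consecutive polls without replay progress.
wait_for_takeover() {
    wft_get_state=$1
    wft_get_pos=$2
    wft_interval=$3
    wft_max_stalls=$4
    wft_last_pos=-1
    wft_stalls=0

    while [ "$($wft_get_state)" = "REMOTE_CATCHUP_PENDING" ]; do
        wft_pos=$($wft_get_pos)
        if [ "$wft_pos" != "$wft_last_pos" ]; then
            # Replay advanced since the last poll: reset the stall counter.
            wft_stalls=0
            wft_last_pos=$wft_pos
        else
            wft_stalls=$((wft_stalls + 1))
            if [ "$wft_stalls" -ge "$wft_max_stalls" ]; then
                return 1
            fi
        fi
        sleep "$wft_interval"
    done
    return 0
}
~~~

A loop like this still runs inside the promote operation timeout that Pacemaker enforces on the agent, so that timeout would need to be large enough to cover the longest replay the deployment should tolerate.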
Business Justification:
- One of our customers (Account ID: 402911) is actively deploying Pacemaker-based Db2 HADR and has observed this behavior (support case 04177362).
- The customer requires this enhancement to avoid false negatives during automatic failover scenarios.