RHEL-101392

[RFE] db2: Suppress promote timeout in Pacemaker db2 RA when HADR standby is in REMOTE_CATCHUP_PENDING state and log replay is in progress


    • Bug
    • Resolution: Unresolved
    • resource-agents
    • Moderate
    • Customer Escalated
    • rhel-ha

      Description:

      When the Pacemaker db2 resource agent (RA) initiates a promote operation on a standby node, it uses the timeout parameter to control how long it should wait for the promotion to succeed.

      However, during Db2 HADR failover scenarios, especially when the standby is in:

       

      HADR state: STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED

      …the takeover can appear to hang because Db2 is still performing log replay, which can take several minutes depending on the log gap and workload.

      Current Behavior:

      • The promote action is terminated after the configured timeout.
      • This results in unnecessary failover aborts, fencing, or retries, even though Db2 is still actively progressing through log replay during takeover.

      Requested Enhancement:

      Amend the promote action logic in the resource agent to:

      1. Check the HADR state (using db2pd -hadr or an equivalent internal method).
      2. If the state is STANDBY/REMOTE_CATCHUP_PENDING and log replay is still progressing (as inferred from the replay log LSN or a known active state),
      3. then suppress or extend the timeout, allowing the Db2 takeover to complete gracefully.
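
      The check in steps 1 and 2 could be sketched roughly as follows. This is only an illustration, not code from the db2 RA: the helper names hadr_field and replay_progressing are hypothetical, and the field names HADR_STATE and STANDBY_REPLAY_LOG_FILE,PAGE,POS are assumed to match the "KEY = value" output of db2pd -hadr on the deployed Db2 version, which should be verified. The helpers read captured output rather than calling db2pd themselves, so they can be tried without a database.

```shell
# Sketch only: helper names are hypothetical and not part of the db2 RA.

# Print the value of one "KEY = value" field from db2pd -hadr style text on stdin.
hadr_field() {
    awk -v key="$1" '
        {
            sep = index($0, " = ")
            if (sep == 0) next
            name = substr($0, 1, sep - 1)
            value = substr($0, sep + 3)
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }'
}

# Succeed when the standby replay position changed between two samples.
# $1 = earlier db2pd -hadr output, $2 = later db2pd -hadr output
replay_progressing() {
    before=$(printf '%s\n' "$1" | hadr_field 'STANDBY_REPLAY_LOG_FILE,PAGE,POS')
    after=$(printf '%s\n' "$2" | hadr_field 'STANDBY_REPLAY_LOG_FILE,PAGE,POS')
    [ -n "$after" ] && [ "$before" != "$after" ]
}

# Possible use inside the RA (assumes $db holds the database name):
#   sample1=$(db2pd -db "$db" -hadr); sleep 10; sample2=$(db2pd -db "$db" -hadr)
#   replay_progressing "$sample1" "$sample2" && echo "replay still advancing"
```

      Comparing two samples of the replay position is one way to infer progress; a stalled standby would report the same position twice and would not be treated as progressing.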

      Code Context:

      Relevant RA: resource-agents/heartbeat/db2

      Target function:
      Lines ~557–560, inside the promote operation logic.

       

      # promote action
      # db2 takeover ... currently uses timeout

      Suggested Hook:

      Add a conditional wrapper like:

      ~~~

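      # Sketch only: hadr_state and log_replay_active are placeholders, not existing RA names.
      if [ "$hadr_state" = "REMOTE_CATCHUP_PENDING" ] && log_replay_active; then
          # sleep, monitor replay progress, and continue waiting
          :
      else
          # proceed, or exit on timeout
          :
      fi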

      ~~~

      This ensures that ongoing, valid log replay is not treated as a failure.
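
      The wait-or-abort decision itself can be isolated into a small function, which makes the rule easy to test outside a cluster. The sketch below is an assumption-laden illustration, not the RA's implementation: promote_wait_decision and its arguments are hypothetical names, and the hard cap is an addition of this sketch so that a takeover that never finishes cannot wait indefinitely.

```shell
# Sketch only: function and argument names are hypothetical, not taken from the db2 RA.
# Prints "wait" when the promote action should keep waiting, "give_up" otherwise.
# $1 = HADR state, $2 = previous replay position, $3 = current replay position,
# $4 = seconds elapsed, $5 = hard cap in seconds (an assumption of this sketch)
promote_wait_decision() {
    if [ "$4" -ge "$5" ]; then
        echo give_up        # never wait past the hard cap
    elif [ "$1" = "REMOTE_CATCHUP_PENDING" ] && [ -n "$3" ] && [ "$2" != "$3" ]; then
        echo wait           # replay position moved: takeover is still progressing
    else
        echo give_up        # different state, or replay stalled
    fi
}
```

      A caller would sample the HADR state and replay position periodically and stop as soon as the function prints give_up, at which point the existing timeout handling applies.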

      Business Justification:

      • One of our customers (Account ID: 402911), who is actively deploying Pacemaker-based Db2 HADR, has observed this behavior (support case 04177362).
      • The customer requires this enhancement to avoid false negatives during automatic failover scenarios.

       

              rhn-engineering-oalbrigt Oyvind Albrigtsen
              rh-ee-dmule Dhananjay Mule
              Oyvind Albrigtsen Oyvind Albrigtsen
              Cluster QE Cluster QE