RHEL-83443

Unexpected node shutdown after rebooting of another host


    • Bug
    • Resolution: Duplicate
    • rhel-9.4
    • pacemaker
    • rhel-ha

      What were you trying to do that didn't work?  A cluster consists of 4 hosts: svtm501, svtm502, svtm503, svtm504.  svtm502 was the DC.  While host svtm504 was rebooting, Pacemaker was unexpectedly shut down on another host, svtm503.

      • A user triggered a reboot of host svtm504 at around Mar 11 04:24:17.
      • In the Pacemaker log on the DC (svtm502), host svtm504 was detected as down at Mar 11 04:24:17:

      Mar 11 04:24:17.215 svtm502 pacemaker-controld  [30919] (pcmk__update_peer_expected)    info: handle_request: Node svtm504[4] - expected state is now down (was member)

      Mar 11 04:24:17.215 svtm502 pacemaker-controld  [30919] (handle_shutdown_request)       info: Creating shutdown request for svtm504 (state=S_TRANSITION_ENGINE)

      • At Mar 11 04:24:18, node svtm503 was somehow also detected as shutting down.  THIS IS UNEXPECTED.

      Mar 11 04:24:18.271 svtm502 pacemaker-schedulerd[30918] (determine_online_status)       info: svtm503 is shutting down

      • One peculiar thing noted in the svtm501 Pacemaker log: the shutdown attribute could not be set earlier, when svtm503 was rebooted:

      Mar 11 03:39:00.227 svtm501 pacemaker-attrd     [24774] (write_attribute)       notice: Cannot update shutdown[svtm503]='1741678740' now because node's UUID is unknown (will retry if learned)

      • On svtm501, this shutdown attribute was then written later, at Mar 11 04:24:17.230, around the same time that Pacemaker was shutting down on node svtm503:

      Mar 11 04:24:17.230 svtm501 pacemaker-attrd     [24774] (attrd_cib_callback)    info: CIB update 4156 result for shutdown: OK | rc=0
      Mar 11 04:24:17.230 svtm501 pacemaker-attrd     [24774] (attrd_cib_callback)    info: * Wrote shutdown[svtm504]=1741681457
      Mar 11 04:24:17.230 svtm501 pacemaker-attrd     [24774] (attrd_cib_callback)    info: * Wrote shutdown[svtm503]=1741678740
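
The values of the shutdown attribute in the log lines above look like epoch timestamps, so decoding them may help show which shutdown request each value belongs to. The following is a minimal sketch, not part of the original report. It assumes the logs are written in a local timezone of UTC-4; that offset is an assumption, chosen only because it makes the value for svtm504 line up with its 04:24:17 log entry. The names `SHUTDOWN_ATTRS`, `ASSUMED_LOG_TZ`, and `to_log_time` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Attribute values copied from the "Wrote shutdown[...]" log lines.
SHUTDOWN_ATTRS = {"svtm503": 1741678740, "svtm504": 1741681457}

# ASSUMPTION: the hosts log in UTC-4. This is inferred, not stated in the report.
ASSUMED_LOG_TZ = timezone(timedelta(hours=-4))


def to_log_time(epoch: int) -> datetime:
    """Convert an epoch value to the (assumed) timezone used in the logs."""
    return datetime.fromtimestamp(epoch, tz=ASSUMED_LOG_TZ)


for node, epoch in sorted(SHUTDOWN_ATTRS.items()):
    print(node, to_log_time(epoch).strftime("%H:%M:%S"))

# Age of the svtm503 value at the moment the svtm504 value was generated.
age_seconds = SHUTDOWN_ATTRS["svtm504"] - SHUTDOWN_ATTRS["svtm503"]
print("svtm503 value is older by", age_seconds, "seconds")
```

Under that assumption, the value written for svtm503 decodes to 03:39:00, the time of the earlier "Cannot update shutdown[svtm503]" message, and is 2717 seconds (about 45 minutes) older than the value for svtm504. This only shows that the two writes carry values from different times; it does not by itself establish the cause of the shutdown.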

      I am wondering whether the failed attempt, and the resulting delay in writing the shutdown attribute for node svtm503, caused the unexpected Pacemaker shutdown on svtm503.  If not, what else could have caused it?

      We noticed that this started to occur in 2.1.9-1.

      What is the impact of this issue to you?  Unexpected cluster outage when a node is rebooted.

      Please provide the package NVR for which the bug is seen:

      Pacemaker 2.1.9-1

      How reproducible is this bug?:  Seen only once so far.

      Steps to reproduce

      1. Set up cluster of 4 nodes
      2. Reboot a node.  Wait for it to recover successfully after the host comes back online.
      3. Reboot another node.  Pacemaker is shut down on the previously rebooted node.  So far, this has been seen only once.
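
Because the problem was seen only once, it may help to scan the Pacemaker logs after each reproduction attempt for the same pattern reported above: a shutdown attribute write that is deferred and then written later with the same value. The following is a rough sketch under stated assumptions. The regular expressions are derived only from the log lines quoted in this report and may not match other Pacemaker versions or log formats; `find_delayed_shutdown_writes` is a hypothetical helper, not a Pacemaker tool.

```python
import re

# Line layout inferred from the log excerpts in this report (an assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:.]+) (?P<host>\S+) (?P<daemon>[\w-]+)\s*\[\d+\] "
    r"\((?P<func>\w+)\)\s+(?P<level>\w+): (?P<msg>.*)$"
)
DEFERRED_RE = re.compile(r"Cannot update shutdown\[(?P<node>[^\]]+)\]='(?P<value>\d+)'")
WROTE_RE = re.compile(r"Wrote shutdown\[(?P<node>[^\]]+)\]=(?P<value>\d+)")


def find_delayed_shutdown_writes(lines):
    """Return (node, value, deferred_ts, written_ts) for each shutdown
    attribute value that was first deferred and later written unchanged."""
    deferred = {}  # (node, value) -> timestamp of the deferred attempt
    hits = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue
        d = DEFERRED_RE.search(m["msg"])
        if d:
            deferred[(d["node"], d["value"])] = m["ts"]
            continue
        w = WROTE_RE.search(m["msg"])
        if w and (w["node"], w["value"]) in deferred:
            key = (w["node"], w["value"])
            hits.append((w["node"], int(w["value"]), deferred[key], m["ts"]))
    return hits


# Usage with the svtm501 lines quoted in this report.
sample = [
    "Mar 11 03:39:00.227 svtm501 pacemaker-attrd     [24774] (write_attribute)       notice: Cannot update shutdown[svtm503]='1741678740' now because node's UUID is unknown (will retry if learned)",
    "Mar 11 04:24:17.230 svtm501 pacemaker-attrd     [24774] (attrd_cib_callback)    info: * Wrote shutdown[svtm504]=1741681457",
    "Mar 11 04:24:17.230 svtm501 pacemaker-attrd     [24774] (attrd_cib_callback)    info: * Wrote shutdown[svtm503]=1741678740",
]
result = find_delayed_shutdown_writes(sample)
print(result)
```

On the sample lines, only svtm503 is reported, since the value for svtm504 was never deferred. To use this on a real system, the same function could be fed the lines of the Pacemaker log file from each node.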

      Expected results:  No unexpected Pacemaker shutdown on another node when one node is rebooted.

      Actual results: Pacemaker was shut down on another node.

              rhn-support-clumens Christopher Lumens
              lpham@ca.ibm.com Lan Pham
              IBM Confidential Group
              Christopher Lumens
              Cluster QE