RHEL-87618

Pacemaker shut down unexpectedly during resource cleanup


    • Bug
    • Resolution: Unresolved
    • rhel-9.4.z
    • pacemaker
    • rhel-ha
    • x86_64

      What were you trying to do that didn't work?  During the remove-host operation, in which we remove the resource definitions and constraints defined on a host, Pacemaker hit an error and unexpectedly shut itself down.  Here are the errors logged around the shutdown event:

      Apr 15 18:58:29.866 svtlnxps05 pacemaker-controld  [53865] (do_lrm_rsc_op)      error: Could not initiate start action for resource db2_instancehost_jstamko2 locally: No such device | rc=19

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-execd     [53862] (process_lrmd_get_rsc_info)  info: Agent information for 'db2_instancehost_jstamko2' not in cache

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (process_lrm_event)  error: Unable to record db2_instancehost_jstamko2_start_0 result in CIB: No resource information

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (log_executor_event)         error: Result of start operation for db2_instancehost_jstamko2 on svtlnxps05: Internal communication failure (No such device) | graph action unconfirmed; call=999999999 key=db2_instancehost_jstamko2_start_0

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (register_fsa_error_adv)     info: Resetting the current action list

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (do_log)     warning: Input I_FAIL received in state S_NOT_DC from do_lrm_rsc_op

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (do_state_transition)        notice: State transition S_NOT_DC -> S_RECOVERY | input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (do_recover)         warning: Fast-tracking shutdown in response to errors

      Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (do_log)     error: Input I_TERMINATE received in state S_RECOVERY from do_recover

      Apr 15 18:58:29.878 svtlnxps05 pacemaker-controld  [53865] (crmd_fast_exit)     error: Could not recover from internal error

      Apr 15 18:58:29.882 svtlnxps05 pacemaker-controld  [53865] (crm_exit)   info: Exiting pacemaker-controld | with status 1

      Apr 15 18:58:29.882 svtlnxps05 pacemakerd          [53853] (pcmk_child_exit)    error: pacemaker-controld[53865] exited with status 1 (Error occurred)
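
To make excerpts like the one above easier to triage, the entries can be split into fields and filtered by severity. The following is a minimal sketch, assuming every entry follows the layout seen in these lines (timestamp, host, daemon, [pid], (function), level: message). The regular expression and the helper names `parse_line` and `problems` are illustrative assumptions and are not part of Pacemaker.

```python
import re

# Layout assumed from the excerpt in this report:
#   <Mon DD HH:MM:SS.mmm> <host> <daemon> [<pid>] (<function>) <level>: <message>
LOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:.]+)\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<daemon>\S+)\s+"
    r"\[(?P<pid>\d+)\]\s+"
    r"\((?P<func>[^)]+)\)\s+"
    r"(?P<level>\w+):\s+"
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["pid"] = int(entry["pid"])
    return entry


def problems(lines, levels=("error", "warning")):
    """Return parsed entries whose level is in `levels`, in original order."""
    parsed = (parse_line(line) for line in lines)
    return [e for e in parsed if e is not None and e["level"] in levels]


# Three entries copied from the excerpt in this report.
SAMPLE = [
    "Apr 15 18:58:29.866 svtlnxps05 pacemaker-controld  [53865] (do_lrm_rsc_op)      "
    "error: Could not initiate start action for resource db2_instancehost_jstamko2 "
    "locally: No such device | rc=19",
    "Apr 15 18:58:29.870 svtlnxps05 pacemaker-execd     [53862] (process_lrmd_get_rsc_info)  "
    "info: Agent information for 'db2_instancehost_jstamko2' not in cache",
    "Apr 15 18:58:29.870 svtlnxps05 pacemaker-controld  [53865] (do_recover)         "
    "warning: Fast-tracking shutdown in response to errors",
]
```

Calling `problems(SAMPLE)` keeps the `do_lrm_rsc_op` error and the `do_recover` warning and drops the `info` entry, which narrows the excerpt to the failed start action and the shutdown it triggered.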

      What is the impact of this issue to you?  The drop node operation failed.

      Please provide the package NVR for which the bug is seen: 2.1.9-1

      How reproducible is this bug?: Not sure, only hit once so far.

      Steps to reproduce

      1. Set up a cluster consisting of 4 hosts
      2. From one host, run operations that delete the resource definitions and remove the resource constraints for another host
      3. When the error is hit, Pacemaker shuts down and restarts itself.  However, the node was fenced during the process, which resulted in unexpected behaviour

      Expected results: No errors when cleaning up resources on a host

      Actual results: Pacemaker unexpectedly shut down and restarted

              rhn-engineering-cfeist Chris Feist
              kwonmin.bok@ibm.com Kwonmin Bok (Inactive)
              Chris Feist Chris Feist
              HA Sustaining HA Sustaining