RHEL-109543

Internal communication error during resource create causes controld to shut down and resource start to fail


    • Bug
    • Resolution: Unresolved
    • Critical
    • rhel-9.6
    • pacemaker
    • rhel-ha

      What were you trying to do that didn't work?

      • We created a resource and started it on pacemakerHost05. On pacemakerHost07, Pacemaker requested a start action for the IDLE resource, but the executor rejected it with: Resource 'pacemaker_idle_instanceUser_998_pacemakerHost07' not found (15 active resources)
        Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_lrm_rsc_op) notice: Requesting local execution of start operation for pacemaker_idle_instanceUser_998_pacemakerHost07 on pacemakerHost07 | transition_key=167:1296:0:89eff0d8-bc61-403e-ba05-d5f40f6f9b46 op_key=pacemaker_idle_instanceUser_998_pacemakerHost07_start_0
        Aug 12 18:20:56.286 pacemakerHost07 pacemaker-execd [22967] (process_lrmd_rsc_exec) info: Resource 'pacemaker_idle_instanceUser_998_pacemakerHost07' not found (15 active resources)
        Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_lrm_rsc_op) error: Could not initiate start action for resource pacemaker_idle_instanceUser_998_pacemakerHost07 locally: No such device | rc=19
        Aug 12 18:20:56.286 pacemakerHost07 pacemaker-execd [22967] (process_lrmd_get_rsc_info) info: Agent information for 'pacemaker_idle_instanceUser_998_pacemakerHost07' not in cache
        Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (process_lrm_event) error: Unable to record pacemaker_idle_instanceUser_998_pacemakerHost07_start_0 result in CIB: No resource information
        Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (log_executor_event) error: Result of start operation for pacemaker_idle_instanceUser_998_pacemakerHost07 on pacemakerHost07: Internal communication failure (No such device) | graph action unconfirmed; call=999999999 key=pacemaker_idle_instanceUser_998_pacemakerHost07_start_0
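
The failure signature in the lines above (pacemaker-execd reporting a resource as not found, followed by pacemaker-controld failing to initiate its start) can be searched for across a larger log. The following is a minimal sketch, not a Pacemaker tool: the regular expressions are assumptions derived only from the line layout visible in this excerpt, and the function names are hypothetical.

```python
import re

# Assumed layout, taken from the excerpt in this report:
# "<Mon DD HH:MM:SS.mmm> <host> <daemon> [<pid>] (<function>) <level>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:.]+) (?P<host>\S+) (?P<daemon>\S+) +\[(?P<pid>\d+)\] "
    r"\((?P<func>\w+)\) +(?P<level>\w+): (?P<msg>.*)$"
)
NOT_FOUND_RE = re.compile(r"Resource '(?P<rsc>[^']+)' not found")
START_FAIL_RE = re.compile(
    r"Could not initiate start action for resource (?P<rsc>\S+) locally"
)


def parse_line(line):
    """Split one log line into named fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def find_unregistered_starts(lines):
    """Return (timestamp, host, resource) for start failures on resources
    that pacemaker-execd had already reported as not found."""
    unknown = set()  # resources execd said it does not know
    failures = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if rec["daemon"] == "pacemaker-execd":
            match = NOT_FOUND_RE.search(rec["msg"])
            if match:
                unknown.add(match.group("rsc"))
        elif rec["daemon"] == "pacemaker-controld":
            match = START_FAIL_RE.search(rec["msg"])
            if match and match.group("rsc") in unknown:
                failures.append((rec["ts"], rec["host"], match.group("rsc")))
    return failures
```

Feeding the function the lines of a node's Pacemaker log would list each occurrence with its timestamp and host, which may help establish how often the intermittent failure happens.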

      Once this error message appears, pacemaker-controld shuts down after cancelling all resource actions:
      ```
      Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_recover) warning: Fast-tracking shutdown in response to errors
      ...
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 56 for pacemaker_ethmonitor_instanceUser_eth3 (pacemaker_ethmonitor_instanceUser_eth3:56)
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_ethmonitor_instanceUser_eth3_monitor_4000
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 96 for pacemaker_instancehost_instanceUser (pacemaker_instancehost_instanceUser:96)
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_instancehost_instanceUser_monitor_10000
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (services_action_cancel) info: Terminating in-flight op pacemaker_instancehost_instanceUser_monitor_10000[61199] early because it was cancelled
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (async_action_complete) info: pacemaker_instancehost_instanceUser_monitor_10000[61199] terminated with signal 9 (Killed)
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_instancehost_instanceUser_monitor_10000
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 76 for pacemaker_ethmonitor_instanceUser_eth5 (pacemaker_ethmonitor_instanceUser_eth5:76)
      Aug 12 18:20:56.290 pacemakerHost07 pacemaker-attrd [22968] (update_attr_on_host) notice: Setting last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0[pacemakerHost07] in instance_attributes: (unset) -> 1755040856 | from pacemakerHost06 with no write delay
      ...
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (lrmd_ipc_connection_destroy) info: Disconnected from local executor
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (pacemaker_cluster_disconnect) info: Disconnecting from corosync cluster layer
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (pacemaker__corosync_disconnect) notice: Disconnected from Corosync
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_ha_control) info: Disconnected from the cluster
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_cib_control) info: Waiting for resource update 95 to complete
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_cib_control) info: Waiting for resource update 95 to complete
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: --- 0.817.2 2
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: +++ 0.817.3 (null)
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: + /cib: @num_updates=3
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']: <nvpair id="status-3-last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07.start_0" name="last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0" value="1755040856"/>
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=local/client/2583, version=0.817.3)
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: --- 0.817.2 2
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: +++ 0.817.4 (null)
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: + /cib: @num_updates=4
      Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']: <nvpair id="status-3-last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07.start_0" name="last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0" value="1755040856"/>
      ...
      Aug 12 18:20:56.314 pacemakerHost07 pacemaker-controld [22970] (crmd_fast_exit) error: Could not recover from internal error
      Aug 12 18:20:56.314 pacemakerHost07 pacemaker-controld [22970] (crm_exit) info: Exiting pacemaker-controld | with status 1
      Aug 12 18:20:56.318 pacemakerHost07 pacemakerd [22953] (pacemaker_child_exit) error: pacemaker-controld[22970] exited with status 1 (Error occurred)
      Aug 12 18:20:56.318 pacemakerHost07 pacemakerd [22953] (pacemaker__ipc_is_authentic_process_active) info: Could not connect to crmd IPC: Connection refused

      ```
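
To see how quickly the daemon goes from the `do_recover` warning to `crm_exit`, the timestamps in the log above can be compared. This is a rough sketch under stated assumptions: the timestamp layout is inferred from this excerpt, the log lines carry no year so a placeholder `year` parameter is used, and `elapsed_ms` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumed prefix layout: "<Mon DD HH:MM:SS.mmm> <host> <daemon> [<pid>] (<function>)"
PREFIX_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) \S+ \S+ +\[\d+\] \((?P<func>\w+)\)"
)


def parse_ts(ts, year=2000):
    """Parse a log timestamp; the year is a placeholder because the log omits it."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S.%f")


def elapsed_ms(lines, start_func="do_recover", end_func="crm_exit", year=2000):
    """Milliseconds between the first start_func line and the next end_func line.

    Returns None if either line is missing.
    """
    start = None
    for line in lines:
        match = PREFIX_RE.match(line.strip())
        if not match:
            continue
        if start is None:
            if match.group("func") == start_func:
                start = parse_ts(match.group("ts"), year)
        elif match.group("func") == end_func:
            end = parse_ts(match.group("ts"), year)
            return round((end - start).total_seconds() * 1000)
    return None
```

Because only the difference between two timestamps is used, the placeholder year does not affect the result unless the interval spans a year boundary.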

      What is the impact of this issue to you?

      • Failure to create Pacemaker resource

        Please provide the package NVR for which the bug is seen:

      2.1.9.1

      How reproducible is this bug?:

      intermittently

      Steps to reproduce

      1. Create a new resource.
      2. Pacemaker reports the error shown above, and pacemaker-controld shuts down.
      3. pacemaker-controld respawns, but it does not attempt to start the newly created resources.
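
Since the failure is intermittent, the steps above could be repeated in a loop until the fatal message shows up. The sketch below is illustrative only and was not part of the original report: it assumes `pcs` is installed, that the local log lives at `/var/log/pacemaker/pacemaker.log`, and that an `ocf:pacemaker:Dummy` resource is enough to trigger the problem, which the report does not confirm. The resource names and helper functions are hypothetical. It watches only the local node's log, so it would need to run on the node where pacemaker-controld fails.

```python
import os
import subprocess
import time

# Message logged by crmd_fast_exit in the excerpt of this report.
FATAL_MARKER = "Could not recover from internal error"
# Assumed log location; adjust to the cluster's logging configuration.
DEFAULT_LOG = "/var/log/pacemaker/pacemaker.log"


def create_command(resource_id, agent="ocf:pacemaker:Dummy"):
    """Build the pcs command that creates one test resource."""
    return ["pcs", "resource", "create", resource_id, agent]


def read_new_log(path, offset):
    """Return (text appended since offset, new offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
        return data.decode("utf-8", errors="replace"), handle.tell()


def reproduce(attempts=20, settle_seconds=10.0, log_path=DEFAULT_LOG):
    """Create resources one at a time until the fatal message is logged.

    Returns the id of the resource whose creation preceded the failure,
    or None if no attempt triggered it.
    """
    offset = os.path.getsize(log_path)  # ignore anything logged before the run
    for index in range(attempts):
        resource_id = f"repro_dummy_{index}"  # hypothetical naming scheme
        subprocess.run(create_command(resource_id), check=True)
        time.sleep(settle_seconds)  # give the cluster time to schedule the start
        text, offset = read_new_log(log_path, offset)
        if FATAL_MARKER in text:
            return resource_id
    return None
```

The resources created by such a loop would need to be removed afterwards, and the settle time is a guess that may need tuning for a given cluster.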

      Expected results

      • Resource should go into the started state correctly on all hosts

        Actual results

      • Resources did not get started

              rhn-support-clumens Christopher Lumens
              donghohan@ibm.com Dongho Han
              Christopher Lumens Christopher Lumens
              Cluster QE Cluster QE