- Bug
- Resolution: Unresolved
- Critical
- rhel-9.6
- rhel-ha
What were you trying to do that didn't work?
- We created a resource and started it on pacemakerHost05, but on pacemakerHost07 Pacemaker triggered a start action for the idle resource, which failed with "Resource 'pacemaker_idle_instanceUser_998_pacemakerHost07' not found (15 active resources)":
```
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_lrm_rsc_op) notice: Requesting local execution of start operation for pacemaker_idle_instanceUser_998_pacemakerHost07 on pacemakerHost07 | transition_key=167:1296:0:89eff0d8-bc61-403e-ba05-d5f40f6f9b46 op_key=pacemaker_idle_instanceUser_998_pacemakerHost07_start_0
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-execd [22967] (process_lrmd_rsc_exec) info: Resource 'pacemaker_idle_instanceUser_998_pacemakerHost07' not found (15 active resources)
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_lrm_rsc_op) error: Could not initiate start action for resource pacemaker_idle_instanceUser_998_pacemakerHost07 locally: No such device | rc=19
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-execd [22967] (process_lrmd_get_rsc_info) info: Agent information for 'pacemaker_idle_instanceUser_998_pacemakerHost07' not in cache
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (process_lrm_event) error: Unable to record pacemaker_idle_instanceUser_998_pacemakerHost07_start_0 result in CIB: No resource information
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (log_executor_event) error: Result of start operation for pacemaker_idle_instanceUser_998_pacemakerHost07 on pacemakerHost07: Internal communication failure (No such device) | graph action unconfirmed; call=999999999 key=pacemaker_idle_instanceUser_998_pacemakerHost07_start_0
```
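A hedged way to check whether the resource definition had actually reached the CIB on the affected node at that point is sketched below; the resource id is copied from the log above, and the commands assume the standard Pacemaker CLI tools are installed on pacemakerHost07:

```
# Check whether the resource definition is present in the CIB on pacemakerHost07
# (resource id copied from the log above; run on the affected node)
cibadmin --query --scope resources | grep pacemaker_idle_instanceUser_998_pacemakerHost07

# Ask Pacemaker where it believes the resource is running, if anywhere
crm_resource --locate --resource pacemaker_idle_instanceUser_998_pacemakerHost07
```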
Once this error message appears, pacemaker-controld shuts down after cancelling all resource actions:
```
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_recover) warning: Fast-tracking shutdown in response to errors
...
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 56 for pacemaker_ethmonitor_instanceUser_eth3 (pacemaker_ethmonitor_instanceUser_eth3:56)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_ethmonitor_instanceUser_eth3_monitor_4000
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 96 for pacemaker_instancehost_instanceUser (pacemaker_instancehost_instanceUser:96)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_instancehost_instanceUser_monitor_10000
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (services_action_cancel) info: Terminating in-flight op pacemaker_instancehost_instanceUser_monitor_10000[61199] early because it was cancelled
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (async_action_complete) info: pacemaker_instancehost_instanceUser_monitor_10000[61199] terminated with signal 9 (Killed)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_instancehost_instanceUser_monitor_10000
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 76 for pacemaker_ethmonitor_instanceUser_eth5 (pacemaker_ethmonitor_instanceUser_eth5:76)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-attrd [22968] (update_attr_on_host) notice: Setting last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0[pacemakerHost07] in instance_attributes: (unset) -> 1755040856 | from pacemakerHost06 with no write delay
...
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (lrmd_ipc_connection_destroy) info: Disconnected from local executor
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (pacemaker_cluster_disconnect) info: Disconnecting from corosync cluster layer
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (pacemaker__corosync_disconnect) notice: Disconnected from Corosync
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_ha_control) info: Disconnected from the cluster
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_cib_control) info: Waiting for resource update 95 to complete
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_cib_control) info: Waiting for resource update 95 to complete
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: --- 0.817.2 2
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: +++ 0.817.3 (null)
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: + /cib: @num_updates=3
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']: <nvpair id="status-3-last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07.start_0" name="last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0" value="1755040856"/>
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=local/client/2583, version=0.817.3)
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: --- 0.817.2 2
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: +++ 0.817.4 (null)
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: + /cib: @num_updates=4
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']: <nvpair id="status-3-last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07.start_0" name="last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0" value="1755040856"/>
...
Aug 12 18:20:56.314 pacemakerHost07 pacemaker-controld [22970] (crmd_fast_exit) error: Could not recover from internal error
Aug 12 18:20:56.314 pacemakerHost07 pacemaker-controld [22970] (crm_exit) info: Exiting pacemaker-controld | with status 1
Aug 12 18:20:56.318 pacemakerHost07 pacemakerd [22953] (pacemaker_child_exit) error: pacemaker-controld[22970] exited with status 1 (Error occurred)
Aug 12 18:20:56.318 pacemakerHost07 pacemakerd [22953] (pacemaker__ipc_is_authentic_process_active) info: Could not connect to crmd IPC: Connection refused
```
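To confirm the pacemaker-controld exit and its subsequent respawn by pacemakerd, something like the following could be run on the affected node; the time window comes from the log above, and the exact log wording may differ between Pacemaker versions:

```
# Confirm that pacemakerd restarted pacemaker-controld after the exit
# (time window taken from the log excerpt above)
journalctl -u pacemaker --since "18:20" | grep -E 'pacemaker-controld.*(exited|respawn|started)'

# One-shot cluster status to see whether the newly created resource
# is still stopped on pacemakerHost07
crm_mon --one-shot
```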
What is the impact of this issue to you?
- Pacemaker version: 2.1.9.1
How reproducible is this bug?:
Intermittently
Steps to reproduce
- Create a new resource (a minimal command sketch follows this list)
- Pacemaker reports the error shown above, and pacemaker-controld goes down
- pacemaker-controld respawns, but it does not attempt to start the newly created resources
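A minimal reproduction sketch, assuming a pcs-managed cluster; the resource name and agent below are placeholders, since the report does not say which agent the new resource used:

```
# Hypothetical example resource: ocf:pacemaker:Dummy is used only for
# illustration; any newly created resource appears sufficient to hit the issue
pcs resource create test_dummy ocf:pacemaker:Dummy op monitor interval=10s

# Watch for the "Resource ... not found" start failure and the
# pacemaker-controld exit on the node chosen to run the resource
journalctl -u pacemaker -f
```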
Expected results
- The newly created resource starts on the target node
Actual results
- The resources did not get started; pacemaker-controld on pacemakerHost07 reported the error above and exited