
Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: None
Affects Version/s: rhel-9.6
Component/s: pacemaker
Labels:
None

Regression:
No
Severity:
None

AssignedTeam:
rhel-ha

Story Points:
None
Blocked:
False
Ready:
False
Blocked Reason:
None
Product Documentation Required:
None
Sprint:
None

Preliminary Testing:
None
Test Coverage:
None

ProdDocsReview-CCS:
Unspecified
ProdDocsReview-Dev:
Unspecified
ProdDocsReview-QE:
Unspecified

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Planning:
None

What were you trying to do that didn't work?

We created a resource and started it on pacemakerHost05. On pacemakerHost07, Pacemaker triggered a start action for the IDLE resource, but the action failed with: Resource 'pacemaker_idle_instanceUser_998_pacemakerHost07' not found (15 active resources)
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_lrm_rsc_op) notice: Requesting local execution of start operation for pacemaker_idle_instanceUser_998_pacemakerHost07 on pacemakerHost07 | transition_key=167:1296:0:89eff0d8-bc61-403e-ba05-d5f40f6f9b46 op_key=pacemaker_idle_instanceUser_998_pacemakerHost07_start_0
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-execd [22967] (process_lrmd_rsc_exec) info: Resource 'pacemaker_idle_instanceUser_998_pacemakerHost07' not found (15 active resources)
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_lrm_rsc_op) error: Could not initiate start action for resource pacemaker_idle_instanceUser_998_pacemakerHost07 locally: No such device | rc=19
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-execd [22967] (process_lrmd_get_rsc_info) info: Agent information for 'pacemaker_idle_instanceUser_998_pacemakerHost07' not in cache
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (process_lrm_event) error: Unable to record pacemaker_idle_instanceUser_998_pacemakerHost07_start_0 result in CIB: No resource information
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (log_executor_event) error: Result of start operation for pacemaker_idle_instanceUser_998_pacemakerHost07 on pacemakerHost07: Internal communication failure (No such device) | graph action unconfirmed; call=999999999 key=pacemaker_idle_instanceUser_998_pacemakerHost07_start_0

Once this error message appears, pacemaker-controld goes down after cancelling all resource actions:
```
Aug 12 18:20:56.286 pacemakerHost07 pacemaker-controld [22970] (do_recover) warning: Fast-tracking shutdown in response to errors
...
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 56 for pacemaker_ethmonitor_instanceUser_eth3 (pacemaker_ethmonitor_instanceUser_eth3:56)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_ethmonitor_instanceUser_eth3_monitor_4000
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 96 for pacemaker_instancehost_instanceUser (pacemaker_instancehost_instanceUser:96)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_instancehost_instanceUser_monitor_10000
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (services_action_cancel) info: Terminating in-flight op pacemaker_instancehost_instanceUser_monitor_10000[61199] early because it was cancelled
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (async_action_complete) info: pacemaker_instancehost_instanceUser_monitor_10000[61199] terminated with signal 9 (Killed)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-execd [22967] (cancel_recurring_action) info: Cancelling ocf operation pacemaker_instancehost_instanceUser_monitor_10000
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-controld [22970] (stop_recurring_actions) info: Cancelling op 76 for pacemaker_ethmonitor_instanceUser_eth5 (pacemaker_ethmonitor_instanceUser_eth5:76)
Aug 12 18:20:56.290 pacemakerHost07 pacemaker-attrd [22968] (update_attr_on_host) notice: Setting last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0[pacemakerHost07] in instance_attributes: (unset) -> 1755040856 | from pacemakerHost06 with no write delay
...
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (lrmd_ipc_connection_destroy) info: Disconnected from local executor
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (pacemaker_cluster_disconnect) info: Disconnecting from corosync cluster layer
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (pacemaker__corosync_disconnect) notice: Disconnected from Corosync
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_ha_control) info: Disconnected from the cluster
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_cib_control) info: Waiting for resource update 95 to complete
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-controld [22970] (do_cib_control) info: Waiting for resource update 95 to complete
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: --- 0.817.2 2
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: +++ 0.817.3 (null)
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: + /cib: @num_updates=3
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']: <nvpair id="status-3-last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07.start_0" name="last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0" value="1755040856"/>
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_process_request) info: Completed cib_modify operation for section status: OK (rc=0, origin=local/client/2583, version=0.817.3)
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: --- 0.817.2 2
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: Diff: +++ 0.817.4 (null)
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: + /cib: @num_updates=4
Aug 12 18:20:56.294 pacemakerHost07 pacemaker-based [22965] (cib_perform_op) info: ++ /cib/status/node_state[@id='3']/transient_attributes[@id='3']/instance_attributes[@id='status-3']: <nvpair id="status-3-last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07.start_0" name="last-failure-pacemaker_idle_instanceUser_998_pacemakerHost07#start_0" value="1755040856"/>
...
Aug 12 18:20:56.314 pacemakerHost07 pacemaker-controld [22970] (crmd_fast_exit) error: Could not recover from internal error
Aug 12 18:20:56.314 pacemakerHost07 pacemaker-controld [22970] (crm_exit) info: Exiting pacemaker-controld | with status 1
Aug 12 18:20:56.318 pacemakerHost07 pacemakerd [22953] (pacemaker_child_exit) error: pacemaker-controld[22970] exited with status 1 (Error occurred)
Aug 12 18:20:56.318 pacemakerHost07 pacemakerd [22953] (pacemaker__ipc_is_authentic_process_active) info: Could not connect to crmd IPC: Connection refused

```
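For triaging other nodes or later occurrences, the failure sequence in the excerpts above can be detected mechanically. The following is a minimal sketch, assuming the log line layout seen in this report (timestamp, host, daemon, [pid], (function), level: message). The names `LOG_LINE`, `Entry`, `parse_line`, and `find_controld_crashes` are illustrative and not part of Pacemaker.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# Layout assumed from the excerpts in this report:
# "Aug 12 18:20:56.286 host daemon [pid] (function) level: message"
LOG_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:.]+)\s+(?P<host>\S+)\s+(?P<daemon>\S+)\s+"
    r"\[(?P<pid>\d+)\]\s+\((?P<func>\w+)\)\s+(?P<level>\w+):\s+(?P<msg>.*)$"
)
NOT_FOUND = re.compile(r"Resource '([^']+)' not found")


class Entry(NamedTuple):
    ts: str
    host: str
    daemon: str
    pid: int
    func: str
    level: str
    msg: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one log line; return None for lines that do not match (e.g. '...')."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    d = m.groupdict()
    return Entry(d["ts"], d["host"], d["daemon"], int(d["pid"]),
                 d["func"], d["level"], d["msg"])


def find_controld_crashes(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (resource, exit timestamp) pairs where the executor reported
    a resource as not found and the controller later hit crmd_fast_exit."""
    pending: Optional[str] = None
    crashes: List[Tuple[str, str]] = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        if entry.func == "process_lrmd_rsc_exec":
            m = NOT_FOUND.search(entry.msg)
            if m:
                pending = m.group(1)
        elif entry.func == "crmd_fast_exit" and pending is not None:
            crashes.append((pending, entry.ts))
            pending = None
    return crashes
```

Passing an open log file (for example `find_controld_crashes(open(path))`) yields one pair per occurrence; an empty list means the sequence was not seen in that file.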

What is the impact of this issue to you?

Failure to create a Pacemaker resource.

Please provide the package NVR for which the bug is seen:

2.1.9.1

How reproducible is this bug?:

intermittently

Steps to reproduce

Create a new resource.
Pacemaker reports the error shown above, and pacemaker-controld goes down.
pacemaker-controld respawns, but it does not attempt to start the newly created resources.
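
Because the failure is intermittent, the steps above may need to be repeated many times. The following is a rough sketch of a reproduction loop, not the procedure used when the bug was observed. Assumptions not taken from this report: resources are created with `pcs resource create`, the agent is `ocf:pacemaker:Dummy`, the resource names use a hypothetical `repro_test_` prefix, and `read_log` returns the Pacemaker log text of the node where pacemaker-controld fails.

```python
import subprocess
import time
from typing import Callable, List, Optional

# Message logged by crmd_fast_exit in the excerpt above.
CRASH_MARKER = "Could not recover from internal error"


def create_command(name: str, agent: str = "ocf:pacemaker:Dummy") -> List[str]:
    """Build a `pcs resource create` command (the agent is an assumption)."""
    return ["pcs", "resource", "create", name, agent]


def reproduce(
    attempts: int,
    read_log: Callable[[], str],
    run: Callable[..., object] = subprocess.run,
    settle_seconds: float = 5.0,
    prefix: str = "repro_test_",  # hypothetical naming scheme
) -> Optional[str]:
    """Create resources one at a time until the crash marker shows up in
    newly written log text. Returns the triggering resource name, or None."""
    for i in range(attempts):
        name = f"{prefix}{i}"
        baseline = len(read_log())       # ignore crashes logged earlier
        run(create_command(name), check=True)
        time.sleep(settle_seconds)       # give the cluster time to react
        if CRASH_MARKER in read_log()[baseline:]:
            return name
    return None
```

A `read_log` callable could simply read `/var/log/pacemaker/pacemaker.log` on the affected node. The length-based baseline does not account for log rotation during a run, and resources created by the loop need to be removed afterwards (for example with `pcs resource delete`).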

Expected results

The resource should go into the started state correctly on all hosts.

Actual results

The resources did not get started.
