Bug
Resolution: Unresolved
Normal
rhos-17.1.11
rhos-ops-platform-services-pidone
Critical
To Reproduce
Steps to reproduce the behavior:
I am not sure whether this is reliably reproducible or a one-time situation that may not recur in the near future. Nonetheless, it looks like RHOSP InstanceHA environments with a large number of computes are vulnerable, because TripleO may trigger a pacemaker DoS when provisioning fencing devices on compute nodes.
We have a very good data set attached to the case, including:
- output of deployment command
- sosreports from director, all controllers and two affected computes
- overcloud-deploy folder
- DB dumps
From the provided dataset I can see that:
- the deployment command was running fine until step 5 execution on the compute nodes;
- compute nodes were fenced in large numbers by the controllers after the pacemaker cluster on the controller nodes reported different groups of errors [1];
- on the compute nodes I can see that some Ansible task was running; a small subset of computes managed to report a failed pcs command [2].
So it looks like some pacemaker calls were executed simultaneously on InstanceHA computes by TripleO, which triggered a DoS situation for pacemaker (see the illustrative sketch after the log excerpts below).
IMO this is a bug with very large potential impact.
[1]
Jan 21 12:51:01 CONTROLLER pacemakerd[250747]: notice: pacemaker-based[250748] is unresponsive to ipc after 1 tries
Jan 21 12:51:35 CONTROLLER pacemaker-controld[250753]: error: Result of monitor operation for rabbitmq on rabbitmq-bundle-2: Timed Out after 40s (Resource agent did not complete within 40s)
Jan 21 12:51:38 CONTROLLER pacemaker-fenced[250749]: warning: Client with process ID 250753 has a backlog of 1020 messages
[2]
Jan 21 12:52:31 COMPUTE puppet-user[538813]: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Fence_ipmilan[MAC]/Pcmk_stonith[stonith-fence_ipmilan-dd]: Could not evaluate: pcs -f constraint location | grep stonith-fence_ipmilan-dd > /dev/null 2>&1 failed: . Too many tries
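To make the suspected pattern concrete, here is a rough sketch of what the data suggests. This is not the actual TripleO playbook; the play, task and pcs command below are illustrative only. The shape of the problem is one unthrottled fan-out over every compute, with each host's puppet run issuing pcs calls that all land on the controllers' pacemaker daemons at the same time.
# Illustrative only, not real TripleO code.
- hosts: Compute                     # hundreds of hosts in the affected cloud
  tasks:
    - name: Apply puppet step 5 (tripleo::fencing -> pcmk_stonith, simplified)
      ansible.builtin.command:
        cmd: pcs stonith status      # stand-in for the pcs queries/creates done per fencing device
      # no throttle/serial here, so every compute hits pacemaker simultaneously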
Expected behavior
There should be some form of jitter, or centralized sequential execution, rather than bulk simultaneous calls against pacemaker.
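As a sketch of what that could look like (illustrative Ansible only, not a patch; the task names, the 120-second window and the throttle value are assumptions), the fencing provisioning step could be preceded by a stable per-host random delay, and/or its concurrency could be capped:
# Sketch of the jitter/serialization idea; names and values are illustrative.
- name: Jitter before pacemaker/stonith provisioning
  ansible.builtin.pause:
    seconds: "{{ 120 | random(seed=inventory_hostname) }}"   # stable per-host delay, 0-119s

- name: Provision fencing device (stands in for the real puppet step)
  ansible.builtin.command:
    cmd: pcs stonith status          # placeholder for the real pcs create/query calls
  throttle: 5                        # or serialize the whole step centrally instead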
Bug impact
Outage for business
Known workaround
Tuning update_serial will help, but it is not a solution that prevents this from happening for customers who are not aware of the issue.
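For reference, assuming update_serial here refers to the per-role attribute in roles_data.yaml, the workaround is roughly the following (the value 5 is only an example, not a recommendation):
# roles_data.yaml excerpt (sketch); other role attributes omitted.
- name: Compute
  description: Basic Compute Node role
  update_serial: 5    # process at most 5 compute nodes at a time instead of all at once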