Bug
Resolution: Unresolved
Normal
rhos-17.1.11
rhos-ops-platform-services-pidone
Critical
To Reproduce
Steps to reproduce the behavior:
I am not sure whether this is reliably reproducible or a one-time situation that may not recur in the near future. Nonetheless, it looks like RHOSP InstanceHA environments with a large number of computes are vulnerable, because TripleO may trigger a pacemaker DoS when provisioning fencing devices on compute nodes.
We have a very good data set attached to the case, including:
- output of deployment command
- sosreports from director, all controllers and two affected computes
- overcloud-deploy folder
- DB dumps
From the provided dataset I can see that:
- the deployment command was running fine until step 5 execution on the compute nodes;
- compute nodes were fenced in large numbers by the controllers after the pacemaker cluster on the controller nodes reported different groups of errors [1];
- on the compute nodes I can see that some Ansible task was running; a small subset of computes managed to report a failed pcs command [2].
So it looks like some pacemaker calls were executed simultaneously on InstanceHA computes by TripleO, which triggered a DoS situation for pacemaker (see the illustrative sketch after the log excerpts below).
IMO this is a bug with very large potential impact.
[1]
Jan 21 12:51:01 CONTROLLER pacemakerd[250747]: notice: pacemaker-based[250748] is unresponsive to ipc after 1 tries
Jan 21 12:51:35 CONTROLLER pacemaker-controld[250753]: error: Result of monitor operation for rabbitmq on rabbitmq-bundle-2: Timed Out after 40s (Resource agent did not complete within 40s)
Jan 21 12:51:38 CONTROLLER pacemaker-fenced[250749]: warning: Client with process ID 250753 has a backlog of 1020 messages
[2]
Jan 21 12:52:31 COMPUTE puppet-user[538813]: Error: /Stage[main]/Tripleo::Fencing/Pacemaker::Stonith::Fence_ipmilan[MAC]/Pcmk_stonith[stonith-fence_ipmilan-dd]: Could not evaluate: pcs -f constraint location | grep stonith-fence_ipmilan-dd > /dev/null 2>&1 failed: . Too many tries
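To make the suspected pattern concrete, here is a rough sketch of what the data suggests. This is not the actual TripleO playbook; the play, task and pcs command below are illustrative only. The shape of the problem is one unthrottled fan-out over every compute, with each host's puppet run issuing pcs calls that all land on the controllers' pacemaker daemons at the same time.
# Illustrative only, not real TripleO code.
- hosts: Compute                     # hundreds of hosts in the affected cloud
  tasks:
    - name: Apply puppet step 5 (tripleo::fencing -> pcmk_stonith, simplified)
      ansible.builtin.command:
        cmd: pcs stonith status      # stand-in for the pcs queries/creates done per fencing device
      # no throttle/serial here, so every compute hits pacemaker simultaneously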
Expected behavior
There should be some form of jitter, or centralized sequential execution, rather than bulk simultaneous calls against pacemaker.
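As a sketch of what that could look like (illustrative Ansible only, not a patch; the task names, the 120-second window and the throttle value are assumptions), the fencing provisioning step could be preceded by a stable per-host random delay, and/or its concurrency could be capped:
# Sketch of the jitter/serialization idea; names and values are illustrative.
- name: Jitter before pacemaker/stonith provisioning
  ansible.builtin.pause:
    seconds: "{{ 120 | random(seed=inventory_hostname) }}"   # stable per-host delay, 0-119s

- name: Provision fencing device (stands in for the real puppet step)
  ansible.builtin.command:
    cmd: pcs stonith status          # placeholder for the real pcs create/query calls
  throttle: 5                        # or serialize the whole step centrally instead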
Bug impact
Outage for business
Known workaround
Tuning update_serial will help, but it is not a solution that prevents this from happening for customers who are not aware of the issue.
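For reference, assuming update_serial here refers to the per-role attribute in roles_data.yaml, the workaround is roughly the following (the value 5 is only an example, not a recommendation):
# roles_data.yaml excerpt (sketch); other role attributes omitted.
- name: Compute
  description: Basic Compute Node role
  update_serial: 5    # process at most 5 compute nodes at a time instead of all at once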