What were you trying to do that didn't work?
When one of the Pacemaker sub-daemons hangs (in this case, pacemaker-attrd), Pacemaker tries five times to connect to the process, then kills it and respawns it. The problem we encountered is a small timing window between killing the process and respawning it. If pacemaker-controld tries to connect to a sub-daemon that has been killed and is still respawning, the connection fails; controld treats that as a fatal error and shuts down the entire Pacemaker stack.
What is the impact of this issue to you?
- Pacemaker encounters a fatal error, shuts itself down, and does not recover without manual intervention
Please provide the package NVR for which the bug is seen:
version 2.1.8-3.el9-3980678f0
How reproducible is this bug?:
Difficult to reproduce, as it requires pacemaker-controld to interact with the killed sub-daemon before it respawns
Steps to reproduce
- Run kill -SIGSTOP against one of the Pacemaker sub-daemons (in this example, pacemaker-attrd)
- Pacemaker logs that attrd is unresponsive to IPC and respawns it
- After attrd is killed and before it respawns, controld connects to attrd (to update the failure count, etc.)
- pacemaker-controld fails to connect to the attrd daemon and shuts down the entire Pacemaker stack
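The first step above can be scripted. The following is a minimal sketch, not part of Pacemaker: the helper names `find_pids` and `pause_process` are hypothetical, and it assumes a Linux host where the daemon's name is readable from `/proc/<pid>/comm`. It only sends SIGSTOP; hitting the timing window in the third step still depends on controld contacting attrd at the right moment.

```python
import os
import signal


def find_pids(comm_name):
    """Return PIDs whose /proc/<pid>/comm equals comm_name (Linux only)."""
    if not os.path.isdir("/proc"):
        return []
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == comm_name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited while we were scanning
    return pids


def pause_process(pid):
    """Send SIGSTOP so the process stops answering IPC without exiting."""
    os.kill(pid, signal.SIGSTOP)


if __name__ == "__main__":
    # Run as root on a test cluster node only; this can take the stack down.
    for pid in find_pids("pacemaker-attrd"):
        pause_process(pid)
        print(f"sent SIGSTOP to pacemaker-attrd (pid {pid})")
```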
Expected results
Pacemaker should not try to interact with a sub-daemon that it has just killed and is in the process of respawning
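To illustrate the expected behavior, here is a rough sketch of a client that waits and retries while the sub-daemon is known to be respawning, instead of treating the first failed connection as fatal. This is not Pacemaker code (Pacemaker is written in C). The names `connect_with_grace`, `connect`, `is_respawning`, and `DaemonUnavailable`, along with the attempt count and delay, are assumptions made for the example.

```python
import time


class DaemonUnavailable(Exception):
    """Raised when the sub-daemon stays unreachable after all retries."""


def connect_with_grace(connect, is_respawning, attempts=5, delay=0.5,
                       sleep=time.sleep):
    """Call connect(); retry while the daemon is known to be respawning.

    connect       -- hypothetical callable returning an IPC handle, raising
                     ConnectionError on failure
    is_respawning -- hypothetical callable returning True while the parent
                     has killed the daemon and not yet restarted it
    """
    last_error = None
    for _ in range(attempts):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            if not is_respawning():
                break  # not a respawn window: treat as a genuine failure
            sleep(delay)  # give the respawn time to complete
    raise DaemonUnavailable("sub-daemon unreachable") from last_error
```

With this shape, a connection failure becomes fatal only when the daemon is not being respawned, or when it stays unreachable after the retries are exhausted.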
Actual results
Pacemaker interacts with a sub-daemon it has just killed, and as a result the entire Pacemaker stack goes down