We have identified that the default value reported for `pcmk_monitor_timeout` on stonith devices is inaccurate in all of our documentation and man pages. The default is listed as 60s (based on `stonith-timeout`), but because `pcmk_monitor_timeout` is not actually applied unless it is explicitly set, the listed value is misleading. The monitor timeout actually in effect by default is 20s, so we should update the documentation and man pages (upstream and in RHEL):
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-fencedevicesadditional-haar
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configu[...]-fencing-configuring-and-managing-high-availability-clusters
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/configu[...]-fencing-configuring-and-managing-high-availability-clusters
$ man pacemaker-fenced
--------------------------------8<-----------------------------
pcmk_monitor_timeout = time [60s]
    Advanced use only: Specify an alternate timeout to use for monitor actions instead of stonith-timeout
    Some devices need much more/less time to complete than normal. Use this to specify an alternate, device-specific, timeout for 'monitor' actions.
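
To make the mismatch concrete, here is a minimal sketch in Python of the behavior as this report describes it. It is a hypothetical model, not Pacemaker code: the function name `effective_monitor_timeout` and both constants are assumptions introduced for illustration, and the numbers are only the 60s and 20s values quoted above.

```python
from typing import Optional

# Values quoted in this report. Everything else here is a hypothetical model.
DOCUMENTED_DEFAULT_S = 60  # default shown in the man page and documentation
OBSERVED_DEFAULT_S = 20    # timeout reported as actually in effect when unset


def effective_monitor_timeout(pcmk_monitor_timeout: Optional[int] = None) -> int:
    """Return the monitor timeout in seconds, as described in this report.

    An explicitly set pcmk_monitor_timeout is honored; when it is unset,
    the documented 60s default is not applied and 20s is used instead.
    """
    if pcmk_monitor_timeout is not None:
        return pcmk_monitor_timeout
    return OBSERVED_DEFAULT_S


if __name__ == "__main__":
    print("unset:", effective_monitor_timeout())
    print("explicit 60:", effective_monitor_timeout(60))
    print("documented default:", DOCUMENTED_DEFAULT_S)
```

The sketch only restates the reported symptom. How the timeouts are really derived is still task (1) of this bug.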
Slack discussion of this issue:
https://redhat-internal.slack.com/archives/C04HH4AJYH4/p1710789736264799
After discussion with Kgalliot and engineering, below are the tasks we wish to complete with this bug:
(1) figure out how the fencing monitor timeouts currently work
(2) decide and implement how they should be defined and used
(3) update the upstream documentation appropriately. The defaults also appear in the pacemaker-fenced man page, which needs to be updated as well.
(4) update the RHEL documentation.
- For official documentation updates, we have the below DOC request opened:
[RHELDOCS-17816] Update documentation for pcmk_monitor_timeout
https://issues.redhat.com/browse/RHELDOCS-17816
This issue is also an extension of the issues being reviewed in the bug below:
RHEL-14826 A stop action for a stonith device timed out leading to a cluster node being fenced
https://issues.redhat.com/browse/RHEL-14826
- is related to: RHEL-14826 A stop action for a stonith device timed out leading to a cluster node being fenced (In Progress)