Bug
Resolution: Unresolved
Critical
rhel-9.4
fence-agents-4.10.0-95.el9
Important
ZStream
rhel-ha
Approved Blocker
Pass
Manual
Bug Fix
x86_64
What were you trying to do that didn't work?
A hung virtual machine, for example one that has hit a kernel panic or a BSOD, will usually not respond to any shutdown request (guest agent or ACPI) and will stay in that state. A very busy VM, or one suffering from resource contention, may also take much longer than usual to shut down.
To power a VM off, fence_kubevirt sends KubeVirt a plain stop request, which makes KubeVirt try to shut the VM down gracefully, waiting until terminationGracePeriodSeconds passes before it finally kills/powers off the VM.
The problem is that terminationGracePeriodSeconds defaults to 180s for Linux VMs and even longer for Windows (1 hour), so in the eyes of fence_kubevirt a hung VM fails to power off whenever the shutdown takes longer than the agent's 40s timeout, which is always the case unless the customer has set a custom terminationGracePeriodSeconds.
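For reference, whether a VM overrides that default can be checked directly in its definition; using the rhel-79 VM in the homelab namespace from the example below, something along these lines should show the configured value:

# illustrative check only; empty output means the KubeVirt default applies
oc get vm rhel-79 -n homelab -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'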
For example, on a panicked RHEL 7.9 guest:
# /usr/sbin/fence_kubevirt --kubeconfig=/root/config --namespace=homelab -n rhel-79 -o off -vvv
2025-03-04 12:31:20,942 DEBUG: Starting get status operation
2025-03-04 12:31:20,952 DEBUG: Starting set status operation
2025-03-04 12:31:20,974 DEBUG: Starting get status operation
2025-03-04 12:31:21,980 DEBUG: Starting get status operation
2025-03-04 12:31:22,987 DEBUG: Starting get status operation
2025-03-04 12:31:23,993 DEBUG: Starting get status operation
2025-03-04 12:31:25,001 DEBUG: Starting get status operation
2025-03-04 12:31:26,007 DEBUG: Starting get status operation
2025-03-04 12:31:27,014 DEBUG: Starting get status operation
2025-03-04 12:31:28,021 DEBUG: Starting get status operation
2025-03-04 12:31:29,026 DEBUG: Starting get status operation
2025-03-04 12:31:30,033 DEBUG: Starting get status operation
2025-03-04 12:31:31,040 DEBUG: Starting get status operation
2025-03-04 12:31:32,046 DEBUG: Starting get status operation
2025-03-04 12:31:33,053 DEBUG: Starting get status operation
2025-03-04 12:31:34,062 DEBUG: Starting get status operation
2025-03-04 12:31:35,069 DEBUG: Starting get status operation
2025-03-04 12:31:36,076 DEBUG: Starting get status operation
2025-03-04 12:31:37,082 DEBUG: Starting get status operation
2025-03-04 12:31:38,089 DEBUG: Starting get status operation
2025-03-04 12:31:39,096 DEBUG: Starting get status operation
2025-03-04 12:31:40,101 DEBUG: Starting get status operation
2025-03-04 12:31:41,107 DEBUG: Starting get status operation
2025-03-04 12:31:42,114 DEBUG: Starting get status operation
2025-03-04 12:31:43,121 DEBUG: Starting get status operation
2025-03-04 12:31:44,127 DEBUG: Starting get status operation
2025-03-04 12:31:45,143 DEBUG: Starting get status operation
2025-03-04 12:31:46,151 DEBUG: Starting get status operation
2025-03-04 12:31:47,159 DEBUG: Starting get status operation
2025-03-04 12:31:48,171 DEBUG: Starting get status operation
2025-03-04 12:31:49,178 DEBUG: Starting get status operation
2025-03-04 12:31:50,184 DEBUG: Starting get status operation
2025-03-04 12:31:51,190 DEBUG: Starting get status operation
2025-03-04 12:31:52,197 DEBUG: Starting get status operation
2025-03-04 12:31:53,204 DEBUG: Starting get status operation
2025-03-04 12:31:54,211 DEBUG: Starting get status operation
2025-03-04 12:31:55,217 DEBUG: Starting get status operation
2025-03-04 12:31:56,223 DEBUG: Starting get status operation
2025-03-04 12:31:57,229 DEBUG: Starting get status operation
2025-03-04 12:31:58,236 DEBUG: Starting get status operation
2025-03-04 12:31:59,243 DEBUG: Starting get status operation
2025-03-04 12:32:00,251 DEBUG: Starting get status operation
2025-03-04 12:32:01,257 ERROR: Failed: Timed out waiting to power OFF
About 140 seconds later KubeVirt finally kills the VM, but fence_kubevirt gave up long before that.
A reboot fails the same way: the VM eventually stops and then stays off.
For VMs that are hung it is not hard to exceed the fence_kubevirt timeout.
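As a stopgap, the agent's power timeout can be raised above the grace period so the off operation at least completes eventually, along these lines (240 is only an illustrative value, chosen to cover the 180s Linux default):

# illustrative workaround: wait out the default grace period instead of failing at 40s
fence_kubevirt --kubeconfig=/root/config --namespace=homelab -n rhel-79 -o off --power-timeout=240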
I'm not sure, but a fencing operation is usually a hard power off, i.e. an ungraceful shutdown that does not wait for the guest to stop cleanly. Most of the time fencing is needed, the VM is malfunctioning in some way.
Perhaps the timeouts need to be reviewed, or the agent should issue a force stop to KubeVirt instead of a graceful one.
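For comparison, KubeVirt can be asked to skip the grace period when the stop is forced; roughly speaking the agent would need the equivalent of the following (assuming virtctl's --force and --grace-period flags behave as documented, and noting that CNV-64608, linked below, tracks cases where a stop cannot be enforced):

# illustrative only: force stop with no grace period, bypassing the graceful shutdown
virtctl stop rhel-79 -n homelab --force --grace-period 0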
What is the impact of this issue to you?
Fencing fails whenever the VM is hung.
Please provide the package NVR for which the bug is seen:
fence-agents-kubevirt-4.10.0-62.el9_4.10.x86_64
How reproducible is this bug?:
Always
Steps to reproduce
1. Kernel panic a VM that is not set to reboot on panic
2. Use fence_kubevirt to reboot it (see the sketch below)
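Concretely, the reproduction amounts to something like this (the sysrq trigger is just one way to panic the guest and requires sysrq to be enabled; the fence command matches the example above but with -o reboot):

# inside the guest: trigger a kernel panic (the guest must not auto-reboot on panic)
echo c > /proc/sysrq-trigger
# on the fencing host:
fence_kubevirt --kubeconfig=/root/config --namespace=homelab -n rhel-79 -o reboot -vvv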
Expected results
VM rebooted immediately
Actual results
The VM takes a long time to power off (the full terminationGracePeriodSeconds) and then stays off.
depends on: CNV-64608 Cannot enforce stopping a VM (Verified)
is blocked by: CNV-64608 Cannot enforce stopping a VM (Verified)
links to: RHSA-2025:147364 fence-agents security update