RHEL-82193

fence_kubevirt uses graceful shutdown to power off/reboot VM, times out and fails.


    • fence-agents-4.10.0-95.el9
    • Important
    • ZStream
    • rhel-ha
    • Approved Blocker
    • Bug Fix

      .`fence_kubevirt` powers off nodes instantly

      Before this update, the `fence_kubevirt` agent performed a graceful shutdown of the node. This introduced a delay in the fencing process, as the node was not powered off immediately.

      With this release, the agent has been modified to request an immediate, non-graceful shutdown.

      As a result, when using the `fence_kubevirt` agent, nodes are instantly powered off.
    • x86_64

      What were you trying to do that didn't work?

      A hung virtual machine, e.g. one with a kernel panic or a BSOD, will usually not respond to any shutdown request (guest agent or ACPI) and will stay in that state. A very busy VM, or one with resource contention issues, may also take a long time to shut down.

      To power off a VM, fence_kubevirt sends KubeVirt a plain stop request, which makes KubeVirt try to gracefully shut down the VM, waiting until terminationGracePeriodSeconds passes before it finally kills/powers off the VM.
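
      For illustration only (not the agent's actual code path), this is roughly the shape of that stop request from the command line; the cluster URL and bearer token below are placeholders, not values from this report:

      # A plain stop lets KubeVirt attempt a graceful guest shutdown first:
      virtctl stop rhel-79 --namespace homelab

      # Equivalent REST call against the KubeVirt stop subresource:
      curl -k -X PUT -H "Authorization: Bearer $TOKEN" \
        "https://api.example:6443/apis/subresources.kubevirt.io/v1/namespaces/homelab/virtualmachines/rhel-79/stop"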

      The problem is that terminationGracePeriodSeconds defaults to 180s for Linux VMs, and even longer for Windows (1 hour). So a hung VM will fail to power off in the eyes of fence_kubevirt whenever it takes longer than the fence_kubevirt timeout of 40s to shut down, which will always happen unless the customer has set a custom terminationGracePeriodSeconds.
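
      A custom grace period lives on the VM template spec; a hedged example using generic kubectl commands (the VM/namespace names match this report, everything else is an assumption):

      # Inspect the current grace period on the VM template:
      kubectl get vm rhel-79 -n homelab \
        -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'

      # Lower it so a stop escalates to a kill much sooner.
      # Note: this takes effect the next time the VMI is created.
      kubectl patch vm rhel-79 -n homelab --type=merge \
        -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":5}}}}'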

      See this on a panicked RHEL 7.9 guest:

      # /usr/sbin/fence_kubevirt --kubeconfig=/root/config  --namespace=homelab -n rhel-79 -o off -vvv
      2025-03-04 12:31:20,942 DEBUG: Starting get status operation
      2025-03-04 12:31:20,952 DEBUG: Starting set status operation
      2025-03-04 12:31:20,974 DEBUG: Starting get status operation
      2025-03-04 12:31:21,980 DEBUG: Starting get status operation
      2025-03-04 12:31:22,987 DEBUG: Starting get status operation
      2025-03-04 12:31:23,993 DEBUG: Starting get status operation
      2025-03-04 12:31:25,001 DEBUG: Starting get status operation
      2025-03-04 12:31:26,007 DEBUG: Starting get status operation
      2025-03-04 12:31:27,014 DEBUG: Starting get status operation
      2025-03-04 12:31:28,021 DEBUG: Starting get status operation
      2025-03-04 12:31:29,026 DEBUG: Starting get status operation
      2025-03-04 12:31:30,033 DEBUG: Starting get status operation
      2025-03-04 12:31:31,040 DEBUG: Starting get status operation
      2025-03-04 12:31:32,046 DEBUG: Starting get status operation
      2025-03-04 12:31:33,053 DEBUG: Starting get status operation
      2025-03-04 12:31:34,062 DEBUG: Starting get status operation
      2025-03-04 12:31:35,069 DEBUG: Starting get status operation
      2025-03-04 12:31:36,076 DEBUG: Starting get status operation
      2025-03-04 12:31:37,082 DEBUG: Starting get status operation
      2025-03-04 12:31:38,089 DEBUG: Starting get status operation
      2025-03-04 12:31:39,096 DEBUG: Starting get status operation
      2025-03-04 12:31:40,101 DEBUG: Starting get status operation
      2025-03-04 12:31:41,107 DEBUG: Starting get status operation
      2025-03-04 12:31:42,114 DEBUG: Starting get status operation
      2025-03-04 12:31:43,121 DEBUG: Starting get status operation
      2025-03-04 12:31:44,127 DEBUG: Starting get status operation
      2025-03-04 12:31:45,143 DEBUG: Starting get status operation
      2025-03-04 12:31:46,151 DEBUG: Starting get status operation
      2025-03-04 12:31:47,159 DEBUG: Starting get status operation
      2025-03-04 12:31:48,171 DEBUG: Starting get status operation
      2025-03-04 12:31:49,178 DEBUG: Starting get status operation
      2025-03-04 12:31:50,184 DEBUG: Starting get status operation
      2025-03-04 12:31:51,190 DEBUG: Starting get status operation
      2025-03-04 12:31:52,197 DEBUG: Starting get status operation
      2025-03-04 12:31:53,204 DEBUG: Starting get status operation
      2025-03-04 12:31:54,211 DEBUG: Starting get status operation
      2025-03-04 12:31:55,217 DEBUG: Starting get status operation
      2025-03-04 12:31:56,223 DEBUG: Starting get status operation
      2025-03-04 12:31:57,229 DEBUG: Starting get status operation
      2025-03-04 12:31:58,236 DEBUG: Starting get status operation
      2025-03-04 12:31:59,243 DEBUG: Starting get status operation
      2025-03-04 12:32:00,251 DEBUG: Starting get status operation
      2025-03-04 12:32:01,257 ERROR: Failed: Timed out waiting to power OFF
      

       

      140s later KubeVirt finally kills the VM, but fence_kubevirt gave up long before that. A reboot fails the same way: the VM eventually stops and then stays off.

      For VMs that are hung, it is not hard to exceed the fence_kubevirt timeout.

      I'm not sure, but a fencing operation is usually a hard power off, i.e. an ungraceful shutdown that does not wait for the guest to stop cleanly. Most of the time fencing is needed, the VM is malfunctioning in some way.

      Perhaps the timeouts need to be reviewed, or the agent should issue a force stop to KubeVirt instead of a graceful one; a sketch of that alternative follows below.
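
      A hedged sketch of the force-stop alternative; the virtctl flags and the gracePeriod field in the stop body should be verified against the KubeVirt version in use:

      # Force stop: skip the graceful shutdown and kill the VM immediately.
      virtctl stop rhel-79 --namespace homelab --force --grace-period 0

      # Equivalent REST call: pass a zero grace period in the stop body.
      curl -k -X PUT -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"gracePeriod": 0}' \
        "https://api.example:6443/apis/subresources.kubevirt.io/v1/namespaces/homelab/virtualmachines/rhel-79/stop"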

      What is the impact of this issue to you?

      Fencing fails when the VM is hung.

      Please provide the package NVR for which the bug is seen:

      fence-agents-kubevirt-4.10.0-62.el9_4.10.x86_64

      How reproducible is this bug?:

      Always

      Steps to reproduce

      1. Kernel panic a VM without rebooting
      2. Use fence_kubevirt to reboot it (see the sketch after this list)
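
      A minimal way to do step 1 on a Linux guest, assuming sysrq is available; these commands are generic, not taken from this report:

      # Inside the guest: panic the kernel and do not auto-reboot afterwards.
      sysctl kernel.panic=0              # 0 = stay panicked, never reboot
      echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
      echo c > /proc/sysrq-trigger       # trigger a crash

      # From the fencing host, mirroring the command shown earlier:
      /usr/sbin/fence_kubevirt --kubeconfig=/root/config --namespace=homelab -n rhel-79 -o reboot -vvv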

      Expected results

      VM rebooted immediately

      Actual results

      The VM takes ages to power off (up to terminationGracePeriodSeconds) and then stays off.

              Oyvind Albrigtsen (rhn-engineering-oalbrigt)
              Germano Veit Michel (rhn-support-gveitmic)
              Oyvind Albrigtsen
              Ilias Romanos
              Michal Stubna