RHEL-82193

fence_kubevirt uses graceful shutdown to power off/reboot VM, times out and fails.


    • fence-agents-4.10.0-95.el9
    • Important
    • ZStream
    • rhel-ha
    • Approved Blocker
    • Bug Fix

      .`fence_kubevirt` powers off nodes instantly

      Before this update, the `fence_kubevirt` agent performed a graceful shutdown of the node. This introduced a delay in the fencing process, as the node was not powered off immediately.

      With this release, the agent has been modified to request an immediate, non-graceful shutdown.

      As a result, when using the `fence_kubevirt` agent, nodes are instantly powered off.
    • x86_64

      What were you trying to do that didn't work?

      A hung virtual machine, e.g. one with a kernel panic or a BSOD, will usually not respond to any shutdown request (guest agent or ACPI) and will stay in that state. A very busy VM, or one with resource contention issues, may also take a long time to shut down.

      To power off a VM, fence_kubevirt sends KubeVirt a plain stop request, which makes KubeVirt try to gracefully shut down the VM, waiting until terminationGracePeriodSeconds passes before it finally kills/powers off the VM.
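
      For illustration only (not the agent's actual code path), this is roughly the shape of that stop request from the command line; the cluster URL and bearer token below are placeholders, not values from this report:

      # A plain stop lets KubeVirt attempt a graceful guest shutdown first:
      virtctl stop rhel-79 --namespace homelab

      # Equivalent REST call against the KubeVirt stop subresource:
      curl -k -X PUT -H "Authorization: Bearer $TOKEN" \
        "https://api.example:6443/apis/subresources.kubevirt.io/v1/namespaces/homelab/virtualmachines/rhel-79/stop"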

      The problem is that terminationGracePeriodSeconds defaults to 180s for Linux VMs, and even longer for Windows (1 hour). So a hung VM will fail to power off in the eyes of fence_kubevirt whenever it takes longer than the fence_kubevirt timeout of 40s to shut down, which will always happen unless the customer has set a custom terminationGracePeriodSeconds.
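
      A custom grace period lives on the VM template spec; a hedged example using generic kubectl commands (the VM/namespace names match this report, everything else is an assumption):

      # Inspect the current grace period on the VM template:
      kubectl get vm rhel-79 -n homelab \
        -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}'

      # Lower it so a stop escalates to a kill much sooner.
      # Note: this takes effect the next time the VMI is created.
      kubectl patch vm rhel-79 -n homelab --type=merge \
        -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":5}}}}'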

      See this on a panicked RHEL 7.9 guest:

      # /usr/sbin/fence_kubevirt --kubeconfig=/root/config  --namespace=homelab -n rhel-79 -o off -vvv
      2025-03-04 12:31:20,942 DEBUG: Starting get status operation
      2025-03-04 12:31:20,952 DEBUG: Starting set status operation
      2025-03-04 12:31:20,974 DEBUG: Starting get status operation
      2025-03-04 12:31:21,980 DEBUG: Starting get status operation
      2025-03-04 12:31:22,987 DEBUG: Starting get status operation
      2025-03-04 12:31:23,993 DEBUG: Starting get status operation
      2025-03-04 12:31:25,001 DEBUG: Starting get status operation
      2025-03-04 12:31:26,007 DEBUG: Starting get status operation
      2025-03-04 12:31:27,014 DEBUG: Starting get status operation
      2025-03-04 12:31:28,021 DEBUG: Starting get status operation
      2025-03-04 12:31:29,026 DEBUG: Starting get status operation
      2025-03-04 12:31:30,033 DEBUG: Starting get status operation
      2025-03-04 12:31:31,040 DEBUG: Starting get status operation
      2025-03-04 12:31:32,046 DEBUG: Starting get status operation
      2025-03-04 12:31:33,053 DEBUG: Starting get status operation
      2025-03-04 12:31:34,062 DEBUG: Starting get status operation
      2025-03-04 12:31:35,069 DEBUG: Starting get status operation
      2025-03-04 12:31:36,076 DEBUG: Starting get status operation
      2025-03-04 12:31:37,082 DEBUG: Starting get status operation
      2025-03-04 12:31:38,089 DEBUG: Starting get status operation
      2025-03-04 12:31:39,096 DEBUG: Starting get status operation
      2025-03-04 12:31:40,101 DEBUG: Starting get status operation
      2025-03-04 12:31:41,107 DEBUG: Starting get status operation
      2025-03-04 12:31:42,114 DEBUG: Starting get status operation
      2025-03-04 12:31:43,121 DEBUG: Starting get status operation
      2025-03-04 12:31:44,127 DEBUG: Starting get status operation
      2025-03-04 12:31:45,143 DEBUG: Starting get status operation
      2025-03-04 12:31:46,151 DEBUG: Starting get status operation
      2025-03-04 12:31:47,159 DEBUG: Starting get status operation
      2025-03-04 12:31:48,171 DEBUG: Starting get status operation
      2025-03-04 12:31:49,178 DEBUG: Starting get status operation
      2025-03-04 12:31:50,184 DEBUG: Starting get status operation
      2025-03-04 12:31:51,190 DEBUG: Starting get status operation
      2025-03-04 12:31:52,197 DEBUG: Starting get status operation
      2025-03-04 12:31:53,204 DEBUG: Starting get status operation
      2025-03-04 12:31:54,211 DEBUG: Starting get status operation
      2025-03-04 12:31:55,217 DEBUG: Starting get status operation
      2025-03-04 12:31:56,223 DEBUG: Starting get status operation
      2025-03-04 12:31:57,229 DEBUG: Starting get status operation
      2025-03-04 12:31:58,236 DEBUG: Starting get status operation
      2025-03-04 12:31:59,243 DEBUG: Starting get status operation
      2025-03-04 12:32:00,251 DEBUG: Starting get status operation
      2025-03-04 12:32:01,257 ERROR: Failed: Timed out waiting to power OFF
      

       

      140s later KubeVirt finally kills the VM, but fence_kubevirt gave up long before that. A reboot fails the same way: the VM eventually stops and then stays off.

      For VMs that are hung, it is not hard to exceed the fence_kubevirt timeout.

      I'm not sure, but a fencing operation is usually a hard power off, i.e. an ungraceful shutdown that does not wait for the guest to stop cleanly. Most of the time fencing is needed, the VM is malfunctioning in some way.

      Perhaps the timeouts need to be reviewed, or the agent should issue a force stop to KubeVirt instead of a graceful one; a sketch of that alternative follows below.
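
      A hedged sketch of the force-stop alternative; the virtctl flags and the gracePeriod field in the stop body should be verified against the KubeVirt version in use:

      # Force stop: skip the graceful shutdown and kill the VM immediately.
      virtctl stop rhel-79 --namespace homelab --force --grace-period 0

      # Equivalent REST call: pass a zero grace period in the stop body.
      curl -k -X PUT -H "Authorization: Bearer $TOKEN" \
        -H "Content-Type: application/json" \
        -d '{"gracePeriod": 0}' \
        "https://api.example:6443/apis/subresources.kubevirt.io/v1/namespaces/homelab/virtualmachines/rhel-79/stop"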

      What is the impact of this issue to you?

      Fencing fails when the VM is hung.

      Please provide the package NVR for which the bug is seen:

      fence-agents-kubevirt-4.10.0-62.el9_4.10.x86_64

      How reproducible is this bug?:

      Always

      Steps to reproduce

      1. Kernel panic a VM without rebooting
      2. Use fence_kubevirt to reboot it (see the sketch after this list)
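
      A minimal way to do step 1 on a Linux guest, assuming sysrq is available; these commands are generic, not taken from this report:

      # Inside the guest: panic the kernel and do not auto-reboot afterwards.
      sysctl kernel.panic=0              # 0 = stay panicked, never reboot
      echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
      echo c > /proc/sysrq-trigger       # trigger a crash

      # From the fencing host, mirroring the command shown earlier:
      /usr/sbin/fence_kubevirt --kubeconfig=/root/config --namespace=homelab -n rhel-79 -o reboot -vvv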

      Expected results

      VM rebooted immediately

      Actual results

      The VM takes ages to power off (up to terminationGracePeriodSeconds) and then stays off.

              Oyvind Albrigtsen (rhn-engineering-oalbrigt)
              Germano Veit Michel (rhn-support-gveitmic)
              Oyvind Albrigtsen
              Ilias Romanos
              Michal Stubna