Bug
Resolution: Unresolved
Normal
4.20.z
Quality / Stability / Reliability
Description of problem:
OCP + OCP Virt using MachineHealthCheck + Self Node Remediation (SNR) for fencing.
If we crash a node (for example, by killing the kubelet or shutting down the worker), we hit a race condition when the node is remediated (restarted) by SNR.
We see 2 distinct issues with admission:
- one for Virtual Machine (virt-launcher) pods
- one for "regular" pods unrelated to VMs
In both cases, the pods are mounting persistent storage.
When a node is remediated, SNR cordons the node and places 2 taints on it:
taints:
- effect: NoExecute
  key: medik8s.io/remediation
  timeAdded: "2025-11-12T21:16:58Z"
  value: self-node-remediation
- effect: NoExecute
  key: node.kubernetes.io/out-of-service
  timeAdded: "2025-11-12T21:22:00Z"
  value: nodeshutdown
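While remediation is in progress, the taints and cordon can be confirmed directly on the node; a minimal sketch, using the node name from this report:

oc get node worker-cluster-6b9pp-2 -o jsonpath='{.spec.taints}{"\n"}'
oc get node worker-cluster-6b9pp-2 -o jsonpath='{.spec.unschedulable}{"\n"}'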
The node is then NotReady and SchedulingDisabled
worker-cluster-6b9pp-2 NotReady,SchedulingDisabled worker 3d11h v1.33.5
When the node is remediated successfully, it reboots and comes back:
worker-cluster-6b9pp-2 Ready,SchedulingDisabled worker 3d11h v1.33.5
What we then see is pods in unrecoverable states:
For the VM's virt-launcher pod, it goes from Running to Terminating to UnexpectedAdmissionError to Init:0/1:
virt-launcher-rhel9-aquamarine-lynx-30-x2dbh   2/2   Running                    0   5m28s   10.135.0.44   worker-cluster-6b9pp-2   <none>   1/1
virt-launcher-rhel9-aquamarine-lynx-30-x2dbh   2/2   Terminating                0   5m29s   10.135.0.44   worker-cluster-6b9pp-2   <none>   1/1
virt-launcher-rhel9-aquamarine-lynx-30-x2dbh   0/2   UnexpectedAdmissionError   0   6m49s   <none>        worker-cluster-6b9pp-2   <none>   1/1
virt-launcher-rhel9-aquamarine-lynx-30-x2dbh   0/2   Init:0/1                   0   6m51s   <none>        worker-cluster-6b9pp-2   <none>   1/1
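The transitions above were captured by watching the pod in the namespace that owns the VM; a sketch of the kind of command used (namespace and pod names taken from the logs below):

oc get pods -n virtualmachines -o wide -w | grep virt-launcher-rhel9-aquamarine-lynx-30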
We see the pod has Failed with an UnexpectedAdmissionError while, at the same time, PodScheduled is True and the container state is Waiting with reason PodInitializing.
Status: Failed
Reason: UnexpectedAdmissionError
Message: Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Conditions:
  Type                                   Status
  kubevirt.io/virtual-machine-unpaused   True
  DisruptionTarget                       True
  PodReadyToStartContainers              False
  Initialized                            False
  Ready                                  False
  ContainersReady                        False
  PodScheduled                           True
Warning NodeNotReady 3m15s node-controller Node is not ready
Warning UnexpectedAdmissionError 105s kubelet Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected
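The failing reason and message can be pulled straight from the pod status; a minimal sketch, assuming the pod and namespace names used throughout this report:

oc get pod virt-launcher-rhel9-aquamarine-lynx-30-x2dbh -n virtualmachines \
  -o jsonpath='{.status.phase}{" / "}{.status.reason}{": "}{.status.message}{"\n"}'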
When the node comes back from being restarted, we see the virt-launcher pod in the kubelet's SyncLoop ADD, but no container / pod sandbox is ever created. It is stuck in this limbo state and the VM is not live migrated to a new node.
A force deletion is needed, and at that point the VM is spun up on a new node.
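The force deletion that unsticks the pod is roughly the following (the pod is then recreated and scheduled onto a healthy node):

oc delete pod virt-launcher-rhel9-aquamarine-lynx-30-x2dbh -n virtualmachines \
  --grace-period=0 --force

Note that --force with --grace-period=0 removes the pod object without waiting for kubelet confirmation, which is generally only safe here because the node has already been power cycled.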
Nov 13 14:39:14.820534 worker-cluster-6b9pp-2 crio[2968]: time="2025-11-13T14:39:14.820496266Z" level=info msg="Deleting pod virtualmachines_virt-launcher-rhel9-aquamarine-lynx-30-x2dbh from CNI network \"multus-cni-network\" (type=multus-shim)"
Nov 13 14:39:17.179614 worker-cluster-6b9pp-2 kubenswrapper[3047]: I1113 14:39:17.179477 3047 kubelet.go:2537] "SyncLoop ADD" source="api" pods=["openshift-ingress-canary/ingress-canary-qsnhb","openshift-multus/network-metrics-daemon-vg8kd","openshift-cnv/cdi-operator-79c6478778-9zrsq","openshift-cnv/kubevirt-console-plugin-685bdc7b85-xkhtt","openshift-cnv/virt-handler-djfkn","openshift-dns/node-resolver-gflt2","openshift-multus/multus-additional-cni-plugins-rts5g","openshift-network-operator/iptables-alerter-dgfk7","openshift-ovn-kubernetes/ovnkube-node-bz49c","openshift-cnv/hco-operator-76998c776c-6cf9k","openshift-cnv/hco-webhook-599849cdd9-44fsx","openshift-cnv/hostpath-provisioner-operator-69d6c6d747-c58mj","openshift-cnv/kube-cni-linux-bridge-plugin-wfxhm","openshift-cnv/kubevirt-apiserver-proxy-b6886fb79-7c2l8","openshift-cnv/virt-operator-6b97fd9d94-6hqdk","openshift-image-registry/node-ca-8zcq2","openshift-insights/insights-runtime-extractor-4n8nw","cert-manager-operator/cert-manager-operator-controller-manager-66555bc98f-pbwdb","openshift-cnv/cdi-apiserver-69676554fb-9zkb9","openshift-cnv/hyperconverged-cluster-cli-download-5597c4bdc4-96ltd","openshift-kni-infra/coredns-worker-cluster-6b9pp-2","openshift-kni-infra/keepalived-worker-cluster-6b9pp-2","openshift-machine-config-operator/kube-rbac-proxy-crio-worker-cluster-6b9pp-2","openshift-storage/ceph-csi-controller-manager-7b79ccfd44-4w6dd","openshift-storage/ocs-operator-6949bbd68c-x54gx","openshift-cnv/aaq-operator-5ddd8d89bd-dkhcs","openshift-gitops-operator/openshift-gitops-operator-controller-manager-5f746d874c-vp8td","openshift-gitops/openshift-gitops-repo-server-74c6cbfbfd-99kmr","openshift-monitoring/node-exporter-5m7bh","openshift-network-console/networking-console-plugin-69c55d76df-mvvpj","openshift-network-diagnostics/network-check-target-vv9vn","openshift-storage/csi-addons-controller-manager-85669ccf88-mn4j2","openshift-storage/odf-console-76586598f4-pvnp8","openshift-gitops/openshift-gitops-application-controller-0","openshift-machine-config-operator/machine-config-daemon-t6spq","virtualmachines/virt-launcher-rhel9-aquamarine-lynx-30-x2dbh","openshift-cluster-node-tuning-operator/tuned-jmvjj","openshift-cnv/bridge-marker-kcfj6","openshift-cnv/cluster-network-addons-operator-c8cc6ff98-w4htd","openshift-kube-descheduler-operator/descheduler-operator-7d4f6454c9-rw42f","openshift-multus/multus-d2wpb","openshift-workload-availability/self-node-remediation-ds-tl2hq","showroom-6b9pp-1/showroom-dbfdcc845-rxmtn","openshift-dns/dns-default-g86dq","openshift-gitops/openshift-gitops-dex-server-fdbf9d89f-r4kvl","openshift-gitops/openshift-gitops-server-5488b8bd69-mrjpx","openshift-ingress/router-default-977bcd56b-rfthg"]
Nov 13 14:39:17.222953 worker-cluster-6b9pp-2 kubenswrapper[3047]: I1113 14:39:17.222792 3047 kubelet.go:2420] "Pod admission denied" podUID="73f6f4e2-14d6-4ea4-8ee3-ffea670488d8" pod="virtualmachines/virt-launcher-rhel9-aquamarine-lynx-30-x2dbh" reason="UnexpectedAdmissionError" message="Allocate failed due to no healthy devices present; cannot allocate unhealthy devices devices.kubevirt.io/vhost-net, which is unexpected"
For a regular pod (not a virt-launcher) under the same conditions as above, we see the same PodScheduled=True while the container state is still Waiting, but this time the rejection is TaintToleration. I would assume that if the virt-launcher pod had not hit the UnexpectedAdmissionError, it would then fail on TaintToleration as well.
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-11-13T14:37:58Z"
    message: 'Taint manager: deleting due to NoExecute taint'
    reason: DeletionByTaintManager
    status: "True"
    type: DisruptionTarget
  - lastProbeTime: null
    lastTransitionTime: "2025-11-13T14:39:28Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-11-13T14:39:28Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-11-13T14:39:28Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-11-13T14:39:28Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-11-13T14:39:28Z"
    status: "True"
    type: PodScheduled
  state:
    waiting:
      reason: ContainerCreating
  message: 'Pod was rejected: Predicate TaintToleration failed: node(s) had taints that the pod didn''t tolerate'
  phase: Failed
  qosClass: Burstable
  reason: TaintToleration
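For reference, getting past the kubelet's TaintToleration predicate during this window would require the pod to tolerate the NoExecute taints shown earlier, i.e. something like the following in the pod spec (illustrative only, not a suggested fix, since the out-of-service taint exists precisely to evict these pods):

tolerations:
- key: node.kubernetes.io/out-of-service
  operator: Equal
  value: nodeshutdown
  effect: NoExecute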
Looking at the logs, we see the same situation where the SyncLoop ADD has the pod, followed by the rejection.
Nov 13 14:39:15.246269 worker-cluster-6b9pp-2 crio[2968]: time="2025-11-13T14:39:15.246162522Z" level=info msg="Deleting pod showroom-6b9pp-1_showroom-dbfdcc845-rxmtn from CNI network \"multus-cni-network\" (type=multus-shim)"
Nov 13 14:39:17.179614 worker-cluster-6b9pp-2 kubenswrapper[3047]: I1113 14:39:17.179477 3047 kubelet.go:2537] "SyncLoop ADD" source="api" pods=["openshift-ingress-canary/ingress-canary-qsnhb","openshift-multus/network-metrics-daemon-vg8kd","openshift-cnv/cdi-operator-79c6478778-9zrsq","openshift-cnv/kubevirt-console-plugin-685bdc7b85-xkhtt","openshift-cnv/virt-handler-djfkn","openshift-dns/node-resolver-gflt2","openshift-multus/multus-additional-cni-plugins-rts5g","openshift-network-operator/iptables-alerter-dgfk7","openshift-ovn-kubernetes/ovnkube-node-bz49c","openshift-cnv/hco-operator-76998c776c-6cf9k","openshift-cnv/hco-webhook-599849cdd9-44fsx","openshift-cnv/hostpath-provisioner-operator-69d6c6d747-c58mj","openshift-cnv/kube-cni-linux-bridge-plugin-wfxhm","openshift-cnv/kubevirt-apiserver-proxy-b6886fb79-7c2l8","openshift-cnv/virt-operator-6b97fd9d94-6hqdk","openshift-image-registry/node-ca-8zcq2","openshift-insights/insights-runtime-extractor-4n8nw","cert-manager-operator/cert-manager-operator-controller-manager-66555bc98f-pbwdb","openshift-cnv/cdi-apiserver-69676554fb-9zkb9","openshift-cnv/hyperconverged-cluster-cli-download-5597c4bdc4-96ltd","openshift-kni-infra/coredns-worker-cluster-6b9pp-2","openshift-kni-infra/keepalived-worker-cluster-6b9pp-2","openshift-machine-config-operator/kube-rbac-proxy-crio-worker-cluster-6b9pp-2","openshift-storage/ceph-csi-controller-manager-7b79ccfd44-4w6dd","openshift-storage/ocs-operator-6949bbd68c-x54gx","openshift-cnv/aaq-operator-5ddd8d89bd-dkhcs","openshift-gitops-operator/openshift-gitops-operator-controller-manager-5f746d874c-vp8td","openshift-gitops/openshift-gitops-repo-server-74c6cbfbfd-99kmr","openshift-monitoring/node-exporter-5m7bh","openshift-network-console/networking-console-plugin-69c55d76df-mvvpj","openshift-network-diagnostics/network-check-target-vv9vn","openshift-storage/csi-addons-controller-manager-85669ccf88-mn4j2","openshift-storage/odf-console-76586598f4-pvnp8","openshift-gitops/openshift-gitops-application-controller-0","openshift-machine-config-operator/machine-config-daemon-t6spq","virtualmachines/virt-launcher-rhel9-aquamarine-lynx-30-x2dbh","openshift-cluster-node-tuning-operator/tuned-jmvjj","openshift-cnv/bridge-marker-kcfj6","openshift-cnv/cluster-network-addons-operator-c8cc6ff98-w4htd","openshift-kube-descheduler-operator/descheduler-operator-7d4f6454c9-rw42f","openshift-multus/multus-d2wpb","openshift-workload-availability/self-node-remediation-ds-tl2hq","showroom-6b9pp-1/showroom-dbfdcc845-rxmtn","openshift-dns/dns-default-g86dq","openshift-gitops/openshift-gitops-dex-server-fdbf9d89f-r4kvl","openshift-gitops/openshift-gitops-server-5488b8bd69-mrjpx","openshift-ingress/router-default-977bcd56b-rfthg"]
Nov 13 14:39:17.214210 worker-cluster-6b9pp-2 kubenswrapper[3047]: I1113 14:39:17.214156 3047 predicate.go:212] "Predicate failed on Pod" pod="showroom-6b9pp-1/showroom-dbfdcc845-rxmtn" err="Predicate TaintToleration failed: node(s) had taints that the pod didn't tolerate"
Nov 13 14:39:17.214210 worker-cluster-6b9pp-2 kubenswrapper[3047]: I1113 14:39:17.214194 3047 kubelet.go:2420] "Pod admission denied" podUID="49730073-bf5a-466a-9de5-64dc6a6cdbe3" pod="showroom-6b9pp-1/showroom-dbfdcc845-rxmtn" reason="TaintToleration" message="Predicate TaintToleration failed: node(s) had taints that the pod didn't tolerate"
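The admission denials are straightforward to locate in the kubelet journal on the affected node; a sketch, assuming oc adm node-logs access to the node:

oc adm node-logs worker-cluster-6b9pp-2 -u kubelet | grep -E 'Pod admission denied|Predicate failed'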
Version-Release number of selected component (if applicable):
4.20.0
How reproducible:
Always
Steps to Reproduce:
1. Demo Lab Cluster
2. Set up default NHC / SNR
3. Create a VM and/or a pod using a PVC
4. Kill the node (e.g., kill the kubelet or shut down the worker; see the sketch after this list)
5. Wait for remediation
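For step 4, one way to crash the node is to stop the kubelet (or power the worker off) from a debug shell; a sketch against the node used in this report:

oc debug node/worker-cluster-6b9pp-2 -- chroot /host systemctl stop kubelet
# or, for a harder failure:
oc debug node/worker-cluster-6b9pp-2 -- chroot /host systemctl poweroff --force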
Actual results:
Pods are not left in a clean state, VMs are not successfully migrated, and remediation does not complete after the reboot (it gets hung).
Expected results:
Pods / VMs are migrated and remediation completes successfully.
Additional info: