Bug
Resolution: Unresolved
Minor
Quality / Stability / Reliability
Description of problem:
After restarting a node or temporarily bringing down its interface, then deleting the VM's associated pod, the VM that was previously running on that node fails to recover on a healthy node. Specifically, the corresponding VMI remains in the Pending state, and no virt-launcher pod is created.
Describing the VMI shows the following message:
[root@sdoc-bastion monitor]# oc describe vmi -A
Name:         vm-0
Namespace:    pmtps1
Labels:       kubevirt.io/vm=vm-0
              podmon.dellemc.com/driver=csi-powerstore
Annotations:  kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1
              kubevirt.io/vm-generation: 1
API Version:  kubevirt.io/v1
Kind:         VirtualMachineInstance
Metadata:
  Creation Timestamp:  2025-04-29T09:10:53Z
  Finalizers:
    kubevirt.io/virtualMachineControllerFinalize
    foregroundDeleteVirtualMachine
  Generation:  2
  Owner References:
    API Version:           kubevirt.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  VirtualMachine
    Name:                  vm-0
    UID:                   0ab8f836-ae53-4ff1-ba79-50900681616f
  Resource Version:  5331115
  UID:               dca14fe1-b6a8-4a91-9961-bc54a8a19ddf
Spec:
  Architecture:  amd64
  Domain:
    Cpu:
      Cores:        1
      Max Sockets:  4
      Model:        host-model
      Sockets:      1
      Threads:      1
    Devices:
      Disks:
        Disk:
          Bus:   virtio
          Name:  containerdisk
        Disk:
          Bus:   virtio
          Name:  cloudinitdisk
        Disk:
          Bus:   virtio
          Name:  filesystem-disk0
      Interfaces:
        Masquerade:
        Name:  default
    Features:
      Acpi:
        Enabled:  true
    Firmware:
      Uuid:  65471cd8-c6e5-5992-ad0f-4838b99c19af
    Machine:
      Type:  pc-q35-rhel9.4.0
    Memory:
      Guest:  2G
    Resources:
      Requests:
        Memory:  2G
  Eviction Strategy:  LiveMigrate
  Networks:
    Name:  default
    Pod:
  Termination Grace Period Seconds:  0
  Tolerations:
    Effect:              NoExecute
    Key:                 node.kubernetes.io/unreachable
    Operator:            Exists
    Toleration Seconds:  300
  Volumes:
    Container Disk:
      Image:              kubevirt/fedora-with-test-tooling-container-disk:devel
      Image Pull Policy:  IfNotPresent
    Name:  containerdisk
    Cloud Init No Cloud:
      User Data:  #cloud-config
password: fedora
chpasswd: { expire: False }
    Name:  cloudinitdisk
    Name:  filesystem-disk0
    Persistent Volume Claim:
      Claim Name:  vm-filesystem-pvc0
Status:
  Conditions:
    Last Probe Time:       2025-04-29T09:11:03Z
    Last Transition Time:  2025-04-29T09:11:03Z
    Message:               virt-launcher pod has not yet been scheduled
    Reason:                PodNotExists
    Status:                False
    Type:                  Ready
    Last Probe Time:       <nil>
    Last Transition Time:  2025-04-29T09:11:03Z
    Message:               failed to create virtual machine pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
    Reason:                FailedCreate
    Status:                False
    Type:                  Synchronized
  Current CPU Topology:
    Cores:    1
    Sockets:  1
    Threads:  1
  Guest OS Info:
  Memory:
    Guest At Boot:    2G
    Guest Current:    2G
    Guest Requested:  2G
  Phase:  Pending
  Phase Transition Timestamps:
    Phase:                       Pending
    Phase Transition Timestamp:  2025-04-29T09:10:53Z
  Runtime User:                   107
  Virtual Machine Revision Name:  revision-start-vm-0ab8f836-ae53-4ff1-ba79-50900681616f-1
Events:
  Type     Reason            Age                      From                         Message
  ----     ------            ----                     ----                         -------
  Normal   SuccessfulCreate  11s                      disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-xtm46
  Warning  FailedCreate      1s                       virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
  Warning  FailedCreate      <invalid> (x13 over 1s)  virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": no endpoints available for service "kubevirt-ipam-controller-webhook-service"
[root@sdoc-bastion monitor]#
Upon checking the kubevirt-ipam-controller-manager pod, it is in CrashLoopBackOff state:
[root@sdoc-bastion monitor]# oc get pods -n openshift-cnv | grep ipam
kubevirt-ipam-controller-manager-66c7d4d75-pcz82 0/1 CrashLoopBackOff 26 (2m41s ago) 19h
[root@sdoc-bastion monitor]#
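To confirm the failure chain (crashing controller -> webhook Service with no ready endpoints -> pod creation rejected), the following diagnostic commands can be run; this is a sketch, and the exact pod/deployment names are taken from the output above:

```shell
# The "no endpoints available" event implies this Endpoints object is empty
# while the controller pod is crash-looping.
oc get endpoints kubevirt-ipam-controller-webhook-service -n openshift-cnv

# Pull logs from the previous (crashed) container instance to find the
# reason for the CrashLoopBackOff.
oc logs -n openshift-cnv deployment/kubevirt-ipam-controller-manager --previous

# Show the mutating webhook configuration ("ipam-claims.k8s.cni.cncf.io")
# that intercepts pod creation and therefore blocks the virt-launcher pod.
oc get mutatingwebhookconfigurations -o name | grep -i ipam
```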
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
We performed a reboot of the worker nodes one at a time, following the flow outlined below. The reproduction uses podmon (Dell Resiliency):
1. Deploy a VM with the podmon (Resiliency) label.
2. Wait for the VM to reach the Running state. Once the virt-launcher pod is up and running, proceed to the next step.
3. Induce a failure (e.g., node reboot or interface down).
4. Wait for Resiliency to add taints to the node due to the failure.
5. Delete the terminating pod.
6. Kubernetes then reschedules the VM/VMI/pod on a healthy node.
7. Once the failed node comes back online, perform cleanup on the storage and clear the taints from the node.
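The failure-injection part of the flow above can be sketched with standard oc commands. Node and pod names are placeholders, and the taint key applied by podmon depends on the Resiliency version, so treat this as an illustration rather than the exact procedure:

```shell
# 3. Induce a failure on the node currently running the VM (example: reboot).
oc debug node/<worker-node> -- chroot /host systemctl reboot

# 4. Watch for the taint that Resiliency (podmon) adds to the failed node.
oc get node <worker-node> -o jsonpath='{.spec.taints}'

# 5. Force-delete the virt-launcher pod stuck in Terminating so that
#    the VMI can be rescheduled on a healthy node.
oc delete pod <virt-launcher-pod> -n pmtps1 --force --grace-period=0

# 7. After the node recovers and storage cleanup is done, clear the
#    podmon taint (trailing "-" removes a taint by key).
oc adm taint nodes <worker-node> <podmon-taint-key>-
```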
Actual results:
Expected results:
Additional info:
Manually restarting the controller deployment resolves the issue:
oc rollout restart deployment kubevirt-ipam-controller-manager -n openshift-cnv
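After the rollout restart, the recovery can be verified end to end; the commands below assume the same namespaces and names as in the output above:

```shell
# Restart the IPAM controller and wait for the new replica to become Ready.
oc rollout restart deployment kubevirt-ipam-controller-manager -n openshift-cnv
oc rollout status deployment kubevirt-ipam-controller-manager -n openshift-cnv

# The webhook Service should now have endpoints again...
oc get endpoints kubevirt-ipam-controller-webhook-service -n openshift-cnv

# ...and the Pending VMI should get its virt-launcher pod created.
oc get vmi -n pmtps1
oc get pods -n pmtps1 | grep virt-launcher
```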
is triggering: CNV-65179 [Chaos][TC] Add a case for CNV-61005 (IPAM webbook) (status: New)