Bug
Resolution: Unresolved
Minor
Quality / Stability / Reliability
Description of problem:
After restarting a node or temporarily bringing down its interface, and then deleting the VM's associated pod, the VM that was previously running on that node fails to recover on a healthy node. Specifically, the corresponding VMI remains in the Pending state and the virt-launcher pod is never created. Describing the VMI shows the following:

[root@sdoc-bastion monitor]# oc describe vmi -A
Name:         vm-0
Namespace:    pmtps1
Labels:       kubevirt.io/vm=vm-0
              podmon.dellemc.com/driver=csi-powerstore
Annotations:  kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1
              kubevirt.io/vm-generation: 1
API Version:  kubevirt.io/v1
Kind:         VirtualMachineInstance
Metadata:
  Creation Timestamp:  2025-04-29T09:10:53Z
  Finalizers:
    kubevirt.io/virtualMachineControllerFinalize
    foregroundDeleteVirtualMachine
  Generation:  2
  Owner References:
    API Version:           kubevirt.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  VirtualMachine
    Name:                  vm-0
    UID:                   0ab8f836-ae53-4ff1-ba79-50900681616f
  Resource Version:  5331115
  UID:               dca14fe1-b6a8-4a91-9961-bc54a8a19ddf
Spec:
  Architecture:  amd64
  Domain:
    Cpu:
      Cores:        1
      Max Sockets:  4
      Model:        host-model
      Sockets:      1
      Threads:      1
    Devices:
      Disks:
        Disk:
          Bus:  virtio
        Name:   containerdisk
        Disk:
          Bus:  virtio
        Name:   cloudinitdisk
        Disk:
          Bus:  virtio
        Name:   filesystem-disk0
      Interfaces:
        Masquerade:
        Name:  default
    Features:
      Acpi:
        Enabled:  true
    Firmware:
      Uuid:  65471cd8-c6e5-5992-ad0f-4838b99c19af
    Machine:
      Type:  pc-q35-rhel9.4.0
    Memory:
      Guest:  2G
    Resources:
      Requests:
        Memory:  2G
  Eviction Strategy:  LiveMigrate
  Networks:
    Name:  default
    Pod:
  Termination Grace Period Seconds:  0
  Tolerations:
    Effect:              NoExecute
    Key:                 node.kubernetes.io/unreachable
    Operator:            Exists
    Toleration Seconds:  300
  Volumes:
    Container Disk:
      Image:              kubevirt/fedora-with-test-tooling-container-disk:devel
      Image Pull Policy:  IfNotPresent
    Name:                 containerdisk
    Cloud Init No Cloud:
      User Data:  #cloud-config
                  password: fedora
                  chpasswd: { expire: False }
    Name:  cloudinitdisk
    Name:  filesystem-disk0
    Persistent Volume Claim:
      Claim Name:  vm-filesystem-pvc0
Status:
  Conditions:
    Last Probe Time:       2025-04-29T09:11:03Z
    Last Transition Time:  2025-04-29T09:11:03Z
    Message:               virt-launcher pod has not yet been scheduled
    Reason:                PodNotExists
    Status:                False
    Type:                  Ready
    Last Probe Time:       <nil>
    Last Transition Time:  2025-04-29T09:11:03Z
    Message:               failed to create virtual machine pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
    Reason:                FailedCreate
    Status:                False
    Type:                  Synchronized
  Current CPU Topology:
    Cores:    1
    Sockets:  1
    Threads:  1
  Guest OS Info:
  Memory:
    Guest At Boot:    2G
    Guest Current:    2G
    Guest Requested:  2G
  Phase:  Pending
  Phase Transition Timestamps:
    Phase:                         Pending
    Phase Transition Timestamp:    2025-04-29T09:10:53Z
  Runtime User:                    107
  Virtual Machine Revision Name:   revision-start-vm-0ab8f836-ae53-4ff1-ba79-50900681616f-1
Events:
  Type     Reason            Age                      From                         Message
  ----     ------            ----                     ----                         -------
  Normal   SuccessfulCreate  11s                      disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-xtm46
  Warning  FailedCreate      1s                       virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
  Warning  FailedCreate      <invalid> (x13 over 1s)  virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": no endpoints available for service "kubevirt-ipam-controller-webhook-service"
[root@sdoc-bastion monitor]#

Upon checking the kubevirt-ipam-controller pod, it is in CrashLoopBackOff:

[root@sdoc-bastion monitor]# oc get pods -n openshift-cnv | grep ipam
kubevirt-ipam-controller-manager-66c7d4d75-pcz82   0/1   CrashLoopBackOff   26 (2m41s ago)   19h
[root@sdoc-bastion monitor]#
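The FailedCreate events name the mutating webhook by its service URL. A minimal sketch of how to pull the backing service and namespace out of that URL before checking its endpoints (the URL is copied verbatim from the event above; the commented `oc` commands assume cluster access and the resource names shown in this report):

```shell
# Webhook URL taken verbatim from the FailedCreate event.
url='https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s'

# Strip the scheme, path, and port, leaving <service>.<namespace>.svc.
host=${url#https://}
host=${host%%/*}
host=${host%%:*}
svc=${host%%.*}
ns=${host#*.}; ns=${ns%%.*}
echo "service=$svc namespace=$ns"

# With cluster access, these show whether the webhook service has any
# endpoints and why the controller pod is crash looping:
#   oc get endpoints "$svc" -n "$ns"
#   oc logs deployment/kubevirt-ipam-controller-manager -n "$ns" --previous
```

An empty endpoints list here matches the "no endpoints available" event: the webhook's `failurePolicy` then blocks pod creation until the controller recovers.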
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
We rebooted the worker nodes one at a time, following the flow outlined below. The reproduction uses podmon:
1. Deploy a VM with the podmon (Resiliency) label.
2. Wait for the VM to reach the Running state. Once the virt-launcher pod is up and running, proceed to the next step.
3. Induce a failure (e.g., node reboot or interface down).
4. Wait for Resiliency to add taints to the node due to the failure.
5. Delete the terminating pod.
6. Kubernetes will then reschedule the VM/VMI/pod on a healthy node.
7. Once the failed node comes back online, perform cleanup on the storage and clear the taints from the node.
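Steps 2, 5, and 7 above can be sketched as `oc` commands. The namespace and VM name come from this report; the node name and, in particular, the taint key in step 7 are placeholders, since the exact taint podmon applies is an assumption here:

```shell
# Hypothetical helper covering steps 2, 5, and 7 of the reproduction flow.
wait_then_recover() {
  ns=pmtps1        # namespace from the report
  node=$1          # node that was failed (placeholder)

  # Step 2: wait for the VMI to reach Running before inducing the failure.
  oc wait vmi/vm-0 -n "$ns" --for=jsonpath='{.status.phase}'=Running --timeout=300s

  # Step 5: force-delete the virt-launcher pod stuck Terminating on the failed node
  # (virt-launcher pods carry the kubevirt.io/created-by label).
  pod=$(oc get pods -n "$ns" -l kubevirt.io/created-by -o name | head -n 1)
  oc delete "$pod" -n "$ns" --grace-period=0 --force

  # Step 7: once the node is back, clear the taint Resiliency added.
  # The taint key below is a placeholder, not the real podmon key.
  oc adm taint nodes "$node" podmon.example/failure- || true
}

# Requires cluster access, e.g.:
#   wait_then_recover worker-0
```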
Actual results:
Expected results:
Additional info:
Manually running oc rollout restart deployment kubevirt-ipam-controller-manager -n openshift-cnv resolves the issue.
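A minimal sketch of that workaround plus a verification pass, wrapped in a guard so it only runs where oc is available (deployment, service, and namespace names are taken from this report):

```shell
# Workaround from the report: restart the IPAM controller, then confirm the
# webhook service has endpoints again and the VMI can leave Pending.
restart_ipam_controller() {
  oc rollout restart deployment kubevirt-ipam-controller-manager -n openshift-cnv
  oc rollout status deployment kubevirt-ipam-controller-manager -n openshift-cnv --timeout=120s

  # Endpoints should be non-empty once the controller pod is Ready.
  oc get endpoints kubevirt-ipam-controller-webhook-service -n openshift-cnv
  oc get vmi -n pmtps1
}

if command -v oc >/dev/null 2>&1; then
  restart_ipam_controller
else
  echo "oc not found; run on a host with cluster access" >&2
fi
```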
is triggering:
CNV-65179 [Chaos][TC] Add a case for CNV-61005 (IPAM webbook) (New)