Bug
Resolution: Unresolved
Minor
Quality / Stability / Reliability
Description of problem:
After restarting a node or temporarily bringing down its interface, and then deleting the VM's associated pod, the VM that was previously running on that node fails to recover on a healthy node. Specifically, the corresponding VMI remains in the Pending state and the virt-launcher pod is never created. Describing the VMI shows the following:

[root@sdoc-bastion monitor]# oc describe vmi -A
Name:         vm-0
Namespace:    pmtps1
Labels:       kubevirt.io/vm=vm-0
              podmon.dellemc.com/driver=csi-powerstore
Annotations:  kubevirt.io/latest-observed-api-version: v1
              kubevirt.io/storage-observed-api-version: v1
              kubevirt.io/vm-generation: 1
API Version:  kubevirt.io/v1
Kind:         VirtualMachineInstance
Metadata:
  Creation Timestamp:  2025-04-29T09:10:53Z
  Finalizers:
    kubevirt.io/virtualMachineControllerFinalize
    foregroundDeleteVirtualMachine
  Generation:  2
  Owner References:
    API Version:           kubevirt.io/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  VirtualMachine
    Name:                  vm-0
    UID:                   0ab8f836-ae53-4ff1-ba79-50900681616f
  Resource Version:  5331115
  UID:               dca14fe1-b6a8-4a91-9961-bc54a8a19ddf
Spec:
  Architecture:  amd64
  Domain:
    Cpu:
      Cores:        1
      Max Sockets:  4
      Model:        host-model
      Sockets:      1
      Threads:      1
    Devices:
      Disks:
        Disk:
          Bus:  virtio
        Name:   containerdisk
        Disk:
          Bus:  virtio
        Name:   cloudinitdisk
        Disk:
          Bus:  virtio
        Name:   filesystem-disk0
      Interfaces:
        Masquerade:
        Name:  default
    Features:
      Acpi:
        Enabled:  true
    Firmware:
      Uuid:  65471cd8-c6e5-5992-ad0f-4838b99c19af
    Machine:
      Type:  pc-q35-rhel9.4.0
    Memory:
      Guest:  2G
    Resources:
      Requests:
        Memory:  2G
  Eviction Strategy:  LiveMigrate
  Networks:
    Name:  default
    Pod:
  Termination Grace Period Seconds:  0
  Tolerations:
    Effect:              NoExecute
    Key:                 node.kubernetes.io/unreachable
    Operator:            Exists
    Toleration Seconds:  300
  Volumes:
    Container Disk:
      Image:              kubevirt/fedora-with-test-tooling-container-disk:devel
      Image Pull Policy:  IfNotPresent
    Name:                 containerdisk
    Cloud Init No Cloud:
      User Data:  #cloud-config
                  password: fedora
                  chpasswd: { expire: False }
    Name:  cloudinitdisk
    Name:  filesystem-disk0
    Persistent Volume Claim:
      Claim Name:  vm-filesystem-pvc0
Status:
  Conditions:
    Last Probe Time:       2025-04-29T09:11:03Z
    Last Transition Time:  2025-04-29T09:11:03Z
    Message:               virt-launcher pod has not yet been scheduled
    Reason:                PodNotExists
    Status:                False
    Type:                  Ready
    Last Probe Time:       <nil>
    Last Transition Time:  2025-04-29T09:11:03Z
    Message:               failed to create virtual machine pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
    Reason:                FailedCreate
    Status:                False
    Type:                  Synchronized
  Current CPU Topology:
    Cores:    1
    Sockets:  1
    Threads:  1
  Guest OS Info:
  Memory:
    Guest At Boot:    2G
    Guest Current:    2G
    Guest Requested:  2G
  Phase:  Pending
  Phase Transition Timestamps:
    Phase:                         Pending
    Phase Transition Timestamp:    2025-04-29T09:10:53Z
  Runtime User:                    107
  Virtual Machine Revision Name:   revision-start-vm-0ab8f836-ae53-4ff1-ba79-50900681616f-1
Events:
  Type     Reason            Age                      From                         Message
  ----     ------            ----                     ----                         -------
  Normal   SuccessfulCreate  11s                      disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-xtm46
  Warning  FailedCreate      1s                       virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
  Warning  FailedCreate      <invalid> (x13 over 1s)  virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": no endpoints available for service "kubevirt-ipam-controller-webhook-service"
[root@sdoc-bastion monitor]#

Upon checking the kubevirt-ipam-controller pod, it is in CrashLoopBackOff:

[root@sdoc-bastion monitor]# oc get pods -n openshift-cnv | grep ipam
kubevirt-ipam-controller-manager-66c7d4d75-pcz82   0/1   CrashLoopBackOff   26 (2m41s ago)   19h
[root@sdoc-bastion monitor]#
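The FailedCreate events name the mutating webhook by its service URL. A minimal sketch of how to pull the backing service and namespace out of that URL before checking its endpoints (the URL is copied verbatim from the event above; the commented `oc` commands assume cluster access and the resource names shown in this report):

```shell
# Webhook URL taken verbatim from the FailedCreate event.
url='https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s'

# Strip the scheme, path, and port, leaving <service>.<namespace>.svc.
host=${url#https://}
host=${host%%/*}
host=${host%%:*}
svc=${host%%.*}
ns=${host#*.}; ns=${ns%%.*}
echo "service=$svc namespace=$ns"

# With cluster access, these show whether the webhook service has any
# endpoints and why the controller pod is crash looping:
#   oc get endpoints "$svc" -n "$ns"
#   oc logs deployment/kubevirt-ipam-controller-manager -n "$ns" --previous
```

An empty endpoints list here matches the "no endpoints available" event: the webhook's `failurePolicy` then blocks pod creation until the controller recovers.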
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
We rebooted the worker nodes one at a time, following the flow outlined below. The reproduction uses podmon:
1. Deploy a VM with the podmon (Resiliency) label.
2. Wait for the VM to reach the Running state. Once the virt-launcher pod is up and running, proceed to the next step.
3. Induce a failure (e.g., node reboot or interface down).
4. Wait for Resiliency to add taints to the node due to the failure.
5. Delete the terminating pod.
6. Kubernetes will then reschedule the VM/VMI/pod on a healthy node.
7. Once the failed node comes back online, perform cleanup on the storage and clear the taints from the node.
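Steps 2, 5, and 7 above can be sketched as `oc` commands. The namespace and VM name come from this report; the node name and, in particular, the taint key in step 7 are placeholders, since the exact taint podmon applies is an assumption here:

```shell
# Hypothetical helper covering steps 2, 5, and 7 of the reproduction flow.
wait_then_recover() {
  ns=pmtps1        # namespace from the report
  node=$1          # node that was failed (placeholder)

  # Step 2: wait for the VMI to reach Running before inducing the failure.
  oc wait vmi/vm-0 -n "$ns" --for=jsonpath='{.status.phase}'=Running --timeout=300s

  # Step 5: force-delete the virt-launcher pod stuck Terminating on the failed node
  # (virt-launcher pods carry the kubevirt.io/created-by label).
  pod=$(oc get pods -n "$ns" -l kubevirt.io/created-by -o name | head -n 1)
  oc delete "$pod" -n "$ns" --grace-period=0 --force

  # Step 7: once the node is back, clear the taint Resiliency added.
  # The taint key below is a placeholder, not the real podmon key.
  oc adm taint nodes "$node" podmon.example/failure- || true
}

# Requires cluster access, e.g.:
#   wait_then_recover worker-0
```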
Actual results:
Expected results:
Additional info:
Manually running oc rollout restart deployment kubevirt-ipam-controller-manager -n openshift-cnv resolves the issue.
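A minimal sketch of that workaround plus a verification pass, wrapped in a guard so it only runs where oc is available (deployment, service, and namespace names are taken from this report):

```shell
# Workaround from the report: restart the IPAM controller, then confirm the
# webhook service has endpoints again and the VMI can leave Pending.
restart_ipam_controller() {
  oc rollout restart deployment kubevirt-ipam-controller-manager -n openshift-cnv
  oc rollout status deployment kubevirt-ipam-controller-manager -n openshift-cnv --timeout=120s

  # Endpoints should be non-empty once the controller pod is Ready.
  oc get endpoints kubevirt-ipam-controller-webhook-service -n openshift-cnv
  oc get vmi -n pmtps1
}

if command -v oc >/dev/null 2>&1; then
  restart_ipam_controller
else
  echo "oc not found; run on a host with cluster access" >&2
fi
```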
is triggering:
CNV-65179 [Chaos][TC] Add a case for CNV-61005 (IPAM webbook) (New)