OpenShift Virtualization / CNV-61005

ipam-extensions falls into CrashLoopBackOff after node reboot



      Description of problem:

      After restarting a node or temporarily bringing down its interface, and then deleting the pod associated with the VM, the VM that was previously running on that node fails to recover on a healthy node. Specifically, the corresponding VMI remains in the Pending state and the virt-launcher pod is never created.
      
      Upon describing the VMI, I noticed the following message:
      [root@sdoc-bastion monitor]# oc describe vmi -A
      Name:         vm-0
      Namespace:    pmtps1
      Labels:       kubevirt.io/vm=vm-0
                    podmon.dellemc.com/driver=csi-powerstore
      Annotations:  kubevirt.io/latest-observed-api-version: v1
                    kubevirt.io/storage-observed-api-version: v1
                    kubevirt.io/vm-generation: 1
      API Version:  kubevirt.io/v1
      Kind:         VirtualMachineInstance
      Metadata:
        Creation Timestamp:  2025-04-29T09:10:53Z
        Finalizers:
          kubevirt.io/virtualMachineControllerFinalize
          foregroundDeleteVirtualMachine
        Generation:  2
        Owner References:
          API Version:           kubevirt.io/v1
          Block Owner Deletion:  true
          Controller:            true
          Kind:                  VirtualMachine
          Name:                  vm-0
          UID:                   0ab8f836-ae53-4ff1-ba79-50900681616f
        Resource Version:        5331115
        UID:                     dca14fe1-b6a8-4a91-9961-bc54a8a19ddf
      Spec:
        Architecture:  amd64
        Domain:
          Cpu:
            Cores:        1
            Max Sockets:  4
            Model:        host-model
            Sockets:      1
            Threads:      1
          Devices:
            Disks:
              Disk:
                Bus:  virtio
              Name:   containerdisk
              Disk:
                Bus:  virtio
              Name:   cloudinitdisk
              Disk:
                Bus:  virtio
              Name:   filesystem-disk0
            Interfaces:
              Masquerade:
              Name:  default
          Features:
            Acpi:
              Enabled:  true
          Firmware:
            Uuid:  65471cd8-c6e5-5992-ad0f-4838b99c19af
          Machine:
            Type:  pc-q35-rhel9.4.0
          Memory:
            Guest:  2G
          Resources:
            Requests:
              Memory:       2G
        Eviction Strategy:  LiveMigrate
        Networks:
          Name:  default
          Pod:
        Termination Grace Period Seconds:  0
        Tolerations:
          Effect:              NoExecute
          Key:                 node.kubernetes.io/unreachable
          Operator:            Exists
          Toleration Seconds:  300
        Volumes:
          Container Disk:
            Image:              kubevirt/fedora-with-test-tooling-container-disk:devel
            Image Pull Policy:  IfNotPresent
          Name:                 containerdisk
          Cloud Init No Cloud:
            User Data:  #cloud-config
      password: fedora
      chpasswd: { expire: False }
          Name:  cloudinitdisk
          Name:  filesystem-disk0
          Persistent Volume Claim:
            Claim Name:  vm-filesystem-pvc0
      Status:
        Conditions:
          Last Probe Time:       2025-04-29T09:11:03Z
          Last Transition Time:  2025-04-29T09:11:03Z
          Message:               virt-launcher pod has not yet been scheduled
          Reason:                PodNotExists
          Status:                False
          Type:                  Ready
          Last Probe Time:       <nil>
          Last Transition Time:  2025-04-29T09:11:03Z
          Message:               failed to create virtual machine pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
          Reason:                FailedCreate
          Status:                False
          Type:                  Synchronized
        Current CPU Topology:
          Cores:    1
          Sockets:  1
          Threads:  1
        Guest OS Info:
        Memory:
          Guest At Boot:    2G
          Guest Current:    2G
          Guest Requested:  2G
        Phase:              Pending
        Phase Transition Timestamps:
          Phase:                        Pending
          Phase Transition Timestamp:   2025-04-29T09:10:53Z
        Runtime User:                   107
        Virtual Machine Revision Name:  revision-start-vm-0ab8f836-ae53-4ff1-ba79-50900681616f-1
      Events:
        Type     Reason            Age                      From                         Message
        ----     ------            ----                     ----                         -------
        Normal   SuccessfulCreate  11s                      disruptionbudget-controller  Created PodDisruptionBudget kubevirt-disruption-budget-xtm46
        Warning  FailedCreate      1s                       virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded
        Warning  FailedCreate      <invalid> (x13 over 1s)  virtualmachine-controller    Error creating pod: Internal error occurred: failed calling webhook "ipam-claims.k8s.cni.cncf.io": failed to call webhook: Post "https://kubevirt-ipam-controller-webhook-service.openshift-cnv.svc:443/mutate-v1-pod?timeout=10s": no endpoints available for service "kubevirt-ipam-controller-webhook-service"
      [root@sdoc-bastion monitor]#
      
      
      Upon checking the kubevirt-ipam-controller-manager pod, it is in a CrashLoopBackOff state:
      [root@sdoc-bastion monitor]#  oc get pods -n openshift-cnv | grep ipam
      kubevirt-ipam-controller-manager-66c7d4d75-pcz82       0/1     CrashLoopBackOff   26 (2m41s ago)    19h
      [root@sdoc-bastion monitor]#
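
      To help confirm that the webhook failure is caused by the crashing controller, a possible diagnostic sequence is sketched below. These are standard oc calls; the service, webhook, and deployment names are taken from the output above, and the exact controller pod name will differ per cluster:

      # Check whether the webhook service has any ready endpoints
      oc get endpoints kubevirt-ipam-controller-webhook-service -n openshift-cnv

      # Find the mutating webhook configuration that references ipam-claims.k8s.cni.cncf.io
      oc get mutatingwebhookconfigurations -o name | grep -i ipam

      # Capture logs from the previous (crashed) container of the controller pod
      oc logs -n openshift-cnv <kubevirt-ipam-controller-manager-pod> --previous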
      
      
      
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      We rebooted the worker nodes one at a time, following the flow outlined below. The reproduction uses podmon (Resiliency); example commands for the manual steps are sketched after the list.
      1. Deploy a VM with the podmon (Resiliency) label.
      2. Wait for the VM to reach the Running state. Once the virt-launcher pod is up and running, proceed to the next step. 
      3. Induce a failure (e.g., node reboot or interface down).
      4. Wait for Resiliency to add taints to the node due to the failure.
      5. Delete the terminating pod.
      6. Kubernetes will then reschedule the VM/VMI/Pod on a healthy node.
      7. Once the failed node comes back online, perform cleanup on the storage and clear the taints from the node.
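
      For reference, a hedged sketch of the manual steps above. Node, namespace, and pod names are placeholders, and the taint key applied by Resiliency is an assumption that may differ in your environment:

      # Step 3: induce a failure by rebooting the worker node
      oc debug node/<node> -- chroot /host systemctl reboot

      # Step 5: force-delete the virt-launcher pod stuck in Terminating
      oc delete pod <virt-launcher-pod> -n <namespace> --grace-period=0 --force

      # Step 7: after the node is back and storage cleanup is done, clear the taint
      # (assumed taint key; use the key Resiliency actually applied)
      oc adm taint nodes <node> podmon.dellemc.com/driver-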

      Actual results:

      The kubevirt-ipam-controller-manager pod enters CrashLoopBackOff after the node reboot, the ipam-claims.k8s.cni.cncf.io webhook cannot be reached, and the VMI remains in Pending with no virt-launcher pod created.

      Expected results:

      The kubevirt-ipam-controller-manager pod recovers on its own after the node reboot, and the VMI is rescheduled and starts on a healthy node without requiring a manual rollout restart.

      Additional info:

      Manually restarting the deployment resolves the issue:
      oc rollout restart deployment kubevirt-ipam-controller-manager -n openshift-cnv
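
      After the restart, the controller pod should return to Running and the pending VMI should get its virt-launcher pod; for example:

      oc get pods -n openshift-cnv | grep ipam
      oc get vmi -n pmtps1
      oc get pods -n pmtps1 | grep virt-launcher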
