OpenShift Virtualization / CNV-48220

Disrupting a worker node leads to 15 minutes of node and VMI downtime after the node is back


    • Bug
    • Resolution: Done
    • Major
    • CNV v4.16.0
    • CNV Infrastructure

      Description of problem:

      When a worker node is disrupted (stopped and started), it takes about 15 minutes for the node to return to Ready after the disrupted node is back online on the infrastructure side. During those 15 minutes, all the VMIs/user workloads running on the node are not Ready, i.e. downtime for the user.
      
      Node downtime log:
        Normal   NodeNotReady             20m (x2 over 70m)       node-controller      Node cc37-h35-000-r750 status is now: NodeNotReady
        Normal   NodeUnresponsive         17m                     node-controller      virt-handler is not responsive, marking node as unresponsive
        Normal   Starting                 83s                     kubelet              Starting kubelet.
        Normal   Starting                 73s                     kubelet              Starting kubelet.
        Normal   NodeAllocatableEnforced  72s                     kubelet              Updated Node Allocatable limit across pods
        Normal   NodeHasSufficientMemory  72s (x2 over 72s)       kubelet              Node cc37-h35-000-r750 status is now: NodeHasSufficientMemory
        Normal   NodeHasNoDiskPressure    72s (x2 over 72s)       kubelet              Node cc37-h35-000-r750 status is now: NodeHasNoDiskPressure
        Normal   NodeHasSufficientPID     72s (x2 over 72s)       kubelet              Node cc37-h35-000-r750 status is now: NodeHasSufficientPID
        Warning  Rebooted                 72s                     kubelet              Node cc37-h35-000-r750 has been rebooted, boot id: d9486e1d-2e00-4337-ba07-2ec85917952e
        Normal   NodeNotReady             72s                     kubelet              Node cc37-h35-000-r750 status is now: NodeNotReady
        Normal   NodeReady                33s                     kubelet              Node cc37-h35-000-r750 status is now: NodeReady
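
      For reference, the node transition and the events above can be watched with standard oc commands; a minimal sketch using the node name from this report:

        # Watch the node go NotReady -> Ready after the disruption
        oc get node cc37-h35-000-r750 -w

        # Node events (the source of the log above) are listed at the end of the describe output
        oc describe node cc37-h35-000-r750

        # Alternatively, sort all events for this node by time to measure the recovery window
        oc get events -A --field-selector involvedObject.name=cc37-h35-000-r750 --sort-by=.lastTimestamp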
      
      VMI status:
      
      NAME                      AGE   PHASE     IP             NODENAME            READY
      windows-vm-8a4120d4-0     24m   Running   10.128.2.19    cc37-h35-000-r750   False
      windows-vm-8a4120d4-1     24m   Running   10.128.2.20    cc37-h35-000-r750   False
      windows-vm-8a4120d4-10    24m   Running   10.128.2.31    cc37-h35-000-r750   False
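
      The VMI readiness shown above comes from the VMI list; a minimal sketch (grepping on the node name is just one way to narrow the output):

        # List VMIs in all namespaces with phase, node name and READY condition
        oc get vmi -A

        # Only the VMIs on the disrupted node
        oc get vmi -A -o wide | grep cc37-h35-000-r750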
      
      
      After the node gets back to the Ready state in OpenShift, all the VMIs get migrated/moved to other worker nodes. This leaves the targeted node underutilized while the other worker nodes can become overloaded, depending on the user load. It might be better to migrate them during the outage instead, to avoid extended downtime.
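
      Whether the VMIs were live-migrated or simply rescheduled after recovery can be cross-checked from KubeVirt's migration objects; a minimal sketch, assuming the vmim short name for VirtualMachineInstanceMigration is available on the cluster:

        # Migration objects and their phases; compare their creation timestamps with the NodeReady event
        oc get vmim -A

        # Where each VMI ended up afterwards
        oc get vmi -A -o wide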
      
      

      Version-Release number of selected component (if applicable):

      4.16.6

      How reproducible:

        Always

      Steps to Reproduce:

      1. Install CNV on a 4.16 bare-metal cluster with user workloads/VMIs running
      2. Use https://github.com/redhat-chaos/krkn-hub/blob/main/docs/node-scenarios.md to disrupt (stop and start) one of the worker nodes (a hedged command sketch follows this list)
      3. Observe the status of the node from both the Kubernetes/OpenShift side and the infrastructure side
      4. Observe the status of the VMIs running on the targeted node
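
      A hedged sketch of step 2; the image tag, environment variables, and kubeconfig mount path below are assumptions based on the linked node-scenarios document and should be verified against it (steps 3 and 4 use the oc commands shown in the description above):

        # Stop and start one worker node via krkn-hub (values are assumptions; see the linked docs)
        podman run --net=host \
          -v "$HOME/.kube/config:/home/krkn/.kube/config:Z" \
          -e NODE_NAME=cc37-h35-000-r750 \
          -e ACTION=node_stop_start_scenario \
          quay.io/redhat-chaos/krkn-hub:node-scenarios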

      Actual results:

      1. It takes about 15 minutes for the node to reach the Ready phase on the Kubernetes/OpenShift side, and all the VMIs are not Ready during that period.
      2. The VMIs get migrated only after the node is Ready again, leaving the targeted node underutilized while the other worker nodes get overloaded.

      Expected results:

      1. The node should recover quickly to avoid VMI downtime for the user.
      2. Avoid migrating the VMIs after the node is back to the Ready phase; instead, migrate them during the outage, where possible, to avoid extended VMI downtime.

      Additional info:
