  1. OpenShift Virtualization
  2. CNV-47198

virt-handler is not updating the node label for an extended duration after a kubelet outage


      Description of problem:

      When a node goes down in an OpenShift bare-metal cluster with CNV installed, virt-handler does not update the node label to kubevirt.io/schedulable=false, even after more than 60 minutes. The node carries the proper taints to prevent scheduling, and the KubeVirt label should reflect the same state, i.e. kubevirt.io/schedulable=false instead of kubevirt.io/schedulable=true.
      
      Taints:             node.kubernetes.io/unreachable:NoExecute
                          node.kubernetes.io/unreachable:NoSchedule
      
      This might be because virt-handler stops running after the kubelet stops responding:
      [root@cc37-h25-000-r750 ~]# oc logs pod/virt-handler-r7m9k
      Defaulted container "virt-handler" out of: virt-handler, virt-launcher (init)
      Error from server: Get "https://198.18.10.9:10250/containerLogs/openshift-cnv/virt-handler-r7m9k/virt-handler": dial tcp 198.18.10.9:10250: connect: connection refused.
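
The mismatch described above (unreachable taints present while the label still reads true) could be checked with something like the following minimal sketch. The helper name check_schedulable_label is hypothetical and not part of any tool; the commented oc calls assume a NODE variable holding the affected node's name.

```shell
# Hypothetical helper (not part of oc or KubeVirt): compare the KubeVirt
# schedulable label against the node's taint keys.
#   $1 = value of the kubevirt.io/schedulable label ("true" or "false")
#   $2 = space-separated list of taint keys on the node
check_schedulable_label() {
  label="$1"
  taints="$2"
  case " $taints " in
    *" node.kubernetes.io/unreachable "*)
      # Node is tainted as unreachable, so the label is expected to be false.
      if [ "$label" = "true" ]; then
        echo "MISMATCH"
        return 1
      fi
      ;;
  esac
  echo "OK"
  return 0
}

# Usage against a live cluster (NODE is a placeholder for the node name):
# label=$(oc get node "$NODE" -o jsonpath='{.metadata.labels.kubevirt\.io/schedulable}')
# taints=$(oc get node "$NODE" -o jsonpath='{.spec.taints[*].key}')
# check_schedulable_label "$label" "$taints"
```

With the taints and label reported in this issue, the helper would print MISMATCH.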

      Version-Release number of selected component (if applicable):

      [root@cc37-h25-000-r750 ~]# oc get csv -n openshift-cnv
      NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
      kubevirt-hyperconverged-operator.v4.15.0   OpenShift Virtualization   4.15.0    kubevirt-hyperconverged-operator.v4.14.7   Replacing
      kubevirt-hyperconverged-operator.v4.15.1   OpenShift Virtualization   4.15.1    kubevirt-hyperconverged-operator.v4.15.0   Pending
      [root@cc37-h25-000-r750 ~]# oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.33   True        True          45m     Unable to apply 4.15.22: an unknown error has occurred: MultipleErrors    

      How reproducible:

       Always

      Steps to Reproduce:

      1. Install an OpenShift bare-metal cluster with CNV.
      2. Disrupt a worker node using the node stop scenario from https://github.com/krkn-chaos/krkn-hub/blob/main/docs/node-scenarios.md or by running systemctl stop kubelet on the node.
      3. Observe the kubevirt.io/schedulable= label on the node.
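
To measure how long the label takes to change after the disruption in the steps above, a polling loop along these lines may help. This is only a sketch: wait_for_unschedulable and get_label are hypothetical names, and the timeout and interval values in the usage comment are arbitrary choices, not values taken from this report.

```shell
# Hypothetical helper: poll a label getter until it returns "false" or the
# timeout expires.
#   $1 = name of a command or function that prints the current label value
#   $2 = timeout in seconds
#   $3 = polling interval in seconds (must be greater than 0)
wait_for_unschedulable() {
  getter="$1"
  timeout="$2"
  interval="$3"
  elapsed=0
  value=""
  while [ "$elapsed" -le "$timeout" ]; do
    value=$("$getter")
    if [ "$value" = "false" ]; then
      echo "label flipped to false after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "label still '${value}' after ${timeout}s"
  return 1
}

# Usage against a live cluster (NODE is a placeholder for the node name):
# get_label() {
#   oc get node "$NODE" -o jsonpath='{.metadata.labels.kubevirt\.io/schedulable}'
# }
# wait_for_unschedulable get_label 3600 30
```

Passing the getter as an argument keeps the loop independent of oc, so it can be exercised with a stub before being pointed at a real node.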
          

      Actual results:

      kubevirt.io/schedulable=true

      Expected results:

      The outage is detected and the label is set to kubevirt.io/schedulable=false.

      Additional info:

      Logs and must-gather: https://drive.google.com/drive/folders/15U4cfCWbKnRrytd-S00TDsw6uNf3URB3?usp=sharing

              sgott@redhat.com Stuart Gott
              nelluri Naga Ravi Chaitanya Elluri
              Kedar Bidarkar Kedar Bidarkar