OCPBUGS-18796

SNO node gets drained when creating a new SriovNetworkNodePolicy leading to existing pods getting recreated



      Description of problem:

      When creating a new SriovNetworkNodePolicy, the SNO node gets drained, causing all existing pods to be restarted. The pods eventually recover, but some cluster operators and user workloads become unavailable while the pods are being terminated.
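
      The impact can be observed while the drain is in progress, for example (exact commands are illustrative suggestions, not taken from this report):

      watch oc get clusteroperators
      oc get pods --all-namespaces --field-selector=status.phase!=Running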

      Version-Release number of selected component (if applicable):

      4.14.0-rc.0
      sriov-network-operator.v4.14.0-202308242104

      How reproducible:

      100%

      Steps to Reproduce:

      1. On an SNO cluster, create a SriovNetworkNodePolicy:
      
      apiVersion: sriovnetwork.openshift.io/v1
      kind: SriovNetworkNodePolicy
      metadata:
        name: snnp1
        namespace: openshift-sriov-network-operator
      spec:
        deviceType: vfio-pci
        isRdma: false
        nicSelector:
          pfNames:
          - ens2f3#32-33
        nodeSelector:
          node-role.kubernetes.io/master: ""
        numVfs: 48
        priority: 99
        resourceName: snnp1
      
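      The policy can be applied with, for example (file name is illustrative):

      oc apply -f snnp1-policy.yaml
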
      
      2. Check the sriov-network-config-daemon logs and the status of the other pods.
      
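      For example (the app=sriov-network-config-daemon label is assumed from the operator's DaemonSet; verify against the actual pod labels):

      oc -n openshift-sriov-network-operator logs -l app=sriov-network-config-daemon -f
      oc get pods -A -o wide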

      Actual results:

      The logs show that the node gets drained and that all pods except DaemonSet-managed pods are evicted and restarted:
      
      I0911 14:07:10.657178   22691 daemon.go:804] annotateNode(): Annotate node sno.kni-qe-24.lab.eng.rdu2.redhat.com with: Draining
      I0911 14:07:10.680706   22691 daemon.go:599] nodeStateSyncHandler(): pause MCP
      I0911 14:07:10.690525   22691 daemon.go:916] pauseMCP(): pausing MCP
      I0911 14:07:10.737330   22691 daemon.go:952] pauseMCP(): pause MCP master
      I0911 14:07:10.773584   22691 daemon.go:804] annotateNode(): Annotate node sno.kni-qe-24.lab.eng.rdu2.redhat.com with: Draining_MCP_Paused
      I0911 14:07:10.829568   22691 daemon.go:608] nodeStateSyncHandler(): drain node
      I0911 14:07:10.829828   22691 daemon.go:1004] drainNode(): Update prepared
      I0911 14:07:10.829862   22691 daemon.go:1014] drainNode(): Start draining
      E0911 14:07:13.663558   22691 daemon.go:137] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-r9xrc, openshift-dns/dns-default-wrr86, openshift-dns/node-resolver-4tj87, openshift-image-registry/node-ca-wphrb, openshift-ingress-canary/ingress-canary-nf7p9, openshift-local-storage/diskmaker-manager-44hcj, openshift-logging/collector-25gbm, openshift-machine-config-operator/machine-config-daemon-hrlhx, openshift-machine-config-operator/machine-config-server-v5pmr, openshift-monitoring/node-exporter-s6m5q, openshift-multus/multus-additional-cni-plugins-7pjff, openshift-multus/multus-networkpolicy-gkwwp, openshift-multus/multus-w525k, openshift-multus/network-metrics-daemon-l65kc, openshift-ovn-kubernetes/ovnkube-node-qw8fz, openshift-ptp/linuxptp-daemon-hsrz4, openshift-sriov-network-operator/network-resources-injector-h7lk8, openshift-sriov-network-operator/operator-webhook-zmbpq, openshift-sriov-network-operator/sriov-device-plugin-vnz49, openshift-sriov-network-operator/sriov-network-config-daemon-xjvwj, vran-acceleration-operators/accelerator-discovery-bl88c, vran-acceleration-operators/sriov-device-plugin-zhtxj, vran-acceleration-operators/sriov-fec-daemonset-gs2bn
      I0911 14:07:13.665220   22691 daemon.go:137] evicting pod openshift-apiserver-operator/openshift-apiserver-operator-869988898d-pxxdg
      I0911 14:07:13.665302   22691 daemon.go:137] evicting pod openshift-cluster-node-tuning-operator/cluster-node-tuning-operator-57f8dbf5f-q5p6n
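
      The drain state can also be checked on the node object and the operator's per-node CR, for example (the sriovnetwork.openshift.io/state annotation key is inferred from the annotateNode log lines above):

      oc get node sno.kni-qe-24.lab.eng.rdu2.redhat.com -o jsonpath='{.metadata.annotations.sriovnetwork\.openshift\.io/state}'
      oc get sriovnetworknodestates -n openshift-sriov-network-operator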

      Expected results:

      Creating a new SriovNetworkNodePolicy on an SNO node does not cause other pods to be restarted.
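
      A simple way to verify this is to confirm that pod restart counts and creation timestamps are unchanged after the policy is applied, for example:

      oc get pods -A --sort-by=.metadata.creationTimestamp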

      Additional info:

      This appears to be a regression introduced in 4.14; the same scenario does not reproduce in 4.13.
      
      Attaching must-gather.
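
      A possible mitigation to evaluate (assuming the disableDrain field in the SriovOperatorConfig CRD, which tells the operator to skip draining; whether it is safe here is for the SR-IOV team to judge):

      oc patch sriovoperatorconfig default -n openshift-sriov-network-operator \
        --type merge -p '{"spec":{"disableDrain":true}}'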

      Assignee: Ian Miller (rhn-support-imiller)
      Reporter: Marius Cornea (mcornea@redhat.com)
      QA Contact: Zhanqi Zhao
