Bug
Resolution: Cannot Reproduce
4.14
Quality / Stability / Reliability
Low
Description of problem:
When a new SriovNetworkNodePolicy is created, the SNO node gets drained and all existing pods are restarted. The pods eventually recover, but some cluster operators and user workloads become unavailable while their pods are being terminated.
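The availability impact can be observed with a standard watch on the cluster operators during the drain; a minimal sketch (the exact checks used for this report are not recorded here):

  # Watch cluster operator availability while the node is being drained
  oc get clusteroperators -w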
Version-Release number of selected component (if applicable):
4.14.0-rc.0 sriov-network-operator.v4.14.0-202308242104
How reproducible:
100%
Steps to Reproduce:
1. On an SNO, create a SriovNetworkNodePolicy:

   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: snnp1
     namespace: openshift-sriov-network-operator
   spec:
     deviceType: vfio-pci
     isRdma: false
     nicSelector:
       pfNames:
       - ens2f3#32-33
     nodeSelector:
       node-role.kubernetes.io/master: ""
     numVfs: 48
     priority: 99
     resourceName: snnp1

2. Check the sriov-network-config-daemon logs and the status of the other pods (example commands below).
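For step 2, commands along these lines can be used (the daemonset name and namespace follow the operator defaults; adjust if the deployment differs):

  # Follow the sriov-network-config-daemon logs (single pod on an SNO)
  oc -n openshift-sriov-network-operator logs -f ds/sriov-network-config-daemon

  # Watch pod status across all namespaces while the policy is applied
  oc get pods -A -w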
Actual results:
The logs show that the node gets drained and all pods except DaemonSet-managed pods are evicted and restarted:

I0911 14:07:10.657178 22691 daemon.go:804] annotateNode(): Annotate node sno.kni-qe-24.lab.eng.rdu2.redhat.com with: Draining
I0911 14:07:10.680706 22691 daemon.go:599] nodeStateSyncHandler(): pause MCP
I0911 14:07:10.690525 22691 daemon.go:916] pauseMCP(): pausing MCP
I0911 14:07:10.737330 22691 daemon.go:952] pauseMCP(): pause MCP master
I0911 14:07:10.773584 22691 daemon.go:804] annotateNode(): Annotate node sno.kni-qe-24.lab.eng.rdu2.redhat.com with: Draining_MCP_Paused
I0911 14:07:10.829568 22691 daemon.go:608] nodeStateSyncHandler(): drain node
I0911 14:07:10.829828 22691 daemon.go:1004] drainNode(): Update prepared
I0911 14:07:10.829862 22691 daemon.go:1014] drainNode(): Start draining
E0911 14:07:13.663558 22691 daemon.go:137] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-r9xrc, openshift-dns/dns-default-wrr86, openshift-dns/node-resolver-4tj87, openshift-image-registry/node-ca-wphrb, openshift-ingress-canary/ingress-canary-nf7p9, openshift-local-storage/diskmaker-manager-44hcj, openshift-logging/collector-25gbm, openshift-machine-config-operator/machine-config-daemon-hrlhx, openshift-machine-config-operator/machine-config-server-v5pmr, openshift-monitoring/node-exporter-s6m5q, openshift-multus/multus-additional-cni-plugins-7pjff, openshift-multus/multus-networkpolicy-gkwwp, openshift-multus/multus-w525k, openshift-multus/network-metrics-daemon-l65kc, openshift-ovn-kubernetes/ovnkube-node-qw8fz, openshift-ptp/linuxptp-daemon-hsrz4, openshift-sriov-network-operator/network-resources-injector-h7lk8, openshift-sriov-network-operator/operator-webhook-zmbpq, openshift-sriov-network-operator/sriov-device-plugin-vnz49, openshift-sriov-network-operator/sriov-network-config-daemon-xjvwj, vran-acceleration-operators/accelerator-discovery-bl88c, vran-acceleration-operators/sriov-device-plugin-zhtxj, vran-acceleration-operators/sriov-fec-daemonset-gs2bn
I0911 14:07:13.665220 22691 daemon.go:137] evicting pod openshift-apiserver-operator/openshift-apiserver-operator-869988898d-pxxdg
I0911 14:07:13.665302 22691 daemon.go:137] evicting pod openshift-cluster-node-tuning-operator/cluster-node-tuning-operator-57f8dbf5f-q5p6n
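The node's drain state can also be checked through its annotation; the annotation key below (sriovnetwork.openshift.io/state) is an assumption inferred from the annotateNode() messages above:

  # Print the node state annotation set by the config daemon (key assumed)
  oc get node sno.kni-qe-24.lab.eng.rdu2.redhat.com \
    -o jsonpath='{.metadata.annotations.sriovnetwork\.openshift\.io/state}'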
Expected results:
Creating a new SriovNetworkNodePolicy on the SNO node does not cause other pods to be restarted.
Additional info:
This appears to be a regression introduced in 4.14; the same scenario does not reproduce on 4.13. A must-gather is attached.