- Bug
- Resolution: Done
- Undefined
- None
- 4.16
- Important
- No
- False
- Release Note Not Required
- In Progress
-
Description of problem:
Creating a ContainerRuntimeConfig with defaultRuntime=crun while a PerformanceProfile exists on the cluster fails, and the affected node gets stuck in NotReady,SchedulingDisabled and never recovers. Note that enabling crun without a PerformanceProfile on the cluster completes successfully, without complications. Kubelet logs on the hanging node show various errors indicating that the container runtime is down (for the full log, please find the link in the first comment):

Jan 10 12:47:26.748835 ip-10-0-31-44 kubenswrapper[2528]: rpc error: code = Unknown desc = container create failed: writing file `pids.max`: Invalid argument
..
Jan 10 12:10:59.981387 ip-10-0-31-44 kubenswrapper[2192]: I0110 12:10:59.981374 2192 container_manager_linux.go:268] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Jan 10 12:11:00.162038 ip-10-0-31-44 kubenswrapper[2192]: W0110 12:11:00.128927 2192 watcher.go:93] Error while processing event ("/sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice": 0x40000100 == IN_CREATE|IN_ISDIR): readdirent /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice: no such file or directory
..
Jan 10 12:50:27.522468 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:27.522401 2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-dns/dns-default-nmbz9" podUID="d045aff3-9fde-49d7-be7c-af5003b73428"
Jan 10 12:50:27.522656 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:27.522576 2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-e2e-loki/loki-promtail-wqc74" podUID="02334c43-6773-4bdd-b6b8-25b9bc537e17"
Jan 10 12:50:28.521840 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:28.521732 2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-network-diagnostics/network-check-target-sdmp7" podUID="0b33815d-3c8a-4224-8ac8-4fa3c1d26add"
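For reference, one way to pull these logs from the stuck node (standard commands, not part of the original report; the node name is taken from the log excerpts above):

oc adm node-logs ip-10-0-31-44.ec2.internal -u kubelet
# or, with direct access to the node:
journalctl -u kubelet --no-pager | grep -E 'pids.max|NetworkPluginNotReady'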
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Install 4.16 OCP starting from 4.16.0-0.nightly-2024-01-03-193825.
2. Apply a PP with a minimal config:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: "1,3"
    reserved: "0,2"
  nodeSelector:
    node-role.kubernetes.io/worker: ""

3. Apply the crun config:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ''
  containerRuntimeConfig:
    logLevel: debug
    overlaySize: 8G
    defaultRuntime: "crun"
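After step 3, one way to watch the rollout and catch the failure (standard oc commands, not part of the original steps):

oc get mcp worker -w    # the worker MachineConfigPool starts updating and never finishes
oc get nodes -w         # the rebooted worker goes NotReady,SchedulingDisabled and stays there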
Actual results:
The node hangs at reboot and never recovers:
ip-10-0-31-44.ec2.internal   NotReady,SchedulingDisabled   worker   110m   v1.29.0+e58658f
Expected results:
Applying the runtime config should complete successfully.
Additional info:
1. This reproduces on BM as well as on VM worker nodes.
2. Applying the same runtime config on a cluster without a PP encountered no issues. A must-gather and the kubelet logs from the hanging node will be provided in a comment as a gdrive link.
3. It only affects 4.16.
4. It was reported that when the PP has cgroupv2 enabled and the crun config is applied, there are no complications. PP with cgroupv2 example:

[root@cnfdr32 ~]# oc get performanceprofile/performance -o yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    performance.openshift.io/ignore-cgroups-version: "true"
  creationTimestamp: "2024-01-10T12:10:01Z"
  finalizers:
  - foreground-deletion
  generation: 36
  name: performance
  resourceVersion: "77140"
  uid: 1d6d8735-9579-4b0d-89b7-ae250aa72c25
spec:
  cpu:
    isolated: 4,6,8,10,12,14,16,18,20,22,9,11,13,15,17,19,21,23,5,7
    reserved: 0,2,1,3
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: true
  workloadHints:
    realTime: true
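A quick way to confirm which cgroup version a node is actually running (standard commands, not from the original report): stat reports cgroup2fs when /sys/fs/cgroup is the cgroup v2 unified hierarchy, and tmpfs on a cgroup v1 node.

oc debug node/ip-10-0-31-44.ec2.internal -- chroot /host stat -fc %T /sys/fs/cgroup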