OCPBUGS-26589

4.16 cluster fails to apply crun containerRuntimeConfig when performance profile exists


      Description of problem:

          Creating a ContainerRuntimeConfig with defaultRuntime=crun while a PerformanceProfile exists on the cluster fails, and the respective node gets stuck in NotReady,SchedulingDisabled and never recovers.
      
      Note that enabling crun on a cluster without a PP succeeds without complications.
      
      Kubelet logs on the hanging node show various errors indicating that the container runtime is down (for the full log, please find the link in the first comment):
      
      Jan 10 12:47:26.748835 ip-10-0-31-44 kubenswrapper[2528]:         rpc error: code = Unknown desc = container create failed: writing file `pids.max`: Invalid argument
      ..
      Jan 10 12:10:59.981387 ip-10-0-31-44 kubenswrapper[2192]: I0110 12:10:59.981374    2192 container_manager_linux.go:268] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
      Jan 10 12:11:00.162038 ip-10-0-31-44 kubenswrapper[2192]: W0110 12:11:00.128927    2192 watcher.go:93] Error while processing event ("/sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice": 0x40000100 == IN_CREATE|IN_ISDIR): readdirent /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice: no such file or directory
      ..
      
      Jan 10 12:50:27.522468 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:27.522401    2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-dns/dns-default-nmbz9" podUID="d045aff3-9fde-49d7-be7c-af5003b73428"
      Jan 10 12:50:27.522656 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:27.522576    2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-e2e-loki/loki-promtail-wqc74" podUID="02334c43-6773-4bdd-b6b8-25b9bc537e17"
      Jan 10 12:50:28.521840 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:28.521732    2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-network-diagnostics/network-check-target-sdmp7" podUID="0b33815d-3c8a-4224-8ac8-4fa3c1d26add"
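
      For reference, assuming the stuck node still accepts a debug pod, the kubelet and CRI-O journals above can be collected with standard oc debug commands, e.g.:
      
      oc debug node/ip-10-0-31-44.ec2.internal -- chroot /host journalctl -u kubelet --no-pager
      oc debug node/ip-10-0-31-44.ec2.internal -- chroot /host journalctl -u crio --no-pager
      
      (If the debug pod cannot be scheduled on the NotReady node, the same journalctl commands can be run over SSH on the host instead.)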

       

      Version-Release number of selected component (if applicable):

          4.16

      How reproducible:

          always

      Steps to Reproduce:

          1. Install 4.16 OCP starting from 4.16.0-0.nightly-2024-01-03-193825.
          2. Apply a PP with a minimal config:
      
      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        name: manual
      spec:
        cpu:
          isolated: "1,3"
          reserved: "0,2"
        nodeSelector:
          node-role.kubernetes.io/worker: ""
      
          3. Apply the crun config (see the rollout-watch commands after the config below):
      
      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: enable-crun
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ''
        containerRuntimeConfig:
          logLevel: debug 
          overlaySize: 8G 
          defaultRuntime: "crun"
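
      Once both configs are applied, the worker MachineConfigPool rollout and node state can be watched with, for example:
      
      oc get mcp worker -w
      oc get nodes -l node-role.kubernetes.io/worker
      oc get containerruntimeconfig enable-crun -o yaml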
      
      

      Actual results:

          The node hangs at reboot and never recovers:
      ip-10-0-31-44.ec2.internal     NotReady,SchedulingDisabled   worker                 110m   v1.29.0+e58658f
      

      Expected results:

          Applying the runtime config should complete successfully.

      Additional info:

         0. This reproduces on BM as well as VM worker nodes.
         1. Applying the same runtime config on a cluster without a PP encountered no issues.
      A must-gather and kubelet logs from the hanging node will be provided in a comment as a Google Drive link.
         2. It only affects 4.16.
         3. It was reported that when the PP has cgroup v2 enabled and the crun config is applied, there were no complications. Example of a PP with cgroup v2 (a quick way to check the node's cgroup version is sketched after it):
      
      [root@cnfdr32 ~]# oc get performanceprofile/performance -o yaml
      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        annotations:
          performance.openshift.io/ignore-cgroups-version: "true"
        creationTimestamp: "2024-01-10T12:10:01Z"
        finalizers:
        - foreground-deletion
        generation: 36
        name: performance
        resourceVersion: "77140"
        uid: 1d6d8735-9579-4b0d-89b7-ae250aa72c25
      spec:
        cpu:
          isolated: 4,6,8,10,12,14,16,18,20,22,9,11,13,15,17,19,21,23,5,7
          reserved: 0,2,1,3
        hugepages:
          defaultHugepagesSize: 1G
          pages:
          - count: 20
            size: 2M
        machineConfigPoolSelector:
          machineconfiguration.openshift.io/role: worker-cnf
        nodeSelector:
          node-role.kubernetes.io/worker-cnf: ""
        numa:
          topologyPolicy: single-numa-node
        realTimeKernel:
          enabled: true
        workloadHints:
          realTime: true
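
      To confirm which cgroup version a given worker is actually running, something like the following can be used (cgroup2fs indicates cgroup v2, tmpfs indicates cgroup v1):
      
      oc debug node/ip-10-0-31-44.ec2.internal -- chroot /host stat -fc %T /sys/fs/cgroup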
