- Bug
- Resolution: Done
- Undefined
- None
- 4.16
- Important
- No
- False
- Release Note Not Required
- In Progress
-
Description of problem:
Creating a ContainerRuntimeConfig with defaultRuntime=crun while a PerformanceProfile exists on the cluster fails, and the affected node gets stuck in NotReady,SchedulingDisabled and never recovers. Note that enabling crun without a PerformanceProfile on the cluster completes successfully, without complications. Kubelet logs on the hanging node show various errors indicating that the container runtime is down (for the full log, please find the link in the first comment):

Jan 10 12:47:26.748835 ip-10-0-31-44 kubenswrapper[2528]: rpc error: code = Unknown desc = container create failed: writing file `pids.max`: Invalid argument
..
Jan 10 12:10:59.981387 ip-10-0-31-44 kubenswrapper[2192]: I0110 12:10:59.981374 2192 container_manager_linux.go:268] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Jan 10 12:11:00.162038 ip-10-0-31-44 kubenswrapper[2192]: W0110 12:11:00.128927 2192 watcher.go:93] Error while processing event ("/sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice": 0x40000100 == IN_CREATE|IN_ISDIR): readdirent /sys/fs/cgroup/devices/kubepods.slice/kubepods-besteffort.slice: no such file or directory
..
Jan 10 12:50:27.522468 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:27.522401 2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-dns/dns-default-nmbz9" podUID="d045aff3-9fde-49d7-be7c-af5003b73428"
Jan 10 12:50:27.522656 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:27.522576 2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-e2e-loki/loki-promtail-wqc74" podUID="02334c43-6773-4bdd-b6b8-25b9bc537e17"
Jan 10 12:50:28.521840 ip-10-0-31-44 kubenswrapper[2528]: E0110 12:50:28.521732 2528 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?" pod="openshift-network-diagnostics/network-check-target-sdmp7" podUID="0b33815d-3c8a-4224-8ac8-4fa3c1d26add"
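For reference, one way to pull these logs from the stuck node (standard commands, not part of the original report; the node name is taken from the log excerpts above):

oc adm node-logs ip-10-0-31-44.ec2.internal -u kubelet
# or, with direct access to the node:
journalctl -u kubelet --no-pager | grep -E 'pids.max|NetworkPluginNotReady'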
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Install 4.16 OCP starting from 4.16.0-0.nightly-2024-01-03-193825.
2. Apply a PP with a minimal config:

apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: manual
spec:
  cpu:
    isolated: "1,3"
    reserved: "0,2"
  nodeSelector:
    node-role.kubernetes.io/worker: ""

3. Apply the crun config:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: enable-crun
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ''
  containerRuntimeConfig:
    logLevel: debug
    overlaySize: 8G
    defaultRuntime: "crun"
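After step 3, one way to watch the rollout and catch the failure (standard oc commands, not part of the original steps):

oc get mcp worker -w    # the worker MachineConfigPool starts updating and never finishes
oc get nodes -w         # the rebooted worker goes NotReady,SchedulingDisabled and stays there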
Actual results:
The node hangs at reboot and never recovers:
ip-10-0-31-44.ec2.internal   NotReady,SchedulingDisabled   worker   110m   v1.29.0+e58658f
Expected results:
Applying the runtime config should complete successfully.
Additional info:
1. This reproduces on BM as well as on VM worker nodes.
2. Applying the same runtime config on a cluster without a PP encountered no issues. A must-gather and the kubelet logs from the hanging node will be provided in a comment as a gdrive link.
3. It only affects 4.16.
4. It was reported that when the PP has cgroupv2 enabled and the crun config is applied, there are no complications. PP with cgroupv2 example:

[root@cnfdr32 ~]# oc get performanceprofile/performance -o yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  annotations:
    performance.openshift.io/ignore-cgroups-version: "true"
  creationTimestamp: "2024-01-10T12:10:01Z"
  finalizers:
  - foreground-deletion
  generation: 36
  name: performance
  resourceVersion: "77140"
  uid: 1d6d8735-9579-4b0d-89b7-ae250aa72c25
spec:
  cpu:
    isolated: 4,6,8,10,12,14,16,18,20,22,9,11,13,15,17,19,21,23,5,7
    reserved: 0,2,1,3
  hugepages:
    defaultHugepagesSize: 1G
    pages:
    - count: 20
      size: 2M
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker-cnf
  nodeSelector:
    node-role.kubernetes.io/worker-cnf: ""
  numa:
    topologyPolicy: single-numa-node
  realTimeKernel:
    enabled: true
  workloadHints:
    realTime: true
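A quick way to confirm which cgroup version a node is actually running (standard commands, not from the original report): stat reports cgroup2fs when /sys/fs/cgroup is the cgroup v2 unified hierarchy, and tmpfs on a cgroup v1 node.

oc debug node/ip-10-0-31-44.ec2.internal -- chroot /host stat -fc %T /sys/fs/cgroup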