OCPBUGS-26072

Node in NotReady state because unified_cgroup_hierarchy=1 is set

Details

    • Important
    • No
    • False
    • 1/9: green, backport PR active

    Description

      This is a clone of issue OCPBUGS-19352. The following is the description of the original issue:

      Description of problem:

      In a bare-metal multinode OCP cluster, a node ends up in the NotReady state.
      
      On the node there are a couple of failed services:
      ● cpuset-configure.service         loaded failed failed Move services to reserved cpuset
      ● on-prem-resolv-prepender.service loaded failed failed Populates resolv.conf according to on-prem IPI needs
      
      journalctl --boot --no-pager -u cpuset-configure.service
      Sep 18 16:57:37 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Starting Move services to reserved cpuset...
      Sep 18 16:57:37 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com cpuset-configure.sh[3014]: /usr/local/bin/cpuset-configure.sh: line 17: /sys/fs/cgroup/cpuset/cpuset.sched_load_balance: Read-only file system
      Sep 18 16:57:38 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: cpuset-configure.service: Main process exited, code=exited, status=1/FAILURE
      Sep 18 16:57:38 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: cpuset-configure.service: Failed with result 'exit-code'.
      Sep 18 16:57:38 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Move services to reserved cpuset.
      
      Sep 18 16:57:52 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Populates resolv.conf according to on-prem IPI needs.
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Starting Populates resolv.conf according to on-prem IPI needs...
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4852]: nameserver 10.47.242.10
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4851]: NM resolv-prepender: Starting download of baremetal runtime cfg image
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:23012b3380ffce706aa8f204cdc26745d8a69b0218150ec3bcb495202694fdab...
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Getting image source signatures
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:916ead524b9e54b9d5534b65534253c02ce66f1d784e683389aa3c4cb4d12389
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:d8190195889efb5333eeec18af9b6c82313edd4db62989bd3a357caca4f13f0e
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:c71d2589fba7989ecd29ea120fe7add01fab70126fc653a863d5844e35ee5403
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:97da74cc6d8fa5d1634eb1760fd1da5c6048619c264c23e62d75f3bf6b8ef5c4
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:d4dc6e74b6ce09e24dc284cc1967451f3dda2d485bc92fc95d24d91f939e4849
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying config sha256:ba2c86ef11c4e341cd0870b6d5b7ad39aa39724389d9d2dfead4ea3d75582071
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Writing manifest to image destination
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Storing signatures
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: ba2c86ef11c4e341cd0870b6d5b7ad39aa39724389d9d2dfead4ea3d75582071
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4851]: NM resolv-prepender: Download of baremetal runtime cfg image completed
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4863]: Your kernel does not support pids limit capabilities or the cgroup is not mounted. PIDs limit discarded.
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4863]: Error: OCI runtime error: runc: runc create failed: mountpoint for devices not found
      Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: on-prem-resolv-prepender.service: Main process exited, code=exited, status=127/n/a
      
      Checking the cgroup configuration shows that the cluster is set to cgroup v1:
      oc describe node.config
      Name:         cluster
      Namespace:
      Labels:       <none>
      Annotations:  include.release.openshift.io/ibm-cloud-managed: true
                    include.release.openshift.io/self-managed-high-availability: true
                    include.release.openshift.io/single-node-developer: true
                    release.openshift.io/create-only: true
      API Version:  config.openshift.io/v1
      Kind:         Node
      Metadata:
        Creation Timestamp:  2023-09-18T15:27:44Z
        Generation:          3
        Owner References:
          API Version:     config.openshift.io/v1
          Kind:            ClusterVersion
          Name:            version
          UID:             c62da215-6526-4306-8fc6-035612c8605e
        Resource Version:  91518
        UID:               cf2189ba-cd69-45e9-868c-7c2589decb25
      Spec:
        Cgroup Mode:  v1
      Events:         <none> 
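
The title of this issue points at a mismatch between the kernel argument and the configured `Cgroup Mode: v1`. A minimal sketch of how that comparison could be automated follows. It assumes the argument appears on the kernel command line as `systemd.unified_cgroup_hierarchy=<bool>` and that the last occurrence wins; the function names and the sample command lines are hypothetical and not taken from this cluster.

```python
from typing import Optional

# Boolean spellings accepted here; an assumption modelled on systemd's parser.
TRUE_VALUES = {"1", "yes", "y", "true", "t", "on"}
FALSE_VALUES = {"0", "no", "n", "false", "f", "off"}

FLAG = "systemd.unified_cgroup_hierarchy"


def cgroup_mode_from_cmdline(cmdline: str) -> Optional[str]:
    """Return 'v2' or 'v1' as requested on the kernel command line.

    Returns None when the flag is absent or has no recognisable value.
    Assumption: if the flag is repeated, the last occurrence wins.
    """
    mode = None
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key != FLAG or not sep:
            continue
        value = value.lower()
        if value in TRUE_VALUES:
            mode = "v2"
        elif value in FALSE_VALUES:
            mode = "v1"
    return mode


def is_mismatch(cmdline: str, configured_mode: str) -> bool:
    """True when the booted cgroup mode contradicts the configured one."""
    booted = cgroup_mode_from_cmdline(cmdline)
    return booted is not None and booted != configured_mode


if __name__ == "__main__":
    # Hypothetical command line; on a node it would be read from /proc/cmdline.
    sample = "BOOT_IMAGE=/vmlinuz ro systemd.unified_cgroup_hierarchy=1"
    print(is_mismatch(sample, "v1"))
```

On a node, the input string would typically come from `/proc/cmdline`, and the configured mode from the `Cgroup Mode` field shown above.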

       

      Version-Release number of selected component (if applicable):

      4.14.0-rc.1

      How reproducible:

      So far 100%

      Steps to Reproduce:

      1. Deploy a bare-metal multinode cluster with the GitOps-ZTP workflow
      

      Actual results:

      While all policies report the Compliant state, some configs are still being applied:
      
      oc get mcp
      NAME       CONFIG                                               UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
      ht100gb    rendered-ht100gb-572f5aef443a21b21a8c5cfe816708e2    False     True       False      2              0                   0                     0                      77m
      master     rendered-master-3c44ec28c389693028ad2cc6b74741ca     True      False      False      3              3                   3                     0                      103m
      standard   rendered-standard-1942568110455a377b735e15f18c7ba8   True      False      False      2              2                   2                     0                      77m
      worker     rendered-worker-033d4f0a2568efce241d02a2c54ab88e     True      False      False      0              0                   0                     0                      103m
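
To spot the stuck pool without reading the table by eye, the output above can be parsed with a few lines of Python. This is only a sketch: the helper name `pending_pools` is hypothetical, and it assumes every column holds a non-empty, whitespace-free value, as in the output shown.

```python
from typing import List

# Output of `oc get mcp` as reported above, columns separated by whitespace.
SAMPLE = """\
NAME       CONFIG                                               UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
ht100gb    rendered-ht100gb-572f5aef443a21b21a8c5cfe816708e2    False     True       False      2              0                   0                     0                      77m
master     rendered-master-3c44ec28c389693028ad2cc6b74741ca     True      False      False      3              3                   3                     0                      103m
standard   rendered-standard-1942568110455a377b735e15f18c7ba8   True      False      False      2              2                   2                     0                      77m
worker     rendered-worker-033d4f0a2568efce241d02a2c54ab88e     True      False      False      0              0                   0                     0                      103m
"""


def pending_pools(table: str) -> List[str]:
    """Return names of pools that are not updated or have unready machines."""
    lines = [ln for ln in table.splitlines() if ln.strip()]
    header = lines[0].split()
    pending = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        not_updated = row["UPDATED"] != "True"
        unready = int(row["READYMACHINECOUNT"]) < int(row["MACHINECOUNT"])
        if not_updated or unready:
            pending.append(row["NAME"])
    return pending


if __name__ == "__main__":
    print(pending_pools(SAMPLE))
```

For the table in this report, the only pool returned is `ht100gb`, which matches the pool whose two machines never become ready.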

      Expected results:

      All nodes are in Ready state

      Additional info:

       


            People

              aos-node@redhat.com Node Team Bot Account
              openshift-crt-jira-prow OpenShift Prow Bot
              Min Li Min Li