OpenShift Bugs / OCPBUGS-9258

Pod fails to become ready. crio fails to change cgroup of probes

    • Important
    • OCPNODE Sprint 233 (Blue), OCPNODE Sprint 234 (Blue), OCPNODE Sprint 235 (Blue), OCPNODE Sprint 236 (Blue), OCPNODE Sprint 237 (Green)
    • 5
    • Rejected
    • Unspecified
    • Customer Escalated
      9/21: just seeking to close on this being in 4.14 then move on
      8/14: pending input/response from field re: 4.12 (DM/PP); KNIECO-7503

      Description of problem:
      The OLM registry-server container fails to reach the "Ready" state.

      # oc get pod -n openshift-marketplace
      NAME                                    READY   STATUS    RESTARTS      AGE
      marketplace-operator-7749b7db8d-br5p8   1/1     Running   4 (27h ago)   28h
      rh-du-operators-ml5pl                   0/1     Running   0             27h

      Conditions on the pod show:
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2022-05-05T15:56:21Z"
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2022-05-05T15:56:21Z"
          message: 'containers with unready status: [registry-server]'
          reason: ContainersNotReady
          status: "False"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2022-05-05T15:56:21Z"
          message: 'containers with unready status: [registry-server]'
          reason: ContainersNotReady
          status: "False"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2022-05-05T15:56:21Z"
          status: "True"
          type: PodScheduled
        containerStatuses:
        - containerID: cri-o://bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c
          image: e24-h01-000-r640.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index:v4.9
          imageID: e24-h01-000-r640.rdu2.scalelab.redhat.com:5000/olm-mirror/redhat-operator-index@sha256:86efa7af19dfaa7afe0f3469250ad6101c4eed44c7366e3628e7e865834dc43e
          lastState: {}
          name: registry-server
          ready: false
          restartCount: 0
          started: true
          state:
            running:
              startedAt: "2022-05-05T15:56:36Z"
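
      The signature in this status is a container that is started and running but never becomes ready. When triaging many clusters, that check could be scripted. The sketch below is a minimal example, assuming the pod is available as the JSON that `oc get pod -o json` produces; the function name `stuck_containers` and the trimmed sample dictionary are illustrative only.

```python
def stuck_containers(pod):
    """Return names of containers that are started and running but not ready."""
    stuck = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        is_running = "running" in status.get("state", {})
        if status.get("started") and is_running and not status.get("ready"):
            stuck.append(status["name"])
    return stuck


# Trimmed copy of the containerStatuses reported above.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "registry-server",
                "ready": False,
                "restartCount": 0,
                "started": True,
                "state": {"running": {"startedAt": "2022-05-05T15:56:36Z"}},
            }
        ]
    }
}

print(stuck_containers(sample_pod))
```

      A container in CrashLoopBackOff would not match, because its state is `waiting` rather than `running`; that keeps the check specific to the probe failure described here.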

      Journal logs on the node show that the readiness and liveness probe PIDs cannot be added to the container's cgroup.procs file:
      May 06 19:06:41 sno00251 bash[26314]: E0506 19:06:41.901014 26314 remote_runtime.go:704] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = command error: time=\"2022-05-06T19:06:41Z\" level=error msg=\"exec failed: unable to start container process: error adding pid 3616061 to cgroups: failed to write 3616061: open /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice/crio-bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c.scope/cgroup.procs: no such file or directory\"\n, stdout: , stderr: , exit code -1" containerID="bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c" cmd=[grpc_health_probe -addr=:50051]

      May 06 19:06:41 sno00251 bash[26314]: E0506 19:06:41.901135 26314 prober.go:118] "Probe errored" err="rpc error: code = Unknown desc = command error: time=\"2022-05-06T19:06:41Z\" level=error msg=\"exec failed: unable to start container process: error adding pid 3616061 to cgroups: failed to write 3616061: open /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice/crio-bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c.scope/cgroup.procs: no such file or directory\"\n, stdout: , stderr: , exit code -1" probeType="Liveness" pod="openshift-marketplace/rh-du-operators-ml5pl" podUID=e1d577c2-09dc-4859-aada-0a157e0b07f0 containerName="registry-server"

      May 06 19:06:41 sno00251 bash[26314]: E0506 19:06:41.907801 26314 remote_runtime.go:704] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = command error: time=\"2022-05-06T19:06:41Z\" level=error msg=\"exec failed: unable to start container process: error adding pid 3616065 to cgroups: failed to write 3616065: open /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice/crio-bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c.scope/cgroup.procs: no such file or directory\"\n, stdout: , stderr: , exit code -1" containerID="bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c" cmd=[grpc_health_probe -addr=:50051]

      May 06 19:06:41 sno00251 bash[26314]: E0506 19:06:41.907938 26314 prober.go:118] "Probe errored" err="rpc error: code = Unknown desc = command error: time=\"2022-05-06T19:06:41Z\" level=error msg=\"exec failed: unable to start container process: error adding pid 3616065 to cgroups: failed to write 3616065: open /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice/crio-bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c.scope/cgroup.procs: no such file or directory\"\n, stdout: , stderr: , exit code -1" probeType="Readiness" pod="openshift-marketplace/rh-du-operators-ml5pl" podUID=e1d577c2-09dc-4859-aada-0a157e0b07f0 containerName="registry-server"
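
      To find this signature in a large set of node journals, the PID, the missing cgroup.procs path, and the probe type can be pulled out of each line. The following is a rough sketch, assuming the message wording stays exactly as logged above; `parse_probe_failure` is a hypothetical helper, not part of the kubelet or CRI-O.

```python
import re

# Matches the error text logged when an exec probe cannot join the cgroup.
CGROUP_ERROR = re.compile(
    r"error adding pid (?P<pid>\d+) to cgroups: "
    r"failed to write \d+: open (?P<path>\S+): no such file or directory"
)
PROBE_TYPE = re.compile(r'probeType="(?P<probe>\w+)"')


def parse_probe_failure(line):
    """Return pid, cgroup.procs path and probe type, or None if no match."""
    error = CGROUP_ERROR.search(line)
    if error is None:
        return None
    probe = PROBE_TYPE.search(line)
    return {
        "pid": int(error.group("pid")),
        "cgroup_procs": error.group("path"),
        "probe_type": probe.group("probe") if probe else None,
    }


# Sample rebuilt from the last journal line above. The leading part of the
# line (timestamp, host, rpc wrapper) is left out because it is not parsed.
POD_SLICE = (
    "/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/"
    "kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice"
)
CONTAINER_ID = "bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c"
sample_line = (
    "error adding pid 3616065 to cgroups: failed to write 3616065: open "
    f"{POD_SLICE}/crio-{CONTAINER_ID}.scope/cgroup.procs: no such file or directory"
    '\\"\\n, stdout: , stderr: , exit code -1" probeType="Readiness" '
    'pod="openshift-marketplace/rh-du-operators-ml5pl"'
)

result = parse_probe_failure(sample_line)
print(result["pid"], result["probe_type"])
print(result["cgroup_procs"])
```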

      The crio-bff4c3... scope directory does not exist; only the crio-conmon scope is present under the pod slice:
      [root@sno00251 core]# ls -l /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice/crio-bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c.scope/
      ls: cannot access '/sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice/crio-bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c.scope/': No such file or directory

      [root@sno00251 core]# ls -l /sys/fs/cgroup/systemd/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pode1d577c2_09dc_4859_aada_0a157e0b07f0.slice/
      total 0
      -rw-r--r--. 1 root root 0 May 6 19:07 cgroup.clone_children
      -rw-r--r--. 1 root root 0 May 6 19:07 cgroup.procs
      drwxr-xr-x. 2 root root 0 May 5 15:56 crio-conmon-bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c.scope
      -rw-r--r--. 1 root root 0 May 6 19:07 notify_on_release
      -rw-r--r--. 1 root root 0 May 6 19:07 tasks
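
      The listing shows the crio-conmon scope without a matching container scope. A node-side check for that condition could look like the sketch below. It assumes the `crio-<id>.scope` and `crio-conmon-<id>.scope` naming seen in this report; the function names are hypothetical, and the demo uses a temporary directory as a stand-in for the real pod slice under /sys/fs/cgroup.

```python
import tempfile
from pathlib import Path


def scope_status(pod_slice_dir, container_id):
    """Report which of the two per-container scope directories exist."""
    base = Path(pod_slice_dir)
    return {
        "container_scope": (base / f"crio-{container_id}.scope").is_dir(),
        "conmon_scope": (base / f"crio-conmon-{container_id}.scope").is_dir(),
    }


def matches_signature(pod_slice_dir, container_id):
    """True when conmon's scope exists but the container's scope is missing."""
    status = scope_status(pod_slice_dir, container_id)
    return status["conmon_scope"] and not status["container_scope"]


# Demo: recreate the layout from the listing above in a temporary directory.
CONTAINER_ID = "bff4c347d3fc6a20064926fdfd1ea3c76e039c56205c6b282d3b6c8e2f13233c"
with tempfile.TemporaryDirectory() as fake_slice:
    (Path(fake_slice) / f"crio-conmon-{CONTAINER_ID}.scope").mkdir()
    demo_result = matches_signature(fake_slice, CONTAINER_ID)

print(demo_result)
```

      On an affected node the first argument would be the pod slice path shown in the `ls` output above.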

      Version-Release number of selected component (if applicable): 4.10.13

      How reproducible: 6 out of ~2200 clusters deployed in scale testing have this signature.

      Steps to Reproduce:
      The OLM registry-server pod is created by automated (rapid) manipulation of the CatalogSources:
      1. Disable default sources in OperatorHub CR
      2. Create new CatalogSource pointing to disconnected registry
      3. Create subscriptions making use of the new CatalogSource
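
      The three steps above can be sketched as manifests generated from a script. This is only an illustration of their shape: the catalog name is inferred from the pod name in this report, the index image is the one shown in the pod status, and the package name, channel, and Subscription namespace are placeholders that do not come from the report.

```python
import json


def reproduction_manifests(catalog_name, index_image, package, channel):
    """Build the OperatorHub, CatalogSource and Subscription objects."""
    marketplace = "openshift-marketplace"
    operator_hub = {
        "apiVersion": "config.openshift.io/v1",
        "kind": "OperatorHub",
        "metadata": {"name": "cluster"},
        # Step 1: disable the default catalog sources.
        "spec": {"disableAllDefaultSources": True},
    }
    catalog_source = {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "CatalogSource",
        "metadata": {"name": catalog_name, "namespace": marketplace},
        # Step 2: point at the index image in the disconnected registry.
        "spec": {"sourceType": "grpc", "image": index_image},
    }
    subscription = {
        "apiVersion": "operators.coreos.com/v1alpha1",
        "kind": "Subscription",
        # Namespace is an assumption; the report does not state it.
        "metadata": {"name": package, "namespace": "openshift-operators"},
        # Step 3: subscribe using the new CatalogSource.
        "spec": {
            "name": package,
            "channel": channel,
            "source": catalog_name,
            "sourceNamespace": marketplace,
        },
    }
    return [operator_hub, catalog_source, subscription]


manifests = reproduction_manifests(
    catalog_name="rh-du-operators",  # inferred from the pod name prefix
    index_image=(
        "e24-h01-000-r640.rdu2.scalelab.redhat.com:5000/"
        "olm-mirror/redhat-operator-index:v4.9"
    ),
    package="example-operator",  # placeholder package name
    channel="stable",  # placeholder channel
)
print(json.dumps(manifests, indent=2))
```

      The printed JSON could be applied in order with `oc apply -f -`; applying the objects in quick succession is what the report describes as rapid manipulation.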

      Actual results: CatalogSource remains in "TRANSIENT_FAILURE" state.

      Expected results: CatalogSource becomes ready.

      Additional info:

            kolyshkin Kirill Kolyshkin
            rhn-support-imiller Ian Miller
            Sunil Choudhary Sunil Choudhary
            Red Hat Employee
            Kirill Kolyshkin
            Harshal Patil, Peter Hunt