  1. OpenShift Bugs
  2. OCPBUGS-2637

[ARM64][4.11.0+] Containers are stuck in CreateError with 'error loading seccomp filter: errno 524'


    • Critical
    • None
    • Rejected
    • False

      [adistefa: Updated description with the ongoing investigation outcomes. The old description is being kept below]

      Version-Release number of selected component (if applicable):

      4.11.z-aarch64, for any z
      

      Scenario/How reproducible

      The issue is always reproducible in the following scenario:
      
      - 3 masters m6g.xlarge 
      - 2 workers 
      - 1 tainted worker with either m6g.xlarge, m6g.2xlarge, or m6g.4xlarge as instanceType. 
      - The attached payload, consisting of a namespace, an ImageStream, and a deployment whose pod runs 10 containers that sleep. 

       

      Steps to reproduce

      1. Set the nodeSelector for the requiredAffinity (and the tolerations, if taints are used) so that the pods land on a single worker. 
      2. oc apply -f deployment.yaml 
      3. oc project my-project
      4. oc scale deployment/my-deployment --replicas=45 # or more
      
      Adjust the replicas parameter so that the tainted worker reaches about 472 containers, regardless of the chosen instance type (in some runs I got more containers, but always within roughly ±10 of that number).
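The replica count needed to reach the threshold can be estimated with a little arithmetic. The following is a minimal sketch, not part of the original report: the function name `replicas_for_target` and the baseline of 22 containers already running on the node are hypothetical; only the 10 containers per pod and the threshold of about 472 come from the description above.

```shell
#!/usr/bin/env bash
# Estimate how many replicas are needed for a node to reach a target
# container count.
#   $1 = target number of containers on the node (about 472 in this report)
#   $2 = containers per pod (10 in the attached deployment)
#   $3 = containers already running on the node (hypothetical; defaults to 0)
replicas_for_target() {
  local target=$1 per_pod=$2 baseline=${3:-0}
  local remaining=$(( target - baseline ))
  if (( remaining <= 0 )); then
    echo 0
    return
  fi
  # Integer ceiling of remaining / per_pod.
  echo $(( (remaining + per_pod - 1) / per_pod ))
}

# Example with an assumed baseline of 22 pre-existing containers.
replicas_for_target 472 10 22
```

With these assumed inputs the estimate is 45 replicas, which is in line with the scale command in step 4. The real baseline depends on what else is scheduled on the worker.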
      
      You can look at the total number of containers with:
      
      oc debug node/my-worker
      chroot /host
      watch 'echo $(( $(crictl ps | wc -l) - 1 )) - $(find /var/run/crio -type l ! -readable | wc -l)'
      
      The output shows the number of containers (left) and the number of broken links (right). Once the total number of containers on the node exceeds 472 (±10 in my tests), the number of broken links starts to increase linearly. This appears to be a symptom of the issue.
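For a one-shot reading instead of a `watch` loop, the two counters can be wrapped in small helpers. This is a sketch under assumptions: the function names are hypothetical, `count_containers` expects `crictl ps`-style output (one header line, then one row per container) on stdin, and `count_broken_links` relies on the GNU find `-readable` test, as the original command does.

```shell
#!/usr/bin/env bash
# Count container rows in `crictl ps`-style output read from stdin.
# The first line is assumed to be the column header and is not counted.
count_containers() {
  local lines
  lines=$(wc -l)
  echo $(( lines > 0 ? lines - 1 : 0 ))
}

# Count symbolic links under directory $1 that cannot be read,
# which includes links whose target no longer exists.
count_broken_links() {
  find "$1" -type l ! -readable 2>/dev/null | wc -l
}

# On the node (after `chroot /host`) these could be combined as:
#   echo "$(crictl ps | count_containers) - $(count_broken_links /var/run/crio)"
```

Taking the directory as a parameter makes it possible to try the link counter on a scratch directory before running it against `/var/run/crio`.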

       

      oc debug node/my-worker
      chroot /host
      watch 'echo $(( $(crictl ps | wc -l) - 1 ))'

      The node's journal and the events for the failed-to-create containers' pods report:

      Error: container create failed: 
      time="2022-11-11T20:30:20Z" level=error msg="runc create failed: unable 
      to start container process: unable to init seccomp: error loading 
      seccomp filter into kernel: error loading seccomp filter: errno 524"
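When scanning a journal or event dump for this failure, it can help to pull the errno value out of each message. The sketch below is illustrative only: the function names are hypothetical, and the lookup covers a single value. Errno 524 is not a regular userspace errno; in the Linux kernel sources it is defined as `ENOTSUPP` ("Operation is not supported"), which is why tools print it as a bare number.

```shell
#!/usr/bin/env bash
# Print the number that follows "errno " on each matching line of stdin.
extract_errno() {
  sed -n 's/.*errno \([0-9][0-9]*\).*/\1/p'
}

# Map an errno number to a short description.
# Only the value seen in this report is covered by this sketch.
describe_errno() {
  case "$1" in
    524) echo "ENOTSUPP (kernel-internal: operation is not supported)" ;;
    *)   echo "not covered by this sketch" ;;
  esac
}

# Example using the tail of the message reported above.
msg='seccomp filter into kernel: error loading seccomp filter: errno 524"'
describe_errno "$(printf '%s\n' "$msg" | extract_errno)"
```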
       

       

      [[ OLD DESCRIPTION ]]

       

      Description of problem:

      When updating a 4.11 arm64 cluster (05_aarch64_IPI on AWS, private cluster, FIPS on, OVN, etcd encryption) to 4.12, the image-registry pods failed to start with the error "runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524", which blocked the upgrade process.
      10-19 22:16:00.802      Message:               Available: The registry has minimum availability
      10-19 22:16:00.802  NodeCADaemonAvailable: The daemon set node-ca has available replicas
      10-19 22:16:00.802  ImagePrunerAvailable: Pruner CronJob has been created
      10-19 22:16:00.802      Reason:                MinimumAvailability
      10-19 22:16:00.802      Status:                True
      10-19 22:16:00.802      Type:                  Available
      10-19 22:16:00.802      Last Transition Time:  2022-10-19T13:35:48Z
      10-19 22:16:00.802      Message:               Progressing: The deployment has not completed
      10-19 22:16:00.802  NodeCADaemonProgressing: The daemon set node-ca is deployed
      10-19 22:16:00.802      Reason:                DeploymentNotCompleted
      10-19 22:16:00.802      Status:                True
      10-19 22:16:00.802      Type:                  Progressing
      10-19 22:16:00.802      Last Transition Time:  2022-10-19T13:37:48Z
      10-19 22:16:00.802      Message:               Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-658cd9b654" has timed out progressing.
      10-19 22:16:00.802      Reason:                ProgressDeadlineExceeded
      10-19 22:16:00.802      Status:                True
      10-19 22:16:00.802      Type:                  Degraded
      10-19 22:16:00.802    Extension:               <nil>
      10-19 22:16:03.025 38m Warning Failed pod/image-registry-658cd9b654-fcnrh Error: container create failed: time="2022-10-19T13:37:41Z" level=error msg="runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"
      10-19 22:16:03.025 15m Warning Failed pod/image-registry-658cd9b654-fcnrh (combined from similar events): Error: container create failed: time="2022-10-19T14:00:50Z" level=error msg="runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"

      Version-Release number of selected component (if applicable):

      4.11.0-0.nightly-arm64-2022-10-19-063757 to 4.12.0-0.nightly-arm64-2022-10-18-153953

      How reproducible:

      Not always

      Steps to Reproduce:

      1. Upgrade a 4.11.0-0.nightly-arm64-2022-10-19-063757 cluster to 4.12.0-0.nightly-arm64-2022-10-18-153953.
      

      Actual results:

      Image registry pods failed to start on 4.12

      Expected results:

      Image registry should upgrade successfully

      Additional info:

      must-gather log: https://drive.google.com/file/d/1SAC82YC-g7s8OiqnBMptf4DVyp6YsEKw/view?usp=sharing 


        1. create_container_failure_kubelet.log
          104 kB
          Alessandro Di Stefano
        2. deployment-1.yaml
          3 kB
          Alessandro Di Stefano

            jeffdyoung Jeff Young
            rh-ee-xiuwang XiuJuan Wang
            Alessandro Di Stefano Alessandro Di Stefano