Type: Bug
Resolution: Duplicate
Priority: Critical
Fix Version/s: None
Affects Version/s: 4.12, 4.11
Component/s: Multi-Arch / ARM
Labels:

Severity: Critical
Regression: None
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

[adistefa: Updated description with the ongoing investigation outcomes. The old description is being kept below]

Version-Release number of selected component (if applicable):

4.11.z-aarch64, for any z

Scenario/How reproducible

The issue is always reproducible in the following scenario:

- 3 masters m6g.xlarge 
- 2 workers 
- 1 tainted worker with either m6g.xlarge, m6g.2xlarge, or m6g.4xlarge as instanceType. 
- The attached payload, consisting of a namespace, an ImageStream, and a deployment whose pod has 10 containers that sleep.

Steps to reproduce

1. Set the nodeSelector for the requiredAffinity (and tolerations, if taints are used) so that the pods land on a single worker.
2. oc apply -f deployment.yaml 
3. oc project my-project
4. oc scale deployment/my-deployment --replicas=45 # or more

Adjust the replicas value so that the tainted worker reaches about 472 containers. The threshold was the same regardless of the chosen instance type; in some runs I got a few more containers, but always around that number (±10).
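The replica count follows from simple arithmetic: each pod in the payload has 10 containers, so 45 replicas add 450 containers on top of whatever the node already runs. A minimal sketch of that calculation follows. The helper name `replicas_needed` and the baseline figure of 25 are hypothetical and were not part of the investigation; measure the real baseline with crictl on the node.

```shell
# Hypothetical helper: smallest replica count whose containers, added to the
# containers already running on the node, reach the target total.
replicas_needed() {
    target=$1          # total containers to reach on the node (e.g. 472)
    per_pod=$2         # containers per pod (10 in the payload)
    baseline=${3:-0}   # containers already on the node (assumed value)
    remaining=$(( target - baseline ))
    if [ "$remaining" -le 0 ]; then
        echo 0
        return
    fi
    # Ceiling division using integer arithmetic.
    echo $(( (remaining + per_pod - 1) / per_pod ))
}

replicas_needed 472 10      # prints 48
replicas_needed 472 10 25   # prints 45 (assumed baseline of 25 containers)
```

The result can be passed to `oc scale deployment/my-deployment --replicas=<n>`.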

You can look at the total number of containers with:

oc debug node/my-worker
chroot /host
watch 'echo $(( $(crictl ps | wc -l) - 1 )) - $(find /var/run/crio -type l ! -readable | wc -l)'

The output shows the number of containers (left) and the number of broken links (right). Once the total number of containers on the node exceeds 472 (±10 in my tests), the number of broken links starts to increase linearly. This is considered a symptom of the issue rather than its cause.
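The broken-link count can also be wrapped in a small function for scripting. The sketch below is an assumption-laden variant, not the exact command used above: it swaps GNU find's `! -readable` for a POSIX `test -e` check, so it counts only dangling links (links whose target no longer exists). The function name `count_broken_links` is hypothetical.

```shell
# Hypothetical helper: count symbolic links under a directory whose targets
# no longer exist.
count_broken_links() {
    dir=$1
    # `test -e` follows the link, so it fails for a dangling one;
    # `! -exec ... \;` therefore selects only the broken links.
    # The arithmetic expansion strips any padding that wc adds.
    echo $(( $(find "$dir" -type l ! -exec test -e {} \; -print | wc -l) ))
}

# On the node (after `chroot /host`), for example:
#   count_broken_links /var/run/crio
```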

To watch only the number of containers:

oc debug node/my-worker
chroot /host
watch 'echo $(( $(crictl ps | wc -l) - 1 ))'

The node's journal and the events for the failed-to-create containers' pods report:

Error: container create failed: time="2022-11-11T20:30:20Z" level=error msg="runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"
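To gauge how often this failure occurs in a saved journal or kubelet log, the matching lines can be counted. This is a minimal sketch: the function name `count_seccomp_524` is hypothetical, and it assumes each failure is logged on a single line, as in the events quoted in this report.

```shell
# Hypothetical helper: count log lines that report the seccomp filter load
# failure with errno 524.
count_seccomp_524() {
    logfile=$1
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so `|| true` keeps the function usable under `set -e`.
    grep -c 'error loading seccomp filter: errno 524' "$logfile" || true
}

# Example, assuming the attached log has been downloaded locally:
#   count_seccomp_524 create_container_failure_kubelet.log
```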

[[ OLD DESCRIPTION ]]

Description of problem:

When upgrading a 4.11 arm64 cluster (05_aarch64_IPI on AWS, private cluster, FIPS on, OVN, etcd encryption) to 4.12, the image-registry pods failed to start with the error "runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524", which blocked the upgrade process.

10-19 22:16:00.802      Message:               Available: The registry has minimum availability
10-19 22:16:00.802  NodeCADaemonAvailable: The daemon set node-ca has available replicas
10-19 22:16:00.802  ImagePrunerAvailable: Pruner CronJob has been created
10-19 22:16:00.802      Reason:                MinimumAvailability
10-19 22:16:00.802      Status:                True
10-19 22:16:00.802      Type:                  Available
10-19 22:16:00.802      Last Transition Time:  2022-10-19T13:35:48Z
10-19 22:16:00.802      Message:               Progressing: The deployment has not completed
10-19 22:16:00.802  NodeCADaemonProgressing: The daemon set node-ca is deployed
10-19 22:16:00.802      Reason:                DeploymentNotCompleted
10-19 22:16:00.802      Status:                True
10-19 22:16:00.802      Type:                  Progressing
10-19 22:16:00.802      Last Transition Time:  2022-10-19T13:37:48Z
10-19 22:16:00.802      Message:               Degraded: Registry deployment has timed out progressing: ReplicaSet "image-registry-658cd9b654" has timed out progressing.
10-19 22:16:00.802      Reason:                ProgressDeadlineExceeded
10-19 22:16:00.802      Status:                True
10-19 22:16:00.802      Type:                  Degraded
10-19 22:16:00.802    Extension:               <nil>

10-19 22:16:03.025 38m Warning Failed pod/image-registry-658cd9b654-fcnrh Error: container create failed: time="2022-10-19T13:37:41Z" level=error msg="runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"
10-19 22:16:03.025 15m Warning Failed pod/image-registry-658cd9b654-fcnrh (combined from similar events): Error: container create failed: time="2022-10-19T14:00:50Z" level=error msg="runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524"

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-arm64-2022-10-19-063757 to 4.12.0-0.nightly-arm64-2022-10-18-153953

How reproducible:

Not always

Steps to Reproduce:

1. Upgrade a 4.11.0-0.nightly-arm64-2022-10-19-063757 cluster to 4.12.0-0.nightly-arm64-2022-10-18-153953.

Actual results:

Image registry pods failed to start on 4.12

Expected results:

Image registry should upgrade successfully

Additional info:

must-gather log: https://drive.google.com/file/d/1SAC82YC-g7s8OiqnBMptf4DVyp6YsEKw/view?usp=sharing

Attachments:

- deployment-1.yaml (3 kB, 2022/11/14 9:35 AM)
- create_container_failure_kubelet.log (104 kB, 2022/11/16 7:12 PM)

duplicates:

- OCPBUGS-6981 error 524 from seccomp(2) when trying to load filter [rhel-8.6.0.z] (Closed)

is duplicated by:

- OCPBUGS-708 UpdatingKubeStateMetricsFailed before Upgrade (Closed)
- OCPBUGS-2302 4.11 upgrade to 4.12, prometheus-operator-admission-webhook pod is failed to start up due to "error loading seccomp filter into kernel: error loading seccomp filter: errno 524" (Closed)
- OCPBUGS-1882 runc create failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524 (Closed)

relates to:

- RUN-1668 Impact: 4.11 upgrade to 4.12, prometheus-operator-admission-webhook pod is failed to start up due to "error loading seccomp filter into kernel: error loading seccomp filter: errno 524" (Closed)

links to:

- [BZ2140163] error 524 from seccomp(2) when trying to load filter
- KCS 7030968: Error loading seccomp filter into kernel: errno 524

Assignee: Jeff Young
Reporter: XiuJuan Wang
QA Contact: Alessandro Di Stefano
Votes: 1
Watchers: 24
Created: 2022/10/20 10:37 AM
Updated: 2023/09/19 8:17 AM
Resolved: 2023/02/03 9:16 PM
