-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.14.z
-
Critical
-
No
-
Rejected
-
False
-
-
Fix a bug where isolated CPUs were not load balanced correctly when using crun, causing latency spikes when new containers are created with load balancing disabled
-
Bug Fix
-
11/30: green, fcast 4.14.6
-
Description of problem:
We are seeing large latency spikes on 4.14.1 running our oslat benchmark container which has guaranteed cpus and load balancing disabled. pod annotations cpu-load-balancing.crio.io: "disable" cpu-quota.crio.io: "disable" irq-load-balancing.crio.io: "disable" Latency spike <...>-230128 [004] dn.h... 11021.702404: reschedule_entry: vector=253 <...>-230128 [004] dn.h... 11021.702404: reschedule_exit: vector=253 <...>-230128 [004] dn..2.. 11021.702445: sched_stat_runtime: comm=oslat pid=230128 runtime=245140588 [ns] vruntime=0 [ns] <...>-230128 [004] ....... 11021.702461: tracing_mark_write: oslat: Trace threshold (30 us) triggered with 47 us! I see this when the spike occurs. Originally the isolated cpus are in a null sched domain, but we temporarily see a domain containing all cpus then it corrects itself. [11021.526793] root domain span: 0-111 (max cpu_capacity = 1024) [11021.557410] root domain span: 0-1,56-57 (max cpu_capacity = 1024) [11021.566883] root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024) Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-111 (max cpu_capacity = 1024) Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-1,56-57 (max cpu_capacity = 1024) Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024) This corresponds with a container restart. Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.273993627Z" level=info msg="Started container" PID=236066 containerID=3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d description=openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer sandboxID=a8e54a58e3b65d4db2ad506ebf2c2d0567b65a7db81f95b050d50746e00cc4af Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.265150750Z" level=info msg="Starting container: 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d" id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.263858337Z" level=info msg="Created container 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.026785889Z" level=info msg="Removed container f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930232281Z" level=warning msg="Allowed annotations are specified for workload []" Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930058972Z" level=info msg="Creating container: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928685402Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928412204Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926784447Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926545776Z" level=info msg="Removing container: f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926431322Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus We has a similar issue addressed by https://github.com/cri-o/cri-o/pull/7208/ Note this is with crun. I did not run myself on 4.14.0 but our QE team did not seeing the issue on the GA release.
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
100%
Steps to Reproduce:
1. Spin up/down a burstable pod 2. 3.
Actual results:
Expected results:
Additional info:
- clones
-
OCPBUGS-22901 Isolated CPUs included in sched domain on container creation
- Verified
- links to
-
RHBA-2023:7682 OpenShift Container Platform 4.14.z bug fix update
-
RHBA-2023:7831 OpenShift Container Platform 4.14.z bug fix update
-
RHBA-2024:0050 OpenShift Container Platform 4.14.z bug fix update