-
Bug
-
Resolution: Unresolved
-
Critical
-
4.14.z
-
Critical
-
No
-
OCPNODE Sprint 245 (Green), OCPNODE Sprint 246 (Green), OCPNODE Sprint 247 (Green)
-
3
-
Rejected
-
False
-
-
Previously, on a Single-node OpenShift that has guaranteed CPUs and where Interrupt Request (IRQ) load balancing is disabled, large latency spikes could occur at container startup. With this update, the issue no longer occurs.
-
Release Note Not Required
-
In Progress
-
Description of problem:
We are seeing large latency spikes on 4.14.1 running our oslat benchmark container which has guaranteed cpus and load balancing disabled. pod annotations cpu-load-balancing.crio.io: "disable" cpu-quota.crio.io: "disable" irq-load-balancing.crio.io: "disable" Latency spike <...>-230128 [004] dn.h... 11021.702404: reschedule_entry: vector=253 <...>-230128 [004] dn.h... 11021.702404: reschedule_exit: vector=253 <...>-230128 [004] dn..2.. 11021.702445: sched_stat_runtime: comm=oslat pid=230128 runtime=245140588 [ns] vruntime=0 [ns] <...>-230128 [004] ....... 11021.702461: tracing_mark_write: oslat: Trace threshold (30 us) triggered with 47 us! I see this when the spike occurs. Originally the isolated cpus are in a null sched domain, but we temporarily see a domain containing all cpus then it corrects itself. [11021.526793] root domain span: 0-111 (max cpu_capacity = 1024) [11021.557410] root domain span: 0-1,56-57 (max cpu_capacity = 1024) [11021.566883] root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024) Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-111 (max cpu_capacity = 1024) Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-1,56-57 (max cpu_capacity = 1024) Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024) This corresponds with a container restart. Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.273993627Z" level=info msg="Started container" PID=236066 containerID=3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d description=openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer sandboxID=a8e54a58e3b65d4db2ad506ebf2c2d0567b65a7db81f95b050d50746e00cc4af Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.265150750Z" level=info msg="Starting container: 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d" id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.263858337Z" level=info msg="Created container 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.026785889Z" level=info msg="Removed container f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930232281Z" level=warning msg="Allowed annotations are specified for workload []" Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930058972Z" level=info msg="Creating container: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928685402Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928412204Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926784447Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926545776Z" level=info msg="Removing container: f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926431322Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus We has a similar issue addressed by https://github.com/cri-o/cri-o/pull/7208/ Note this is with crun. I did not run myself on 4.14.0 but our QE team did not seeing the issue on the GA release.
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
100%
Steps to Reproduce:
1. Spin up/down a burstable pod 2. 3.
Actual results:
Expected results:
Additional info:
- is cloned by
-
OCPBUGS-24281 [4.14] Isolated CPUs included in sched domain on container creation
- Closed
- links to