Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-22901

Isolated CPUs included in sched domain on container creation

XMLWordPrintable

    • Critical
    • No
    • OCPNODE Sprint 245 (Green), OCPNODE Sprint 246 (Green), OCPNODE Sprint 247 (Green)
    • 3
    • Rejected
    • False
    • Hide

      None

      Show
      None
    • Fix a bug where isolated CPUs were not load balanced correctly when using crun, causing latency spikes when new containers are created with load balancing disabled
    • Bug Fix

      Description of problem:

      We are seeing large latency spikes on 4.14.1 running our oslat benchmark container which has guaranteed cpus and load balancing disabled.
      
      pod annotations
      cpu-load-balancing.crio.io: "disable"
      cpu-quota.crio.io: "disable"
      irq-load-balancing.crio.io: "disable"
       
      Latency spike
      <...>-230128  [004] dn.h... 11021.702404: reschedule_entry: vector=253
      <...>-230128  [004] dn.h... 11021.702404: reschedule_exit: vector=253
      <...>-230128  [004] dn..2.. 11021.702445: sched_stat_runtime: comm=oslat pid=230128 runtime=245140588 [ns] vruntime=0 [ns]
      <...>-230128  [004] ....... 11021.702461: tracing_mark_write: oslat: Trace threshold (30 us) triggered with 47 us! 
      
      I see this when the spike occurs. Originally the isolated cpus are in a null sched domain, but we temporarily see a domain containing all cpus then it corrects itself. 
      
      [11021.526793] root domain span: 0-111 (max cpu_capacity = 1024)
      [11021.557410] root domain span: 0-1,56-57 (max cpu_capacity = 1024)
      [11021.566883] root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024)
      
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-111 (max cpu_capacity = 1024)
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-1,56-57 (max cpu_capacity = 1024)
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024)
      
      This corresponds with a container restart.
      
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.273993627Z" level=info msg="Started container" PID=236066 containerID=3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d description=openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer sandboxID=a8e54a58e3b65d4db2ad506ebf2c2d0567b65a7db81f95b050d50746e00cc4af
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.265150750Z" level=info msg="Starting container: 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d" id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.263858337Z" level=info msg="Created container 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.026785889Z" level=info msg="Removed container f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930232281Z" level=warning msg="Allowed annotations are specified for workload []"
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930058972Z" level=info msg="Creating container: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928685402Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928412204Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926784447Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926545776Z" level=info msg="Removing container: f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926431322Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus
      
      We has a similar issue addressed by https://github.com/cri-o/cri-o/pull/7208/ 
      
      Note this is with crun. 
      
      I did not run myself on 4.14.0 but our QE team did not seeing the issue on the GA release.

      Version-Release number of selected component (if applicable):

      4.14.1

      How reproducible:

      100%

      Steps to Reproduce:

      1. Spin up/down a burstable pod
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            pehunt@redhat.com Peter Hunt
            browsell@redhat.com Brent Rowsell
            Min Li Min Li
            Votes:
            0 Vote for this issue
            Watchers:
            18 Start watching this issue

              Created:
              Updated: