Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-22901

Isolated CPUs included in sched domain on container creation

XMLWordPrintable

    • Critical
    • No
    • OCPNODE Sprint 245 (Green), OCPNODE Sprint 246 (Green), OCPNODE Sprint 247 (Green)
    • 3
    • Rejected
    • False
    • Hide

      None

      Show
      None
    • Previously, on a Single-node OpenShift that has guaranteed CPUs and where Interrupt Request (IRQ) load balancing is disabled, large latency spikes could occur at container startup. With this update, the issue no longer occurs.
    • Release Note Not Required
    • In Progress

      Description of problem:

      We are seeing large latency spikes on 4.14.1 running our oslat benchmark container which has guaranteed cpus and load balancing disabled.
      
      pod annotations
      cpu-load-balancing.crio.io: "disable"
      cpu-quota.crio.io: "disable"
      irq-load-balancing.crio.io: "disable"
       
      Latency spike
      <...>-230128  [004] dn.h... 11021.702404: reschedule_entry: vector=253
      <...>-230128  [004] dn.h... 11021.702404: reschedule_exit: vector=253
      <...>-230128  [004] dn..2.. 11021.702445: sched_stat_runtime: comm=oslat pid=230128 runtime=245140588 [ns] vruntime=0 [ns]
      <...>-230128  [004] ....... 11021.702461: tracing_mark_write: oslat: Trace threshold (30 us) triggered with 47 us! 
      
      I see this when the spike occurs. Originally the isolated cpus are in a null sched domain, but we temporarily see a domain containing all cpus then it corrects itself. 
      
      [11021.526793] root domain span: 0-111 (max cpu_capacity = 1024)
      [11021.557410] root domain span: 0-1,56-57 (max cpu_capacity = 1024)
      [11021.566883] root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024)
      
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-111 (max cpu_capacity = 1024)
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 0-1,56-57 (max cpu_capacity = 1024)
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com kernel: root domain span: 3,5,7,9,11,13,15,17-55,59,61,63,65,67,69,71,73-111 (max cpu_capacity = 1024)
      
      This corresponds with a container restart.
      
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.273993627Z" level=info msg="Started container" PID=236066 containerID=3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d description=openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer sandboxID=a8e54a58e3b65d4db2ad506ebf2c2d0567b65a7db81f95b050d50746e00cc4af
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.265150750Z" level=info msg="Starting container: 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d" id=07a2a660-14f7-46b2-aae9-6d1538f699e8 name=/runtime.v1.RuntimeService/StartContainer
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.263858337Z" level=info msg="Created container 3eea5b90fa994ca33daf8117a6d06531688edd2867ec35d51db2a9885c91e51d: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer
      Nov 02 23:46:14 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:14.026785889Z" level=info msg="Removed container f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930232281Z" level=warning msg="Allowed annotations are specified for workload []"
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.930058972Z" level=info msg="Creating container: openshift-controller-manager-operator/openshift-controller-manager-operator-f4c6d88f4-vrfd2/openshift-controller-manager-operator" id=d6e4153e-3649-48ea-af7a-d181372a5c5a name=/runtime.v1.RuntimeService/CreateContainer
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928685402Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.928412204Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=536a0adb-f228-4288-95e0-96d74fb4979a name=/runtime.v1.ImageService/ImageStatus
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926784447Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:5bf2ecfb8ee518fe71c095f772400a0cc0c384e46cb9f7cb3bb2ecc8c608af72,RepoTags:[],RepoDigests:[registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe],Size_:536605263,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{},},Pinned:false,},Info:map[string]string{},}" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926545776Z" level=info msg="Removing container: f3f05fb7e5f1776597a1fc3d141cbaedf2956ec4afbe6b0b2e94a1a11f9ff836" id=9b7c1a46-5ba3-4b71-be3d-4727eb3886e5 name=/runtime.v1.RuntimeService/RemoveContainer
      Nov 02 23:46:13 helix54.lab.eng.rdu2.redhat.com bash[8934]: time="2023-11-02 23:46:13.926431322Z" level=info msg="Checking image status: registry.kni-qe-59.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:b9673775751fde65503b8fbac964efc8c05f5262f1e55b5658a9a71983412ffe" id=048a2d4d-890f-43f0-b6f0-603293aff0b8 name=/runtime.v1.ImageService/ImageStatus
      
      We has a similar issue addressed by https://github.com/cri-o/cri-o/pull/7208/ 
      
      Note this is with crun. 
      
      I did not run myself on 4.14.0 but our QE team did not seeing the issue on the GA release.

      Version-Release number of selected component (if applicable):

      4.14.1

      How reproducible:

      100%

      Steps to Reproduce:

      1. Spin up/down a burstable pod
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            pehunt@redhat.com Peter Hunt
            browsell@redhat.com Brent Rowsell
            Min Li Min Li
            Shane Lovern Shane Lovern
            Votes:
            0 Vote for this issue
            Watchers:
            19 Start watching this issue

              Created:
              Updated: