OpenShift Bugs / OCPBUGS-32031

oslat 45us spike 1h run on 4.14.20


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.14.z
    • Component/s: Telco Performance
    • Severity: Important
    • Sprints: CNF Ran Sprint 252, CNF Ran Sprint 253, CNF Ran Sprint 254
    • 4/24: Fix is available (RHEL-9148) - needs backport to 9.2.0.z

      Description of problem:

          On an SNO spoke with the telco DU profile applied, oslat reported a 45us latency spike during a 1h run.
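          For context, the runtimeClassName referenced in the pod spec below ("performance-openshift-node-performance-profile") is generated from a PerformanceProfile named "openshift-node-performance-profile". A minimal sketch of such a DU profile is included here for reference; the CPU ranges are assumptions, not the values from this cluster:

      apiVersion: performance.openshift.io/v2
      kind: PerformanceProfile
      metadata:
        name: openshift-node-performance-profile
      spec:
        cpu:
          # Assumption: housekeeping CPUs reserved for the OS, the rest isolated
          # for latency-sensitive workloads such as the oslat pod.
          reserved: "0-1"
          isolated: "2-47"
        realTimeKernel:
          enabled: true
        numa:
          topologyPolicy: single-numa-node
        nodeSelector:
          node-role.kubernetes.io/master: ""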
      

      Version-Release number of selected component (if applicable):

          4.14.20
      
      local-storage-operator.v4.14.0-202403261739 
      cluster-logging.v5.9.0                      
      packageserver                               
      ptp-operator.v4.14.0-202403222237           
      sriov-network-operator.v4.14.0-202402270139 
      sriov-fec.v2.8.0                            
      

      How reproducible:

          Always

      Steps to Reproduce:

          1. Deploy a DU node (SNO with the telco DU profile applied)
          2. Run the oslat test pod using the spec below
      
      [INFO] oslat git hash: ea82509d664d72992068c3a1fc41f9a66e2c3f99
      [INFO] oslat image sha: sha256:4b568365d42fd6198aafa6d7ac61a2a6dc842521acb739f05647d5f9b36cca40
      [INFO] Pod spec
      apiVersion: v1
      kind: Pod
      metadata:
        name: oslat0
        annotations:
          # Disable IRQ/CPU load balancing and CPU quota via CRI-O annotations
          irq-load-balancing.crio.io: "disable"
          cpu-load-balancing.crio.io: "disable"
          cpu-quota.crio.io: "disable"
        labels:
          app: oslat
      spec:
        restartPolicy: Never
        runtimeClassName: performance-openshift-node-performance-profile
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                   - oslat
              topologyKey: "kubernetes.io/hostname"
        containers:
        - name: container-perf-tools
          image: registry.kni-qe-22.kni.eng.rdu2.dc.redhat.com:5000/ran-test/oslat
          # Always pull the latest test image
          imagePullPolicy: Always
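          # Equal integer CPU requests and limits make this a Guaranteed QoS pod,
          # which is required for the CRI-O cpu-load-balancing/cpu-quota
          # annotations above to take effect on the allocated CPUs.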
          resources:
            limits:
              cpu: 16
              memory: 2Gi
            requests:
              cpu: 16
              memory: 2Gi
          env:
          - name: tool
            value: "oslat"
          - name: RUNTIME_SECONDS
            value: 1h
          - name: INITIAL_DELAY_SEC
            value: "30"
          - name: PRIO
            value: "1"
          - name: delay
            value: "60"
          - name: manual
            value: "n"
          - name: TRACE_THRESHOLD
            value: "20"
          securityContext:
            privileged: true
          volumeMounts:
          - mountPath: /dev/cpu_dma_latency
            name: cstate
        nodeSelector:
          node-role.kubernetes.io/master: ""
        volumes:
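        # Mounting /dev/cpu_dma_latency lets the test set the PM QoS latency
        # request and keep CPUs out of deep C-states for the duration of the run.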
        - name: cstate
          hostPath:
            path: /dev/cpu_dma_latency

      Actual results:

          oslat: Trace threshold (20 us) triggered on cpu 41 with 45 us!
      

      Expected results:

          All samples below 20us
      

      Additional info:

          trace file: http://registry.kni-qe-22.kni.eng.rdu2.dc.redhat.com:8080/images/sno.kni-qe-12.lab.eng.rdu2.redhat.com-oslat-kernel-trace.txt

            Assignee: Costa Shulyupin (rh-ee-cshulyup)
            Reporter: Marius Cornea (mcornea@redhat.com)