OpenShift Bugs / OCPBUGS-4364

[4.10] OSLAT runner uses both sibling threads causing latency spikes


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • Version: 4.11.z
    • Category: Quality / Stability / Reliability
    • Release Note Type: Bug Fix
    • Doc Text:
      *Cause*: What actions or circumstances cause this bug to present.
      *Consequence*: What happens when the bug presents.
      *Fix*: What was done to fix the bug.
      *Result*: Bug doesn’t present anymore.

      This is a clone of issue OCPBUGS-2596. The following is the description of the original issue:

      This bug is a backport clone of [Bugzilla Bug 2051443](https://bugzilla.redhat.com/show_bug.cgi?id=2051443). The following is the description of the original bug:

      Description of problem:

      The CNF tests oslat runner configures oslat to use all available CPUs, even when some of them are sibling hyper-threads of the same physical core. This causes false latency spikes.

      Version-Release number of selected component (if applicable):

      All

      How reproducible:

      Always, by following the documentation on nodes with hyper-threading (HT) enabled

      Steps to Reproduce:
      1. Configure the OCP cluster using PAO, keeping hyper-threading enabled
      2. Run oslat according to the docs, requesting 7 CPUs
      3. Observe that oslat uses one CPU for the control thread (OK) and 6 for testing; those 6 will include both sibling threads (thread 0 and thread 1) of about 3 physical cores

      https://docs.openshift.com/container-platform/4.9/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-performing-end-to-end-tests-running-oslat

      Actual results:

      On a machine configured like this:

      ```
      cpu:
        reserved: "0,1,32,33"
        isolated: "2-31,34-63"
      ```

      Where cpus 0-31 are threads 0 of cores 0-31 and cpus 32-63 are the secondary threads (thread 1) of the same cores.
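
      The reserved/isolated fields above use the kernel's comma/range cpu-list notation. As an illustration only (this is not the runner's code, and the function name `expandCPUList` is hypothetical), expanding that notation into explicit CPU ids can be sketched in Go:

      ```go
      package main

      import (
      	"fmt"
      	"strconv"
      	"strings"
      )

      // expandCPUList expands kernel cpu-list notation such as "0,1,32,33"
      // or "2-31,34-63" into an explicit slice of CPU ids. Error handling is
      // reduced to skipping malformed entries, to keep the sketch short.
      func expandCPUList(list string) []int {
      	var cpus []int
      	for _, part := range strings.Split(list, ",") {
      		part = strings.TrimSpace(part)
      		if part == "" {
      			continue
      		}
      		if lo, hi, ok := strings.Cut(part, "-"); ok {
      			a, err1 := strconv.Atoi(lo)
      			b, err2 := strconv.Atoi(hi)
      			if err1 != nil || err2 != nil {
      				continue
      			}
      			for c := a; c <= b; c++ {
      				cpus = append(cpus, c)
      			}
      		} else if c, err := strconv.Atoi(part); err == nil {
      			cpus = append(cpus, c)
      		}
      	}
      	return cpus
      }

      func main() {
      	fmt.Println(expandCPUList("0,1,32,33"))       // [0 1 32 33]
      	fmt.Println(len(expandCPUList("2-31,34-63"))) // 60
      }
      ```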

      ```
      cat <<EOF > run-cnf-tests.sh
      sudo -E podman run --authfile ./pull_secret.txt -v $(pwd)/:/kubeconfig \
      --dns 10.20.129.82 \
      -e ROLE_WORKER_CNF=master \
      -e CLEAN_PERFORMANCE_PROFILE="false" \
      -e KUBECONFIG=/kubeconfig/kubeconfig \
      -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e LATENCY_TEST_CPUS=7 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
      registry.redhat.com/openshift4/cnf-tests-rhel8:v4.9 \
      /usr/bin/test-run.sh -ginkgo.focus="oslat"
      EOF
      ```

      Output:

      ```
      $ oc logs oslat-kwks5
      I0203 14:55:43.363056 1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/vmlinuz-4.18.0-305.30.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/0 ip=ens43f0:dhcp root=UUID=efae0a7f-874e-411d-b4ce-a9a51a1622a7 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=2-31,34-63 tuned.non_isolcpus=00000003,00000003 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,2-31,34-63 systemd.cpu_affinity=0,1,32,33 default_hugepagesz=1G hugepagesz=1G hugepages=50 idle=poll rcupdate.rcu_normal_after_boot=0 nohz_full=2-31,34-63
      I0203 14:55:43.363322 1 node.go:44] Environment information: kernel version 4.18.0-305.30.1.el8_4.x86_64
      I0203 14:55:43.363363 1 main.go:53] Running the oslat command with arguments [--duration 600 --rtprio 1 --cpu-list 3-5,34-36 --cpu-main-thread 2]
      Workload: no
      Workload mem: 0 (KiB)
      Preheat cores: 6

      Pre-heat for 1 seconds...
      Test starts...
      Test completed.
      ```

      Notice this line:

      Running the oslat command with arguments [--duration 600 --rtprio 1 --cpu-list 3-5,34-36 --cpu-main-thread 2]

      Expected results:

      Only CPUs 3-5 should be used for the test; the remaining CPUs should be left idle.
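
      A minimal sketch of the per-core filtering the runner would need before building `--cpu-list`: given the sibling layout described above (CPU c and CPU c+32 share core c), keep only one thread per physical core. The `coreOf` and `onePerCore` helpers are illustrative assumptions, not the actual runner code; on a real node the core mapping would come from `/sys/devices/system/cpu/cpu<N>/topology/core_id`.

      ```go
      package main

      import (
      	"fmt"
      	"sort"
      )

      // coreOf maps a CPU id to its physical core on the machine described
      // in this report (CPUs 0-31 are thread 0 and CPUs 32-63 are thread 1
      // of cores 0-31). This fixed layout is an assumption for illustration.
      func coreOf(cpu int) int {
      	return cpu % 32
      }

      // onePerCore keeps only the first sibling thread seen for each
      // physical core, leaving the other siblings idle.
      func onePerCore(cpus []int) []int {
      	seen := map[int]bool{}
      	var out []int
      	for _, c := range cpus {
      		core := coreOf(c)
      		if seen[core] {
      			continue
      		}
      		seen[core] = true
      		out = append(out, c)
      	}
      	sort.Ints(out)
      	return out
      }

      func main() {
      	// The runner picked 3-5 plus their siblings 34-36;
      	// only one thread per core should actually run the test.
      	fmt.Println(onePerCore([]int{3, 4, 5, 34, 35, 36})) // [3 4 5]
      }
      ```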

      Additional info:

      Here is the runner: https://github.com/openshift-kni/cnf-features-deploy/blob/953dcd664f12d39116039e76fead9d83d1d33afb/cnf-tests/pod-utils/oslat-runner/main.go

      Here is the similar runner from the performance team: https://github.com/redhat-nfvpe/container-perf-tools/blob/f641d725ffa694b735561b837abc4219753c93d8/oslat/cmd.sh#L70

              People: Talor Itzhak, OpenShift Prow Bot, Nikita Kononov
              Votes: 0
              Watchers: 6