Bug
Resolution: Unresolved
Normal
4.20, 4.21
False
Moderate
All
CNF Compute Sprint 284
1
Description of problem:
Latency tests are described in our customer documentation here: https://docs.redhat.com/en/documentation/openshift_container_platform/4.21/html/scalability_and_performance/cnf-performing-platform-verification-latency-tests#cnf-measuring-latency_cnf-latency-tests

If the LATENCY_TEST_CPUS parameter is not supplied, the test is skipped with the following message:

[SKIPPED] Skip the test, the requested number of CPUs should be even to avoid noisy neighbor situation

This is due to the following check: https://github.com/openshift/cluster-node-tuning-operator/blame/7916d0fc178a08ff83794f3f5fff9779885340c4/test/e2e/performanceprofile/functests/4_latency/latency.go#L84

I think the code is attempting to use all the available CPUs minus 1, which results in an odd number, so the test is skipped instead of run.

Note that the example in the documentation for the hwlatdetect test omits the LATENCY_TEST_CPUS parameter, so any customer following these instructions will hit this issue.
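The suspected failure mode can be sketched in Go as follows. This is a minimal illustration, not the operator's code: the function names `defaultTestCPUs` and `shouldSkip` are hypothetical, and the "available CPUs minus 1" default is the unconfirmed assumption stated in this report.

```go
package main

import "fmt"

// defaultTestCPUs models the suspected default: every available CPU except one.
// Hypothetical helper; the real default logic in the operator may differ.
func defaultTestCPUs(available int) int {
	return available - 1
}

// shouldSkip models the parity rule quoted in the skip message:
// an odd CPU request causes the test to be skipped.
func shouldSkip(requested int) bool {
	return requested%2 != 0
}

func main() {
	// Even CPU counts are used here purely as sample inputs.
	for _, available := range []int{4, 8, 16} {
		requested := defaultTestCPUs(available)
		fmt.Printf("available=%d requested=%d skip=%t\n",
			available, requested, shouldSkip(requested))
	}
}
```

Under this assumption, any even number of available CPUs produces an odd request, so the test would be skipped every time the parameter is omitted.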
Version-Release number of selected component (if applicable):
Seen in OCP 4.20
How reproducible:
Always
Steps to Reproduce:
1. Run a hwlatdetect test following the instructions in the customer documentation referenced above
Actual results:
The test is skipped
Expected results:
The test runs
Additional info:
We'll need to decide on an appropriate default value for LATENCY_TEST_CPUS if it is not supplied. Using all the available CPUs is probably not a good idea, as we do not recommend running latency tests on all or most of the CPUs in a server.

The other option would be to make LATENCY_TEST_CPUS a mandatory parameter, but that is a bit unintuitive for the hwlatdetect test, which actually runs on ALL CPUs (using a kernel tracer), regardless of how many CPUs the cnf-tests container is using.

Also note that the workaround is just to provide the LATENCY_TEST_CPUS parameter to the command. Given that there is a simple workaround, I would recommend that we only fix this in the current release.

Also note that this will likely require a customer documentation update if we change the default value for LATENCY_TEST_CPUS. The documentation currently says:

"LATENCY_TEST_CPUS: Specifies the number of CPUs that the pod running the latency tests uses. If you do not set the variable, the default configuration includes all isolated CPUs."
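One possible shape for the first option, an explicit even default, is sketched below in Go. Everything here is an assumption for discussion: `resolveTestCPUs` and `fallbackCPUs` are hypothetical names, the fallback value of 2 is a placeholder rather than a recommendation, and the isolated CPU count is passed in as a plain integer instead of being read from a performance profile.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// fallbackCPUs is a placeholder default used when LATENCY_TEST_CPUS is unset.
// The actual value would need to be agreed on; 2 is only an example.
const fallbackCPUs = 2

// resolveTestCPUs returns the CPU count to request for the test pod.
// An explicit value is returned unchanged; otherwise the fallback is capped
// at the isolated CPU count and rounded down to an even number.
func resolveTestCPUs(envValue string, isolated int) (int, error) {
	if envValue != "" {
		n, err := strconv.Atoi(envValue)
		if err != nil {
			return 0, fmt.Errorf("invalid LATENCY_TEST_CPUS %q: %w", envValue, err)
		}
		if n <= 0 {
			return 0, fmt.Errorf("LATENCY_TEST_CPUS must be positive, got %d", n)
		}
		return n, nil
	}
	n := fallbackCPUs
	if isolated < n {
		n = isolated
	}
	n -= n % 2 // round down to an even count
	if n < 2 {
		return 0, fmt.Errorf("not enough isolated CPUs (%d) for an even default", isolated)
	}
	return n, nil
}

func main() {
	// The isolated CPU count of 8 is a sample value for illustration.
	cpus, err := resolveTestCPUs(os.Getenv("LATENCY_TEST_CPUS"), 8)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("requesting CPUs:", cpus)
}
```

A user-supplied value is deliberately left untouched in this sketch, so the existing parity check would still apply to it; only the unset case changes behavior.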