  1. OCP Technical Release Team
  2. TRT-982

Investigate: timed out waiting for the condition on machineconfigpools/worker


    • Type: Story
    • Priority: Minor
    • Resolution: Done

      The symptom is shown in these logs in this step:

      Creating new realtime tuned profile on cluster
      tuned.tuned.openshift.io/worker-rt created
      waiting for mcp/worker condition=Updating timeout=5m
      machineconfigpool.machineconfiguration.openshift.io/worker condition met
      waiting for mcp/worker condition=Updated timeout=30m
      error: timed out waiting for the condition on machineconfigpools/worker
      
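      Based on those log lines, the step creates a realtime Tuned custom resource and then waits on the worker MachineConfigPool with oc wait. A minimal sketch of the equivalent commands (the real script and manifest live in the CI step; tuned-worker-rt.yaml is a hypothetical filename standing in for the step's actual manifest):

      # Sketch of what the step appears to run, reconstructed from the log above.
      oc create -f tuned-worker-rt.yaml

      # Wait for the worker MachineConfigPool to start rolling out the change...
      oc wait mcp/worker --for=condition=Updating --timeout=5m

      # ...then wait for the rollout to finish. This is the wait that times out after 30m.
      oc wait mcp/worker --for=condition=Updated --timeout=30m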

      The symptom occurs on the "periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade" job. We did not observe this symptom happening in other jobs.

      This problem started happening on this payload: 4.14.0-0.nightly-2023-04-20-231721 (OS version 414.92.202304201926-0)

      The previous payload does not have the problem: 4.14.0-0.nightly-2023-04-19-224500 (OS version 414.92.202304172216-0)

      This is a problem blocking 4.14 nightly payloads.

      Note the bump in RHCOS between the two payloads:

      414.92.202304201926 (failing payload)
      414.92.202304172216 (previous, passing payload)
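
      One way to confirm the machine-os (RHCOS) version carried by each payload is oc adm release info; a sketch, assuming pull access to the nightly release images in the CI registry:

      # Print the machine-os component version for both payloads.
      for payload in \
        registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-04-20-231721 \
        registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-04-19-224500; do
        echo "== ${payload}"
        oc adm release info "${payload}" | grep -i 'machine-os'
      done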

      Jobs showing the symptom fail in under 1.5 hours; jobs that get past the problem usually fail after more than 2.5 hours.

      Here's an aggregated job in which all 10 underlying job runs showed the symptom (click on "job-run-summary for aggregated" in the artifacts) and you'll see:

      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191289897357312 failure after 1h30m46s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191290732023808 failure after 1h33m26s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191292409745408 failure after 1h23m44s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191293248606208 failure after 1h25m31s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191294104244224 failure after 1h27m14s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191295811325952 failure after 1h27m33s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191296658575360 failure after 1h29m32s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191297480658944 failure after 1h21m35s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191298294353920 failure after 1h22m14s
      periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1649191299133214720 failure after 1h21m6s
      

      The step above applies a tuned configuration and waits up to 30m for the resulting change to roll out to the worker MachineConfigPool. Sometimes the rollout is not finished before the timeout.
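
      When the wait runs close to the 30m limit, it helps to see where the rollout is spending its time. A debugging sketch (not part of the CI step), using the pool status and the machine-config-daemon's per-node annotations:

      # Pool-level view: UPDATED/UPDATING/DEGRADED and machine counts.
      oc get mcp worker

      # Per-node rollout progress via the machine-config-daemon annotations
      # (state, currentConfig, desiredConfig) on each worker node.
      oc describe nodes -l node-role.kubernetes.io/worker | grep machineconfiguration.openshift.io

      # Recent machine-config-daemon activity (drain, reboot, config apply).
      oc logs -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -c machine-config-daemon --tail=50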

      Here's a job showing the symptom.

      Here's a job not showing the symptom.

      We looked in the "openshift-cluster-node-tuning-operator_tuned" and "openshift-cluster-node-tuning-operator_cluster-node-tuning-operator-58f85646cd-8hdsq_cluster-node-tuning-operator" logs.
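
      For reference, roughly the same logs can be pulled from a live cluster; a sketch assuming the usual NTO object names (operator deployment cluster-node-tuning-operator, tuned daemonset and container both named tuned):

      # Cluster Node Tuning Operator logs.
      oc logs -n openshift-cluster-node-tuning-operator deploy/cluster-node-tuning-operator --tail=200

      # tuned daemon logs (one pod per node; ds/ picks a single pod).
      oc logs -n openshift-cluster-node-tuning-operator ds/tuned -c tuned --tail=200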

      We also looked at the top-level build-log.txt files in the prow jobs for cases where the symptom occurs and where it does not. We observed that the time it takes to apply the tuned config is close to 30m, which might explain why the failure is intermittent.

      At this point, we are testing what happens if we increase the timeout in this PR, but we don't know why the rollout is taking longer than before.
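
      The contents of that PR are not reproduced here; purely as an illustration, the change amounts to raising the timeout on the second wait:

      # Illustrative only: 45m is a hypothetical value, not necessarily what the PR uses.
      oc wait mcp/worker --for=condition=Updated --timeout=45m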

      (update Apr 26, 2023): the timeout increase helps; the PR is merged and the related test failures are gone. The urgency of this drops, but we can leave it open to keep investigating the reason for the slowdown.

            Assignee: Unassigned
            Reporter: Dennis Periquet (dperique@redhat.com)
            Votes: 0
            Watchers: 3