OCP Technical Release Team / TRT-1559

Investigate why prow job loading is often very slow

    • Type: Story
    • Resolution: Done
    • Priority: Normal
    • Fix Version: 4.16.0

      In looking at a Component Readiness test page, we see some failures that take a long time to load:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-ovn-dualstack/1758641985364692992 (I noticed that this one resulted in messages asking me to restart Chrome)
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1767279909555671040
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1766663255406678016
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1765279833048223744

      We'd like to understand why it takes a long time to load these jobs and possibly take some action to remediate as much of that slowness as possible.

      Slow prow job loading makes our TRT tools seem unusable and might make it difficult for managers to inspect Component Readiness failures, which would slow down getting them resolved.

      Some ideas of what to look at:

      • see if the file size of the jobs is any bigger now than before, especially for runs with a lot of failures
      • see if the recent change that cuts down the size of the intervals is still working as expected
      • compare the file size of a passing run vs. one with a lot of failures (see the sketch after this list)
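
      As a starting point for that comparison, here is a minimal sketch (not from the original issue), assuming gsutil access to the test-platform-results bucket and that each run stores its main log at logs/<job-name>/<run-id>/build-log.txt; the run IDs below are placeholders to substitute:

      $ JOB=periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack
      $ PASS=<passing-run-id>   # placeholder
      $ FAIL=<failing-run-id>   # placeholder
      $ # per-object size of the main log for each run
      $ gsutil ls -l gs://test-platform-results/logs/$JOB/$PASS/build-log.txt \
               gs://test-platform-results/logs/$JOB/$FAIL/build-log.txt
      $ # total artifact size per run, for a coarser comparison
      $ gsutil du -sh gs://test-platform-results/logs/$JOB/$PASS/ \
                      gs://test-platform-results/logs/$JOB/$FAIL/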

            [TRT-1559] Investigate why prow job loading is often very slow

            Dennis Periquet added a comment (edited)

            PR is merged and in 4.16.0-0.ci-2024-03-26-162223 (with a passing payload).

            Looked at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade/1772660995244691456, which is one of the failing jobs (it ran for 3h30m) in https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-azure-sdn-upgrade-4.16-minor-release-openshift-release-analysis-aggregator/1772660998604328960; although the log file is still 31M, it does not contain the 3 log messages removed by the PR.

            Looked at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade/1772660990232498176 (a 3h40m failure): a 30M file, but there are no event debug logs here either.
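
            For reference, a rough way to repeat that check on any post-merge run is a sketch like the one below, assuming the same bucket layout as the prow links above; <job-name> and <run-id> are placeholders:

            $ gsutil cp gs://test-platform-results/logs/<job-name>/<run-id>/build-log.txt .
            $ for m in 'processed event' 'resulting interval locator' 'resulting interval message'; do
            >   printf '%-30s ' "$m"; grep -c "msg=\"$m\"" build-log.txt
            > done

            A run built from a payload that includes the PR should report 0 for all three message types.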

            Dennis Periquet added a comment (edited)

            PR to remove event logs to reduce build-log.txt size: https://github.com/openshift/origin/pull/28654


            Dennis Periquet added a comment

            It's hard for me to believe that a 38M file can consume 4.6G, so I wonder if there's just some leaky process running on the build-log.txt file.

            Dennis Periquet added a comment

            This is the one that takes a long time to load and often results in Chrome asking you to restart it:
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-ovn-dualstack/1758641985364692992

            When that job actually loads, you can see that the tab consumes 4.6G of memory:

            Upon loading that page, you will see the spinning Kubernetes logos on all the subframes.  If you collapse all of them (except the one for build-log), you will see the one for build-log.txt finishing last.  In this case, build-log.txt is 38M.

            As an aside, I mentioned in a Slack thread that there are cases where a build-log.txt shows as 2.8M on the GCS bucket link (e.g., see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/28650/pull-ci-openshift-origin-master-e2e-metal-ipi-sdn/1767551429960011776).  But then, if you download that file, it's about 10x bigger.
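
            One possible explanation for that gap (an assumption, not confirmed in this thread) is that the object is stored gzip-compressed in GCS, so the listing shows the compressed size while the download is transparently decompressed.  That can be checked from the object metadata, e.g. (assuming build-log.txt sits directly under the linked run directory):

            $ gsutil stat gs://test-platform-results/pr-logs/pull/28650/pull-ci-openshift-origin-master-e2e-metal-ipi-sdn/1767551429960011776/build-log.txt
            $ # look for Content-Encoding: gzip and compare Content-Length with the size of the downloaded file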

            Anyway, I believe it's the big build-log.txt files that slow down the prow job loads (much more so than the interval charts).

            If you run these commands on the build-log.txt file, you will see that three log messages contribute more than 60% of the lines:

            $ cat build-log.txt |grep 'level=info msg="processed event"'|wc -l
               21447
            
            $ cat build-log.txt |grep 'level=info msg="resulting interval locator"'|wc -l
               21447
            
            $ cat build-log.txt |grep 'level=info msg="resulting interval message"'|wc -l
               21447
            
            $ cat build-log.txt |grep -v 'level=info msg="processed event"'|grep -v 'level=info msg="resulting interval locator"'|grep -v 'level=info msg="resulting interval message"' > small_build-log.txt
            
            $ ls -lh build-log.txt
            -rw-r--r--  1 dperique  wheel    32M Mar 12 14:42 build-log.txt
            
            $ ls -lh small_build-log.txt
            -rw-r--r--  1 dperique  wheel   3.3M Mar 12 14:54 small_build-log.txt
            
            $ wc -l build-log.txt
               93834 build-log.txt
            

            Here is where the logs are generated.
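
            A quicker way to find the dominant message types, without hand-picking grep patterns, is to tally the logrus-style msg= values directly (a sketch based on the log format shown above):

            $ grep -o 'msg="[^"]*"' build-log.txt | sort | uniq -c | sort -rn | head

            The top few entries should account for most of the file, matching the counts above.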


              Assignee: Dennis Periquet (dperique@redhat.com)
              Reporter: Dennis Periquet (dperique@redhat.com)
              Votes: 0
              Watchers: 1