OCP Technical Release Team / TRT-1559

Investigate why prow job loading is often very slow

    • Type: Story
    • Resolution: Done
    • Priority: Normal
    • Fix Version: 4.16.0

      In looking at a Component Readiness test page, we see some failures that take a long time to load:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-ovn-dualstack/1758641985364692992 (I noticed that this one resulted in messages asking me to restart Chrome)
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1767279909555671040
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1766663255406678016
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack/1765279833048223744

      We'd like to understand why it takes a long time to load these jobs and possibly take some action to remediate as much of that slowness as possible.

      Slow prow job loading makes our TRT tools seem unusable and might make it difficult for managers to inspect Component Readiness failures, which would slow down getting them resolved.

      Some ideas of what to look at:

      • see if the file size of the jobs is any bigger now than before, especially for runs with a lot of failures
      • see if the recent change that cuts down the size of the intervals is still working as expected
      • compare the file size of a passing run vs. one with a lot of failures (see the sketch after this list)
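
      As a starting point for that comparison, here is a minimal sketch (not from the original issue), assuming gsutil access to the test-platform-results bucket and that each run stores its main log at logs/<job-name>/<run-id>/build-log.txt; the run IDs below are placeholders to substitute:

      $ JOB=periodic-ci-openshift-release-master-nightly-4.16-e2e-metal-ipi-ovn-dualstack
      $ PASS=<passing-run-id>   # placeholder
      $ FAIL=<failing-run-id>   # placeholder
      $ # per-object size of the main log for each run
      $ gsutil ls -l gs://test-platform-results/logs/$JOB/$PASS/build-log.txt \
               gs://test-platform-results/logs/$JOB/$FAIL/build-log.txt
      $ # total artifact size per run, for a coarser comparison
      $ gsutil du -sh gs://test-platform-results/logs/$JOB/$PASS/ \
                      gs://test-platform-results/logs/$JOB/$FAIL/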

            [TRT-1559] Investigate why prow job loading is often very slow

            Dennis Periquet added a comment (edited)

            PR is merged and in 4.16.0-0.ci-2024-03-26-162223 (with a passing payload).

            Looked at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade/1772660995244691456, which is one of the failing jobs (it ran for 3h30m) in https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-azure-sdn-upgrade-4.16-minor-release-openshift-release-analysis-aggregator/1772660998604328960; although the log file is still 31M, it does not contain the 3 log messages removed by the PR.

            Looked at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade/1772660990232498176 (a 3h40m failure): a 30M file, but there are no event debug logs here either.
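
            For reference, a rough way to repeat that check on any post-merge run is a sketch like the one below, assuming the same bucket layout as the prow links above; <job-name> and <run-id> are placeholders:

            $ gsutil cp gs://test-platform-results/logs/<job-name>/<run-id>/build-log.txt .
            $ for m in 'processed event' 'resulting interval locator' 'resulting interval message'; do
            >   printf '%-30s ' "$m"; grep -c "msg=\"$m\"" build-log.txt
            > done

            A run built from a payload that includes the PR should report 0 for all three message types.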

            Dennis Periquet added a comment (edited)

            PR to remove event logs to reduce build-log.txt size: https://github.com/openshift/origin/pull/28654


            Dennis Periquet added a comment

            It's hard for me to believe that a 38M file can consume 4.6G, so I wonder if there's just some leaky process running on the build-log.txt file.

            Dennis Periquet added a comment

            This is the one that takes a long time to load and often results in Chrome asking you to restart it:
            https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-ovn-dualstack/1758641985364692992

            When that job actually loads, you can see that the tab consumes 4.6G of memory:

            Upon loading that page, you will see the spinning Kubernetes logos on all the subframes.  If you collapse all of them (except the one for build-log), you will see the one for build-log.txt finishing last.  In this case, build-log.txt is 38M.

            As an aside, I mentioned in a Slack thread that there are cases where a build-log.txt shows as 2.8M on the GCS bucket link (e.g., see https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/28650/pull-ci-openshift-origin-master-e2e-metal-ipi-sdn/1767551429960011776).  But then, if you download that file, it's about 10x bigger.
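
            One possible explanation for that gap (an assumption, not confirmed in this thread) is that the object is stored gzip-compressed in GCS, so the listing shows the compressed size while the download is transparently decompressed.  That can be checked from the object metadata, e.g. (assuming build-log.txt sits directly under the linked run directory):

            $ gsutil stat gs://test-platform-results/pr-logs/pull/28650/pull-ci-openshift-origin-master-e2e-metal-ipi-sdn/1767551429960011776/build-log.txt
            $ # look for Content-Encoding: gzip and compare Content-Length with the size of the downloaded file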

            Anyway, I believe it's the big build-log.txt files that slow down the prow job loads (much more so than the interval charts).

            If you run these commands on the build-log.txt file, you will see that three log messages contribute more than 60% of the lines:

            $ cat build-log.txt |grep 'level=info msg="processed event"'|wc -l
               21447
            
            $ cat build-log.txt |grep 'level=info msg="resulting interval locator"'|wc -l
               21447
            
            $ cat build-log.txt |grep 'level=info msg="resulting interval message"'|wc -l
               21447
            
            $ cat build-log.txt |grep -v 'level=info msg="processed event"'|grep -v 'level=info msg="resulting interval locator"'|grep -v 'level=info msg="resulting interval message"' > small_build-log.txt
            
            $ ls -lh build-log.txt
            -rw-r--r--  1 dperique  wheel    32M Mar 12 14:42 build-log.txt
            
            $ ls -lh small_build-log.txt
            -rw-r--r--  1 dperique  wheel   3.3M Mar 12 14:54 small_build-log.txt
            
            $ wc -l build-log.txt
               93834 build-log.txt
            

            Here is where the logs are generated.
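
            A quicker way to find the dominant message types, without hand-picking grep patterns, is to tally the logrus-style msg= values directly (a sketch based on the log format shown above):

            $ grep -o 'msg="[^"]*"' build-log.txt | sort | uniq -c | sort -rn | head

            The top few entries should account for most of the file, matching the counts above.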


              Assignee: Dennis Periquet (dperique@redhat.com)
              Reporter: Dennis Periquet (dperique@redhat.com)
              Votes: 0
              Watchers: 1