OCPBUGS-48617

Overloaded azure-2 CI cluster causing serial job timeouts

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      operator conditions etcd

      Significant regression detected.
      Fisher's Exact probability of a regression: 99.95%.
      Test pass rate dropped from 97.85% to 86.36%.

      Sample (being evaluated) Release: 4.18
      Start Time: 2025-01-13T00:00:00Z
      End Time: 2025-01-20T12:00:00Z
      Success Rate: 86.36%
      Successes: 19
      Failures: 3
      Flakes: 0

      Base (historical) Release: 4.17
      Start Time: 2024-09-01T00:00:00Z
      End Time: 2024-10-01T23:59:59Z
      Success Rate: 97.85%
      Successes: 91
      Failures: 2
      Flakes: 0
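
      The success rates quoted above follow directly from the success and failure counts. A minimal sketch of that arithmetic, assuming the rate is simply successes divided by total runs (flakes are zero in both windows, so they are left out); the function name is hypothetical and this is not Component Readiness's own code:

```python
def pass_rate(successes: int, failures: int) -> float:
    """Return the pass rate as a percentage, rounded to two decimals."""
    total = successes + failures
    if total == 0:
        raise ValueError("no runs recorded")
    return round(100.0 * successes / total, 2)


# Counts copied from the report above.
sample_rate = pass_rate(successes=19, failures=3)  # 4.18 sample window
base_rate = pass_rate(successes=91, failures=2)    # 4.17 base window

print(f"sample: {sample_rate}%  base: {base_rate}%")
print(f"drop: {round(base_rate - sample_rate, 2)} percentage points")
```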

      View the test details report for additional context.

      Further analysis found that this is caused by extremely long Boskos lease delays, in the range of 1-2 hours, which leave the serial suite too little time to complete.

      Example:
      https://storage.googleapis.com/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-serial/1880322835826610176/build-log.txt
      INFO[2025-01-17T18:43:18Z] Acquiring leases for test e2e-azure-ovn-serial: [azure-2-quota-slice]
      INFO[2025-01-17T19:49:52Z] Acquired 1 lease(s) for azure-2-quota-slice: [centralus--azure-2-quota-slice-26]
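
      The lease wait in this example can be read off the two timestamps. A minimal sketch that does so, assuming the `INFO[<UTC timestamp>]` prefix seen in the lines above; the helper name `lease_wait` is hypothetical:

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp inside the "INFO[...]" prefix of a build-log line.
TIMESTAMP_RE = re.compile(r"INFO\[([^\]]+)\]")


def parse_timestamp(line: str) -> datetime:
    """Extract the UTC timestamp from a build-log line."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")


def lease_wait(acquiring_line: str, acquired_line: str) -> timedelta:
    """Time between requesting a lease and acquiring it."""
    return parse_timestamp(acquired_line) - parse_timestamp(acquiring_line)


# The two log lines quoted in the example above.
ACQUIRING = "INFO[2025-01-17T18:43:18Z] Acquiring leases for test e2e-azure-ovn-serial: [azure-2-quota-slice]"
ACQUIRED = "INFO[2025-01-17T19:49:52Z] Acquired 1 lease(s) for azure-2-quota-slice: [centralus--azure-2-quota-slice-26]"

print(lease_wait(ACQUIRING, ACQUIRED))  # 1:06:34
```

      For this job the wait comes to 1 hour, 6 minutes and 34 seconds before the test could start.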

      The dashboard shows that this azure-2 account is often maxed out at its 57-cluster limit (select azure-2 in the top panel): https://grafana-route-ci-grafana.apps.ci.l2s4.p1.openshiftapps.com/d/628a36ebd9ef30d67e28576a5d5201fd/boskos-dashboard?orgId=1&from=now-7d&to=now

      The first option would be to rebalance Azure jobs across the available clusters. There appears to be a tool

              nmoraiti Nikolaos Moraitis
              ppawlows@redhat.com Pawel Pawlowski
              Ge Liu Ge Liu