  1. OpenShift Bugs
  2. OCPBUGS-55643

Metal jobs timing out due to lack of capacity


    • Quality / Stability / Reliability
    • False
    • 3
    • None
    • None
    • None
    • None
    • Metal Platform 270, Metal Platform 271, Metal Platform 272, Metal Platform 273, Metal Platform 274
    • 5

      Context: 

       

      Metal has 90 Boskos leases allocated, which OFCIR redirects to IBM Cloud, internal infrastructure, or Equinix cloud instances. This is not enough: we frequently hit capacity and have 0 free leases. Raising the allocation to 90 improved the situation, but I think 105 is the minimum needed to run the current load.

      An alternative approach would be to spread the jobs out more. There are 14 jobs per nightly stream, and many of them are informers that could be moved to daily or twice-daily crons running in the middle of the night US Eastern time, which would spread out the load. We would not lose regression protection, since those jobs would still be monitored by Component Readiness, and lease utilization would be smoother.
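
      To illustrate why staggering start times lowers peak lease demand, here is a minimal sketch. Only the figures of 14 jobs per stream and 90 leases come from this issue; the stream count, leases per job, job duration, and the 7/7 split are hypothetical values chosen for the example, and peak_leases is an illustrative helper, not part of Boskos or OFCIR.

```python
from collections import defaultdict

def peak_leases(jobs):
    """Return the peak number of leases held at once.

    jobs: iterable of (start_hour, duration_hours, leases_needed).
    A lease released at time t can be reused by a job starting at t.
    """
    delta = defaultdict(int)
    for start, duration, leases in jobs:
        delta[start] += leases             # leases acquired at job start
        delta[start + duration] -= leases  # leases released at job end
    peak = current = 0
    for t in sorted(delta):
        current += delta[t]
        peak = max(peak, current)
    return peak

JOBS_PER_STREAM = 14   # from the issue
LEASE_POOL = 90        # from the issue
STREAMS = 5            # hypothetical
LEASES_PER_JOB = 2     # hypothetical
DURATION_H = 4         # hypothetical

# Every job in every stream starts together when the nightly payload lands.
all_at_once = [(0, DURATION_H, LEASES_PER_JOB)] * (STREAMS * JOBS_PER_STREAM)

# Hypothetical split: 7 jobs per stream stay on the payload, 7 informers
# move to a cron that starts 6 hours later.
staggered = (
    [(0, DURATION_H, LEASES_PER_JOB)] * (STREAMS * 7)
    + [(6, DURATION_H, LEASES_PER_JOB)] * (STREAMS * 7)
)

for name, jobs in (("all at once", all_at_once), ("staggered", staggered)):
    peak = peak_leases(jobs)
    print(f"{name}: peak {peak} leases, fits in pool: {peak <= LEASE_POOL}")
```

      Under these assumed numbers the total work is unchanged, but the staggered schedule halves the peak, which is the smoothing effect described above.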

      Increasing job timeouts is not an acceptable solution because it delays release payloads. If a metal job takes 6 hours and needs a retry, the resulting 12 hours is too long.

              rh-ee-tdomnesc Tudor Domnescu
              stbenjam Stephen Benjamin
              None
              None
              Jad Haj Yahya Jad Haj Yahya
              None
