  1. OpenShift Cloud
  2. OCPCLOUD-1733

Improve spot test resiliency to insufficient capacity


    • Type: Story
    • Resolution: Done
    • Priority: Normal
    • CLOUD Sprint 227, CLOUD Sprint 228

      Background

      The spot test has started failing far more often than we would expect. Reviewing the test failures, this appears to be caused by insufficient spot capacity on the infrastructure provider.

      We should improve this test so that it can reliably obtain capacity, which should ideally reduce the flakes.

      Some options:

      • Use multiple machinesets across failure domains to get at least one spot instance
      • Try failure domains in turn until one works
      • Try multiple instance types
      • Do the clouds have a way to tell us there is no capacity?

      Steps

      • Identify improvements we can make to this test
      • Implement the improvements
      • Verify that the test is passing again, and consistently

      Stakeholders

      • Cluster Infra

      Definition of Done

      • Spot tests are re-enabled on Azure and AWS
      • Spot tests pass when one failure domain has no capacity for the requested instance type
      • Docs: N/A
      • Testing: N/A

              rmanak@redhat.com Radek Manak
              joelspeed Joel Speed