-
Story
-
Resolution: Done
-
Quality / Stability / Reliability
-
CLOUD Sprint 227, CLOUD Sprint 228
Background
The spot test has started failing far more often than we would expect. From reviewing the test failures, this appears to be caused by insufficient capacity on the infrastructure provider.
We should improve this test so that it can reliably obtain capacity, which should in turn reduce the flakes.
Some options:
- Use multiple machinesets across failure domains to get at least 1 spot instance
- Try failure domains in turn until one works
- Try multiple instance types
- Do the clouds have a way to tell us there is no capacity?
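The "try failure domains in turn" and "try multiple instance types" options can be combined into a single fallback loop. The following is a minimal sketch only, not the test's actual implementation: `Placement`, `InsufficientCapacityError`, `first_placement_with_capacity` and the `try_provision` callback are all hypothetical names, and it assumes the provider reports a lack of capacity in a way the caller can turn into a distinct error.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple


class InsufficientCapacityError(Exception):
    """Hypothetical signal that the provider has no spot capacity for a placement."""


@dataclass(frozen=True)
class Placement:
    failure_domain: str
    instance_type: str


def first_placement_with_capacity(
    failure_domains: Sequence[str],
    instance_types: Sequence[str],
    try_provision: Callable[[Placement], None],
) -> Tuple[Optional[Placement], List[Placement]]:
    """Try each instance type across every failure domain, in order.

    Returns the first placement that provisions successfully (or None if
    none did), plus the placements rejected for lack of capacity.
    """
    rejected: List[Placement] = []
    # Preferred instance type first: exhaust all failure domains for it
    # before falling back to the next instance type.
    for instance_type, failure_domain in product(instance_types, failure_domains):
        placement = Placement(failure_domain, instance_type)
        try:
            # Assumption: the callback raises InsufficientCapacityError only
            # for capacity problems; any other error propagates to the caller.
            try_provision(placement)
        except InsufficientCapacityError:
            rejected.append(placement)
            continue
        return placement, rejected
    return None, rejected
```

Only capacity errors trigger a fallback here, so a genuine regression in spot provisioning would still fail the test rather than being masked by retries. Returning the rejected placements lets the test log which failure domains were out of capacity.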
Steps
- Identify improvements we can make to this test
- Implement the improvements
- Verify that the test is passing again, and consistently
Stakeholders
- Cluster Infra
Definition of Done
- Spot tests are re-enabled on Azure and AWS
- Spot tests pass when one failure domain has no capacity for the requested instance type
- Docs
- N/A
- Testing
- N/A