Spike
Resolution: Unresolved
Major
Background
OpenStack-related CI jobs in openshift/csi-operator (and potentially other repositories) are failing because the openstack-vh-mecha-central-quota-slice Boskos lease pool is exhausted.
The pool is currently limited to 3 concurrent leases, and multiple jobs compete for them at the same time. Jobs that cannot acquire a lease within the ~2.5-hour timeout fail with:
failed to acquire lease for "openstack-vh-mecha-central-quota-slice": resources not found
Scope
Investigate and recommend solutions to reduce or eliminate OpenStack CI job failures caused by lease contention.
Investigation Areas
1. Resource pool sizing
- Determine if OpenStack infrastructure can support additional concurrent clusters
- Evaluate impact of increasing pool size in generate-boskos.py
- Identify cost/capacity constraints
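To make the pool-size evaluation concrete, the following is a minimal sketch of how a per-type lease count could expand into Boskos-style static resource entries. It is not taken from generate-boskos.py: POOL_SIZES, render_resources, with_pool_size, and the resource naming scheme are all hypothetical, and the real script may be organized differently.

```python
# Hypothetical sketch: names and structure are assumptions, not the real
# contents of core-services/prow/02_config/generate-boskos.py.
POOL_SIZES = {
    # 3 is the current limit described in this ticket.
    "openstack-vh-mecha-central-quota-slice": 3,
}


def render_resources(pool_sizes):
    """Expand {resource type: count} into Boskos-style static resource entries."""
    resources = []
    for rtype, count in sorted(pool_sizes.items()):
        resources.append({
            "type": rtype,
            "state": "free",
            # The naming scheme below is illustrative only.
            "names": [f"{rtype}-{i:02d}" for i in range(count)],
        })
    return resources


def with_pool_size(pool_sizes, rtype, new_count):
    """Return a copy of pool_sizes with one pool resized; the input is not modified."""
    if rtype not in pool_sizes:
        raise KeyError(f"unknown resource type: {rtype}")
    if new_count < 1:
        raise ValueError("a pool needs at least one lease")
    updated = dict(pool_sizes)
    updated[rtype] = new_count
    return updated
```

Calling render_resources(with_pool_size(POOL_SIZES, "openstack-vh-mecha-central-quota-slice", 5)) would show what a 5-lease pool looks like under these assumptions. Whether the OpenStack infrastructure can actually back the extra leases is the open question for this area.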
2. Test configuration optimization
- Audit all repositories using openstack-vh-mecha-central cluster profile
- Identify tests that could use run_if_changed filters to reduce trigger frequency
- Evaluate which tests could be converted from presubmit to periodic
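As an aid for the run_if_changed audit, here is a rough sketch for estimating how often a candidate filter would trigger a job across a sample of PRs. It assumes a job runs when at least one changed path matches the regular expression. EXAMPLE_PATTERN and the file paths are hypothetical, and Python's re module is only an approximation of the Go regular expression syntax used by Prow.

```python
import re

# Hypothetical filter: the real paths in openshift/csi-operator may differ.
EXAMPLE_PATTERN = r"^(pkg|assets)/.*openstack"


def would_trigger(run_if_changed, changed_files):
    """Return True if any changed path matches the run_if_changed regex."""
    pattern = re.compile(run_if_changed)
    return any(pattern.search(path) for path in changed_files)


def trigger_rate(run_if_changed, prs):
    """Fraction of PRs (each given as a list of changed paths) that would trigger the job."""
    if not prs:
        return 0.0
    hits = sum(1 for files in prs if would_trigger(run_if_changed, files))
    return hits / len(prs)
```

Feeding trigger_rate the changed-file lists of recent PRs gives a first estimate of how many lease requests a filter would remove. Any pattern should still be validated against the actual CI tooling before it is committed.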
3. Job duration analysis
- Profile OpenStack test execution times
- Identify opportunities to reduce test duration and lease hold time
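The link between lease hold time and contention can be sketched with a simple steady-state estimate: the average number of leases in use is roughly the job arrival rate multiplied by the mean hold time. The function names and the target_utilization default below are assumptions for illustration, and the estimate ignores bursts, so it is a lower bound rather than a sizing rule.

```python
import math


def offered_load(jobs_per_hour, mean_hold_hours):
    """Average number of leases in use at steady state (arrival rate x hold time)."""
    return jobs_per_hour * mean_hold_hours


def min_pool_size(jobs_per_hour, mean_hold_hours, target_utilization=0.75):
    """Smallest pool that keeps average utilization at or below the target."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    load = offered_load(jobs_per_hour, mean_hold_hours)
    return max(1, math.ceil(load / target_utilization))
```

With purely illustrative inputs of 1.5 jobs per hour and a 2-hour hold, min_pool_size(1.5, 2.0) returns 4, while halving the hold time to 1 hour brings it down to 2. Real arrival rates and hold times would have to come from the job profiling above.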
4. Alternative approaches
- Evaluate job prioritization/queuing mechanisms (Boskos itself is problematic in this respect)
- Consider test consolidation to reduce total lease requirements
- Investigate lease timeout configuration options
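To compare these options before changing any configuration, a small queue simulation can help. The sketch below is an idealized model, not a description of how Boskos behaves: it assumes strict first-come-first-served ordering and that a job gives up once its wait exceeds the timeout. The simulate function and its parameters are hypothetical.

```python
import heapq


def simulate(jobs, pool_size, timeout):
    """Idealized FIFO lease queue.

    jobs: iterable of (arrival_hour, hold_hours) tuples.
    Returns one (wait_hours, acquired) tuple per job, in arrival order.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    # Min-heap of the times at which each lease next becomes free.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)
    results = []
    for arrival, hold in sorted(jobs):
        start = max(arrival, free_at[0])
        wait = start - arrival
        if wait > timeout:
            # The job times out and never takes a lease.
            results.append((timeout, False))
        else:
            # The job takes the earliest-free lease and holds it for `hold` hours.
            heapq.heapreplace(free_at, start + hold)
            results.append((wait, True))
    return results
```

For example, seven jobs arriving together, each holding a lease for 2 hours, leave one job timed out with a pool of 3 and a 2.5-hour timeout, and none with a pool of 4. The same harness can be rerun with shorter hold times or a different timeout to compare the alternatives listed above.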
References
- Failed job example: PR #359 build log
- Boskos config: core-services/prow/02_config/generate-boskos.py
- CI config: ci-operator/config/openshift/csi-operator/openshift-csi-operator-main.yaml
- is related to: OSASINFRA-4027 Investigate perma-failure in csi-operator CI jobs (Closed)