-
Task
-
Resolution: Unresolved
-
CNV v4.19.z, CNV v4.20.z, CNV v4.21.z
-
Quality / Stability / Reliability
Problem
The test job test-kubevirt-cnv-4.21-operator-ocs is flaky because the virt-handler rollout takes ~85 seconds on a cluster with 3 workers, which leaves little headroom before the 120-second timeout.
Note: this affects at least three versions (4.19 and later).
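To make the margin concrete, the following is a minimal sketch of the headroom arithmetic using the two figures reported above (~85 seconds observed, 120-second timeout). The function name `timeout_headroom` is hypothetical and only serves this illustration.

```python
def timeout_headroom(observed_s: float, timeout_s: float) -> tuple[float, float]:
    """Return (seconds of slack, fraction of the timeout already consumed)."""
    if timeout_s <= 0:
        raise ValueError("timeout_s must be positive")
    return timeout_s - observed_s, observed_s / timeout_s


if __name__ == "__main__":
    # Figures taken from the problem statement: ~85 s rollout, 120 s timeout.
    slack, used = timeout_headroom(85, 120)
    print(f"slack={slack:.0f}s used={used:.0%}")  # prints "slack=35s used=71%"
```

With roughly 71% of the budget consumed on a typical run, a modest slowdown in the rollout is enough to cross the threshold.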
Context
- The job runs as part of the kubevirt-t1 scheduled job
- It shares a single cluster with all kubevirt-t1 lanes
- The tests are explicitly written against an upstream environment that has a single worker
- Simply increasing the timeout doesn't solve the root cause; it just pushes the problem to the next assertion that depends on worker/virt-handler operations completing within a certain time
Options Discussed
Option 1: Reduce worker count
- Pros: Matches upstream test environment (single worker)
- Cons: Requires running the test job separately from t1-scheduled (not sharing cluster with other lanes)
- Status: Preferred by test developers (dsionov, lyarwood)
Option 2: Increase timeout
- Pros: Minimal changes required
- Cons: Doesn't address the root cause, since the tests are written for a single-worker environment
- Status: Not recommended by test developers
Option 3: Mark nodes during operator lane setup
- Description: Make only a single node schedulable, mark others as unschedulable
- Pros: Could simplify implementation while keeping job in t1-scheduled
- Status: Needs investigation
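As a starting point for the Option 3 investigation, here is a minimal sketch of how lane setup might decide which workers to mark unschedulable. It only builds the command lines and does not run them. The helper names (`nodes_to_cordon`, `cordon_commands`) and the node names are hypothetical; `oc adm cordon` and `kubectl cordon` are the standard CLI commands for marking a node unschedulable. Note that DaemonSet pods generally tolerate the unschedulable taint, so whether cordoning actually shortens the virt-handler rollout is exactly what still needs to be confirmed.

```python
from typing import List, Sequence


def nodes_to_cordon(workers: Sequence[str], keep: int = 1) -> List[str]:
    """Return the workers to mark unschedulable, keeping the first `keep` by name."""
    if keep < 1:
        raise ValueError("at least one worker must stay schedulable")
    ordered = sorted(workers)  # deterministic choice of which node stays schedulable
    return ordered[keep:]


def cordon_commands(workers: Sequence[str], keep: int = 1, cli: str = "oc") -> List[List[str]]:
    """Build one cordon command per worker that should become unschedulable."""
    # `oc adm cordon <node>` on OpenShift, `kubectl cordon <node>` otherwise.
    prefix = [cli, "adm", "cordon"] if cli == "oc" else [cli, "cordon"]
    return [prefix + [node] for node in nodes_to_cordon(workers, keep)]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    for cmd in cordon_commands(["worker-2", "worker-0", "worker-1"]):
        print(" ".join(cmd))
```

If this approach is adopted, lane teardown would need the matching `uncordon` step so that the other kubevirt-t1 lanes sharing the cluster are not affected.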
Action Items
- [ ] Investigate feasibility of Option 3 (marking nodes as unschedulable)
- [ ] Determine where to run this test job if Option 1 is chosen (separate from t1-scheduled)
- [ ] Implement chosen solution
- [ ] Verify test stability after fix
Stakeholders
- Reporter: lbednar
- Test Developers: dsionov, lyarwood