OpenShift Virtualization / CNV-76389

test-kubevirt-cnv-4.21-operator-ocs: virt-handler rollout timeout flakiness with 3 workers


    • Type: Task
    • Resolution: Unresolved
    • CNV v4.19.z, CNV v4.20.z, CNV v4.21.z
    • CNV QE DevOps
    • Quality / Stability / Reliability

      Problem

      The test job test-kubevirt-cnv-4.21-operator-ocs is experiencing flakiness due to virt-handler rollout taking ~85 seconds on 3 workers, which is close to the 120-second timeout threshold.
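
      As a rough illustration of how little margin those two figures leave, the arithmetic can be sketched as below. The function names are made up for this example; only the 85-second and 120-second values come from this ticket.

```python
# Figures quoted in this ticket: ~85 s observed rollout, 120 s timeout.
OBSERVED_ROLLOUT_S = 85.0
TIMEOUT_S = 120.0


def headroom_seconds(observed_s: float, timeout_s: float) -> float:
    """Seconds left between the observed rollout time and the timeout."""
    return timeout_s - observed_s


def tolerated_slowdown(observed_s: float, timeout_s: float) -> float:
    """Factor by which the rollout may slow down before the timeout fires."""
    return timeout_s / observed_s


if __name__ == "__main__":
    margin = headroom_seconds(OBSERVED_ROLLOUT_S, TIMEOUT_S)
    factor = tolerated_slowdown(OBSERVED_ROLLOUT_S, TIMEOUT_S)
    print(f"headroom: {margin:.0f} s")             # headroom: 35 s
    print(f"tolerated slowdown: {factor:.2f}x")    # tolerated slowdown: 1.41x
```

      In other words, a rollout that is roughly 40% slower than the typical run is already enough to trip the assertion.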

      Test run: https://jenkins-csb-cnvqe-main.dno.corp.redhat.com/job/test-kubevirt-cnv-4.21-operator-ocs/test_results_analyzer/

      Note: this affects at least three versions (4.19 and later).

      Context

      • The job runs as part of the kubevirt-t1 scheduled job
      • It shares a single cluster with all kubevirt-t1 lanes
      • The tests are explicitly written against an upstream environment that has a single worker
      • Simply increasing the timeout doesn't solve the root cause - it just pushes the problem to the next assertion that depends on worker/virt-handler operations completing within a certain time

      Options Discussed

      Option 1: Reduce worker count

      • Pros: Matches upstream test environment (single worker)
      • Cons: Requires running the test job separately from t1-scheduled (not sharing cluster with other lanes)
      • Status: Preferred by test developers (dsionov, lyarwood)

      Option 2: Increase timeout

      • Pros: Minimal changes required
      • Cons: Doesn't address root cause - tests written for single worker environment
      • Status: Not recommended by test developers

      Option 3: Mark nodes during operator lane setup

      • Description: Make only a single node schedulable, mark others as unschedulable
      • Pros: Could simplify implementation while keeping job in t1-scheduled
      • Status: Needs investigation
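
      One possible shape for Option 3, as a minimal sketch rather than a settled design: build one `oc adm cordon` command per worker except the one that should stay schedulable. The helper names (`list_workers`, `plan_cordon`, `apply`) and the node names in the demo are hypothetical, and the sketch assumes workers carry the usual `node-role.kubernetes.io/worker` label.

```python
import subprocess
from typing import List


def list_workers() -> List[str]:
    """Return worker node names by asking the cluster (requires a logged-in `oc`)."""
    out = subprocess.run(
        ["oc", "get", "nodes", "-l", "node-role.kubernetes.io/worker", "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    # `-o name` prints entries such as "node/worker-0"; keep only the name part.
    return [line.strip().split("/", 1)[-1] for line in out.splitlines() if line.strip()]


def plan_cordon(workers: List[str], keep: str) -> List[List[str]]:
    """Build one `oc adm cordon` command for every worker except `keep`."""
    if keep not in workers:
        raise ValueError(f"{keep!r} is not one of the workers: {workers}")
    return [["oc", "adm", "cordon", node] for node in workers if node != keep]


def apply(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print the commands, and run them only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical node names; a real lane setup would call list_workers() instead.
    demo_workers = ["worker-0", "worker-1", "worker-2"]
    apply(plan_cordon(demo_workers, keep="worker-0"), dry_run=True)
```

      The lane teardown would need the matching `oc adm uncordon` calls so the other kubevirt-t1 lanes get their workers back. One point the investigation would likely need to check: virt-handler runs as a DaemonSet, and DaemonSet pods automatically tolerate the `node.kubernetes.io/unschedulable` taint, so cordoning alone may not reduce the number of nodes a virt-handler rollout has to cover.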

      Action Items

      • [ ] Investigate feasibility of Option 3 (marking nodes as unschedulable)
      • [ ] Determine where to run this test job if Option 1 is chosen (separate from t1-scheduled)
      • [ ] Implement chosen solution
      • [ ] Verify test stability after fix

      Stakeholders

      • Reporter: lbednar
      • Test Developers: dsionov, lyarwood

              dkeler@redhat.com Daniel Keler
              lbednar@redhat.com Lukas Bednar
              Daniel Sionov, Lee Yarwood