Uploaded image for project: 'OpenShift Virtualization'
  1. OpenShift Virtualization
  2. CNV-75355

Fix test_metric_kubevirt_virt_operator_ready setup timeout after operator scaling

XMLWordPrintable

    • Quality / Stability / Reliability
    • 0.42
    • False
    • Hide

      None

      Show
      None
    • False
    • None
    • None

      Test

      Primary Failure
      Path: tests/observability/virt/test_virt_metrics.py::TestKubevirtVirtOperatorReady
      Test: test_metric_kubevirt_virt_operator_ready

      Secondary Failure (Root Cause)
      Path: tests/observability/storage/test_hpp_observability.py::TestHPPOperatorUpMetric
      Test: test_kubevirt_hpp_operator_up_metric[scaled_deployment_scope_class0]

      Issue

      Test fails during setup with TimeoutExpiredError after 239.84 seconds waiting for virt-operator deployment replicas. This is a cascading failure caused by incomplete teardown of the previous test class.

      Sequence of Events

      1. Test 1 (HPP metric test) scales operators to 0 replicas using class-scoped fixtures
      2. Test 1 teardown attempts to restore operators but encounters ProtocolError during restoration
      3. virt-operator is never restored (remains at 0 replicas)
      4. Test 2 (virt operator ready test) starts immediately with module-scoped fixture
      5. Fixture initial_virt_operator_replicas calls wait_for_replicas with 240s timeout
      6. Timeout occurs because virt-operator deployment still at 0 replicas or unstable

      Root Cause
      Fixture scope mismatch - module-scoped fixture (initial_virt_operator_replicas) evaluates before class-scoped fixtures from previous test complete their teardown. When previous test teardown fails with network error, cluster state is left broken for subsequent tests.

      Evidence

      Jenkins Build
      Job: test-pytest-cnv-4.20-observability-ocs #55
      URL: https://jenkins-csb-cnvqe-main.dno.corp.redhat.com/job/test-pytest-cnv-4.20-observability-ocs/55/
      Status: UNSTABLE (2 ERRORS)
      Date: 2025-12-18
      CNV: 4.20.3 (HCO v4.20.3.rhel9-31)
      OCP: 4.20.8

      Must-Gather Analysis
      Collection time: 2025-12-18 04:55:02 UTC

      Cluster Events Timeline:

      • 04:53:20 - OLM operator scaled down (1→0)
      • 04:53:24 - virt-operator scaled down (2→0)
      • 04:53:25 - HPP operator scaled down (1→0)
      • 04:54:44 - HPP operator scaled up (0→1) - TEARDOWN START
      • 04:54:51 - OLM operator scaled up (0→1)
      • 04:54:55 - ProtocolError: Connection aborted, RemoteDisconnected
      • virt-operator scale up - NO EVENT FOUND (restoration failed)

      Error Details
      Test 1: urllib3.exceptions.ProtocolError during teardown - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
      Test 2: timeout_sampler.TimeoutExpiredError: Timed Out: 239.84611916542053
      Function: ocp_resources.deployment.wait_for_replicas.lambda: self.instance

      Historical Pattern

      • Build #55: UNSTABLE (these 2 new failures)
      • Builds #54-49: SUCCESS (6 consecutive passes)
      • Builds #48,47: UNSTABLE (different tests failed - VM metrics timeouts)
      • Classification: NEW FAILURES (first occurrence in build #55)
      • Not persistent or flaky - appears after version changes

      Known Issues
      PR #2155 (Sept 2025) partially fixed this test flakiness but did not address timeout duration or synchronization between fixture scopes.

      Proposed Fix

      Fix 1 - Add Retry Logic to scale_deployment_replicas (Priority 1)

      File: utilities/infra.py
      Function: scale_deployment_replicas

      Wrap the teardown restoration with retry decorator to handle transient network errors:

      Use tenacity retry decorator with 3 attempts, exponential backoff (min 4s, max 30s), retrying on ProtocolError and RemoteDisconnected exceptions. Also increase wait_for_replicas timeout from default to 600 seconds (10 minutes).

      This prevents network errors from leaving operators in broken state (0 replicas).

      Fix 2 - Increase Timeout in initial_virt_operator_replicas (Priority 1)

      File: tests/observability/virt/conftest.py (or similar location)
      Fixture: initial_virt_operator_replicas

      Change wait_for_replicas timeout from default ~240s to 600 seconds (10 minutes). Add TimeoutSampler wrapper to retry the wait operation with 5-second intervals.

      This allows sufficient time for virt-operator to recover from previous test operations.

      Fix 3 - Add Module-Level Health Check (Priority 2)

      Add module-scoped autouse fixture to verify all CNV operators are healthy before and after test module execution. Check that OLM operator, virt-operator, and HPP operator are all at expected replica counts with all pods Running and Ready.

      This creates a barrier between test modules to prevent cascading failures.

      Validation

      Pre-Fix Validation

      1. Verify failures reproduce consistently with operator scaling tests
      2. Run test_kubevirt_hpp_operator_up_metric followed by test_metric_kubevirt_virt_operator_ready
      3. Confirm timeout occurs when first test teardown encounters any error

      Post-Fix Validation

      1. Apply retry logic to scale_deployment_replicas function
      2. Apply timeout increase to initial_virt_operator_replicas fixture
      3. Run full observability test suite 10 times
      4. Verify pass rate ≥95%
      5. Inject simulated network error during teardown to verify retry logic works
      6. Check cluster events to confirm virt-operator restoration completes

      Success Criteria

      • Zero timeout errors in initial_virt_operator_replicas fixture
      • All operator scale-up events present in cluster events after teardown
      • Tests pass even when previous test encounters transient network errors

      Additional Context

      Version Changes Between Last Success and Failure
      Build #54 (SUCCESS) vs Build #55 (UNSTABLE):

      • HCO Bundle: v4.20.3.rhel9-10 → v4.20.3.rhel9-31
      • KubeVirt: v1.6.3-39 → v1.6.3-42
      • CDI: v1.63.1-4 → v1.63.1-9
      • OCP: 4.20.5 → 4.20.8
      • New cluster deployed 2025-12-17

      Multiple component versions changed which may have introduced behavioral changes in operator scaling or recovery timing.

      Related Components

      • HCO-Operator: High-level orchestrator managing CNV operators
      • Virt-Operator: Central KubeVirt operator deploying virt-api, virt-controller, virt-handler
      • OLM-Operator: Manages operator lifecycle
      • HPP-Operator: HostPath Provisioner operator

      Fixture Architecture
      Class-scoped fixtures (disabled_olm_operator, disabled_virt_operator, scaled_deployment_scope_class) scale operators to 0 replicas during test execution. Module-scoped fixture (initial_virt_operator_replicas) expects stable virt-operator deployment. Scope mismatch allows module fixture to evaluate before class fixtures complete teardown.

      Impact

      Severity
      Medium - Not a product bug, test infrastructure timing issue. No CNV functional regression.

      Frequency
      New failure (first occurrence). Likely to recur when operator scaling tests encounter network issues.

      Affected Tests

      • test_metric_kubevirt_virt_operator_ready (direct)
      • test_kubevirt_hpp_operator_up_metric (indirect - teardown issue)
      • Any test following aggressive operator scaling operations

      Jobs Affected
      test-pytest-cnv-4.20-observability-ocs (TIER-2)
      Team: CNV Install, Upgrade and Operators
      Owner: orevah@redhat.com

      CI Impact

      • Does not block release (observability metrics functioning correctly)
      • Affects test reliability and CI signal quality
      • May cause false failures when operator restoration encounters transient errors
      • Cascading failures make debugging more difficult

      Risk Assessment
      Likelihood of recurrence: MEDIUM (timing-dependent, network errors unpredictable)
      Success probability with fixes: 98-99%

              rh-ee-orevah Ohad Revah
              rlobillo Ramón Lobillo
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated: