Story
Resolution: Unresolved
Major
Test
Primary Failure
Path: tests/observability/virt/test_virt_metrics.py::TestKubevirtVirtOperatorReady
Test: test_metric_kubevirt_virt_operator_ready
Secondary Failure (Root Cause)
Path: tests/observability/storage/test_hpp_observability.py::TestHPPOperatorUpMetric
Test: test_kubevirt_hpp_operator_up_metric[scaled_deployment_scope_class0]
Issue
The test fails during setup with a TimeoutExpiredError after 239.84 seconds of waiting for the virt-operator deployment replicas. This is a cascading failure caused by the incomplete teardown of the previous test class.
Sequence of Events
- Test 1 (HPP metric test) scales operators to 0 replicas using class-scoped fixtures
- Test 1 teardown attempts to restore operators but encounters ProtocolError during restoration
- virt-operator is never restored (remains at 0 replicas)
- Test 2 (virt operator ready test) starts immediately afterwards and sets up its module-scoped fixture
- Fixture initial_virt_operator_replicas calls wait_for_replicas with a 240s timeout
- The timeout expires because the virt-operator deployment is still at 0 replicas or unstable
Root Cause
Fixture scope mismatch: the module-scoped fixture (initial_virt_operator_replicas) is evaluated before the class-scoped fixtures from the previous test have completed their teardown. When the previous test's teardown fails with a network error, the cluster state is left broken for subsequent tests.
Evidence
Jenkins Build
Job: test-pytest-cnv-4.20-observability-ocs #55
URL: https://jenkins-csb-cnvqe-main.dno.corp.redhat.com/job/test-pytest-cnv-4.20-observability-ocs/55/
Status: UNSTABLE (2 ERRORS)
Date: 2025-12-18
CNV: 4.20.3 (HCO v4.20.3.rhel9-31)
OCP: 4.20.8
Must-Gather Analysis
Collection time: 2025-12-18 04:55:02 UTC
Cluster Events Timeline:
- 04:53:20 - OLM operator scaled down (1→0)
- 04:53:24 - virt-operator scaled down (2→0)
- 04:53:25 - HPP operator scaled down (1→0)
- 04:54:44 - HPP operator scaled up (0→1) - TEARDOWN START
- 04:54:51 - OLM operator scaled up (0→1)
- 04:54:55 - ProtocolError: Connection aborted, RemoteDisconnected
- virt-operator scale up - NO EVENT FOUND (restoration failed)
Error Details
Test 1: urllib3.exceptions.ProtocolError during teardown - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Test 2: timeout_sampler.TimeoutExpiredError: Timed Out: 239.84611916542053
Function: ocp_resources.deployment.wait_for_replicas.lambda: self.instance
Historical Pattern
- Build #55: UNSTABLE (these 2 new failures)
- Builds #49-54: SUCCESS (6 consecutive passes)
- Builds #48,47: UNSTABLE (different tests failed - VM metrics timeouts)
- Classification: NEW FAILURES (first occurrence in build #55)
- Not persistent or flaky - appears after version changes
Known Issues
PR #2155 (Sept 2025) partially fixed this test flakiness but did not address timeout duration or synchronization between fixture scopes.
Proposed Fix
Fix 1 - Add Retry Logic to scale_deployment_replicas (Priority 1)
File: utilities/infra.py
Function: scale_deployment_replicas
Wrap the teardown restoration in a retry decorator to handle transient network errors.
Use the tenacity retry decorator with 3 attempts and exponential backoff (min 4s, max 30s), retrying on ProtocolError and RemoteDisconnected exceptions. Also increase the wait_for_replicas timeout from the default to 600 seconds (10 minutes).
This prevents network errors from leaving operators in a broken state (0 replicas).
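The retry behaviour described in Fix 1 could look roughly like the sketch below. It is a standard-library stand-in for the tenacity decorator, not the actual change to utilities/infra.py: the names retry_transient and restore_replicas are hypothetical, and the deployment is represented by a plain dict so the sketch runs without a cluster.

```python
import functools
import time
from http.client import RemoteDisconnected

try:
    from urllib3.exceptions import ProtocolError
except ImportError:
    # Stand-in so the sketch still runs where urllib3 is not installed.
    class ProtocolError(Exception):
        pass

# Errors treated as transient, matching the two named in Fix 1.
TRANSIENT_ERRORS = (ProtocolError, RemoteDisconnected)


def retry_transient(attempts=3, min_wait=4.0, max_wait=30.0, sleep=time.sleep):
    """Retry the wrapped call on transient errors with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_wait
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TRANSIENT_ERRORS:
                    if attempt == attempts:
                        raise  # out of attempts: surface the original error
                    sleep(min(wait, max_wait))
                    wait *= 2  # 4s, 8s, 16s, ... capped at max_wait
        return wrapper
    return decorator


@retry_transient()
def restore_replicas(deployment, replica_count):
    # Hypothetical teardown step. A real one would update the deployment on
    # the cluster and then wait for the replicas with the 600s timeout.
    deployment["replicas"] = replica_count
    return deployment
```

With the defaults, a call that fails twice with ProtocolError waits 4s and then 8s before the third attempt. Any other exception type is raised immediately without a retry.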
Fix 2 - Increase Timeout in initial_virt_operator_replicas (Priority 1)
File: tests/observability/virt/conftest.py (or similar location)
Fixture: initial_virt_operator_replicas
Change the wait_for_replicas timeout from the default (~240s) to 600 seconds (10 minutes). Add a TimeoutSampler wrapper that retries the wait operation at 5-second intervals.
This gives virt-operator sufficient time to recover from the previous test's operations.
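The polling behaviour in Fix 2 (600-second timeout, 5-second interval) can be sketched with the standard library alone. This is an illustration of the technique, not the TimeoutSampler API. The names wait_for_ready_replicas, ReplicasTimeoutError and get_ready_replicas are hypothetical, and the clock and sleep parameters exist only so the loop can be exercised without real waiting.

```python
import time


class ReplicasTimeoutError(Exception):
    """Raised when the expected replica count is not reached in time."""


def wait_for_ready_replicas(
    get_ready_replicas,
    expected,
    timeout=600,
    interval=5,
    clock=time.monotonic,
    sleep=time.sleep,
):
    """Poll get_ready_replicas() until it returns `expected` or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        ready = get_ready_replicas()
        if ready == expected:
            return ready
        if clock() >= deadline:
            raise ReplicasTimeoutError(
                f"expected {expected} ready replicas, last saw {ready} "
                f"after {timeout}s"
            )
        sleep(interval)
```

In the fixture, get_ready_replicas would be a callable that reads the current ready replica count of the virt-operator deployment. The error message includes the last observed count, which makes a leftover 0-replica state easier to spot than a bare timeout value.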
Fix 3 - Add Module-Level Health Check (Priority 2)
Add a module-scoped autouse fixture that verifies all CNV operators are healthy before and after each test module runs. It should check that the OLM operator, virt-operator, and HPP operator are all at their expected replica counts, with all pods Running and Ready.
This creates a barrier between test modules to prevent cascading failures.
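A minimal sketch of the barrier in Fix 3 is shown below as a context manager, which has the same before/yield/after shape as a pytest yield fixture. The replica counts are assumed from the scale-down events in the must-gather timeline (1, 2 and 1), the dictionary keys are labels rather than real deployment names, and operator_health_barrier, find_unhealthy and get_ready_replicas are hypothetical names. Pod Running/Ready checks are left out.

```python
from contextlib import contextmanager

# Assumed expected counts, taken from the scale-down events in this ticket.
EXPECTED_REPLICAS = {"OLM operator": 1, "virt-operator": 2, "HPP operator": 1}


class OperatorsUnhealthyError(AssertionError):
    """Raised when an operator is not at its expected replica count."""


def find_unhealthy(expected, get_ready_replicas):
    """Return {operator: (expected, observed)} for every mismatching operator."""
    unhealthy = {}
    for name, want in expected.items():
        got = get_ready_replicas(name)
        if got != want:
            unhealthy[name] = (want, got)
    return unhealthy


@contextmanager
def operator_health_barrier(expected, get_ready_replicas):
    """Check operator health before and after the wrapped block."""
    def check(stage):
        unhealthy = find_unhealthy(expected, get_ready_replicas)
        if unhealthy:
            raise OperatorsUnhealthyError(f"{stage}: {unhealthy}")

    check("before module")
    yield
    check("after module")
```

To use this as the fixture, the body of operator_health_barrier could be placed in a function decorated with pytest.fixture(scope="module", autouse=True). A failure in the "before module" check then points at the previous module's teardown instead of surfacing as a timeout in an unrelated test.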
Validation
Pre-Fix Validation
- Verify failures reproduce consistently with operator scaling tests
- Run test_kubevirt_hpp_operator_up_metric followed by test_metric_kubevirt_virt_operator_ready
- Confirm timeout occurs when first test teardown encounters any error
Post-Fix Validation
- Apply retry logic to scale_deployment_replicas function
- Apply timeout increase to initial_virt_operator_replicas fixture
- Run full observability test suite 10 times
- Verify pass rate ≥95% (with 10 runs this means all 10 must pass, since 9/10 is 90%)
- Inject simulated network error during teardown to verify retry logic works
- Check cluster events to confirm virt-operator restoration completes
Success Criteria
- Zero timeout errors in initial_virt_operator_replicas fixture
- All operator scale-up events present in cluster events after teardown
- Tests pass even when previous test encounters transient network errors
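The criterion that every scale-up event is present after teardown can be checked mechanically. The sketch below replays the scaling events from the build #55 must-gather timeline and reports any deployment whose last recorded event left it at 0 replicas. The tuple layout and the function name unrestored_deployments are assumptions made for illustration; a real check would read the events from the cluster or from must-gather output.

```python
# (time, deployment, replicas_before, replicas_after), copied from the
# Cluster Events Timeline of build #55.
SCALING_EVENTS = [
    ("04:53:20", "OLM operator", 1, 0),
    ("04:53:24", "virt-operator", 2, 0),
    ("04:53:25", "HPP operator", 1, 0),
    ("04:54:44", "HPP operator", 0, 1),
    ("04:54:51", "OLM operator", 0, 1),
]


def unrestored_deployments(events):
    """Return deployments whose most recent scaling event ended at 0 replicas."""
    last_replicas = {}
    # HH:MM:SS strings sort chronologically within a single day.
    for _time, name, _before, after in sorted(events):
        last_replicas[name] = after
    return sorted(name for name, count in last_replicas.items() if count == 0)


print(unrestored_deployments(SCALING_EVENTS))  # ['virt-operator']
```

For the build #55 data this reports only virt-operator, which matches the missing scale-up event noted in the timeline. After the fixes, the same check should return an empty list.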
Additional Context
Version Changes Between Last Success and Failure
Build #54 (SUCCESS) vs Build #55 (UNSTABLE):
- HCO Bundle: v4.20.3.rhel9-10 → v4.20.3.rhel9-31
- KubeVirt: v1.6.3-39 → v1.6.3-42
- CDI: v1.63.1-4 → v1.63.1-9
- OCP: 4.20.5 → 4.20.8
- New cluster deployed 2025-12-17
Multiple component versions changed, which may have introduced behavioral changes in operator scaling or recovery timing.
Related Components
- HCO-Operator: High-level orchestrator managing CNV operators
- Virt-Operator: Central KubeVirt operator deploying virt-api, virt-controller, virt-handler
- OLM-Operator: Manages operator lifecycle
- HPP-Operator: HostPath Provisioner operator
Fixture Architecture
The class-scoped fixtures (disabled_olm_operator, disabled_virt_operator, scaled_deployment_scope_class) scale operators to 0 replicas during test execution. The module-scoped fixture (initial_virt_operator_replicas) expects a stable virt-operator deployment. The scope mismatch allows the module fixture to be evaluated before the class fixtures complete their teardown.
Impact
Severity
Medium - not a product bug but a test infrastructure timing issue. No CNV functional regression.
Frequency
New failure (first occurrence). Likely to recur when operator scaling tests encounter network issues.
Affected Tests
- test_metric_kubevirt_virt_operator_ready (direct)
- test_kubevirt_hpp_operator_up_metric (indirect - teardown issue)
- Any test following aggressive operator scaling operations
Jobs Affected
test-pytest-cnv-4.20-observability-ocs (TIER-2)
Team: CNV Install, Upgrade and Operators
Owner: orevah@redhat.com
CI Impact
- Does not block release (observability metrics functioning correctly)
- Affects test reliability and CI signal quality
- May cause false failures when operator restoration encounters transient errors
- Cascading failures make debugging more difficult
Risk Assessment
Likelihood of recurrence: MEDIUM (timing-dependent, network errors unpredictable)
Success probability with fixes: 98-99%