Type: Story
Resolution: Done
Priority: Normal
Fix Version/s: CNV v4.21.0
Affects Version/s: None
Component/s: CNV Install, Upgrade and Operators
Labels:
- automation_stabilization

Activity Type:
Quality / Stability / Reliability
Story Points:
0.42
Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
CNV-68136
Component Fix Version(s):
None
Git Pull Request:
https://github.com/RedHatQE/openshift-virtualization-tests/pull/3043, https://github.com/RedHatQE/openshift-virtualization-tests/pull/3093
Market:

Sprint:
CNV I/U Operators Sprint 282

Regression:
None

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Test

Path: tests/install_upgrade_operators/pod_security/test_pod_security_audit_log.py::test_cnv_pod_security_violation_audit_logs
Polarion: CNV-9115

Issue

The pod_security_violations_apis_calls fixture fails during setup while retrieving node audit logs. The subprocess.getoutput() call in get_node_audit_log_entries() has no timeout and blocked for ~99 seconds when the oc command hung, so the retry mechanism had no chance to run a second attempt.

Root Cause: the blocking subprocess.getoutput() call outlasts the entire 30-second @retry timeout before it returns, leaving no time for retries.
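
The effect can be reproduced with a simplified model of a deadline-based retry loop. This is only a sketch: `retry_until_deadline`, `FakeClock`, and `count_attempts` are hypothetical stand-ins and not the project's @retry implementation, and the model assumes the deadline is checked only between attempts. A fake clock is used so the example runs instantly.

```python
class FakeClock:
    """Deterministic clock so the sketch does not actually sleep."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def retry_until_deadline(func, clock, wait_timeout, sleep):
    """Hypothetical retry loop: the deadline is only checked between attempts."""
    attempts = 0
    while clock.now < wait_timeout:
        attempts += 1
        if func():
            return attempts, True
        clock.advance(sleep)  # pause before the next attempt
    return attempts, False


def count_attempts(call_duration, wait_timeout=30, sleep=10):
    """Count how many attempts fit when every call fails after `call_duration` seconds."""
    clock = FakeClock()

    def failing_call():
        clock.advance(call_duration)  # the call blocks for this long, then fails
        return False

    attempts, _ = retry_until_deadline(failing_call, clock, wait_timeout, sleep)
    return attempts


blocking_attempts = count_attempts(call_duration=99)  # unbounded call, as observed in build 7
bounded_attempts = count_attempts(call_duration=10)   # call capped at 10 seconds
print(blocking_attempts, bounded_attempts)
```

Under these assumptions the 99-second call is attempted once and the deadline has already passed when it returns, while a call capped at 10 seconds gets a second attempt.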

Error:

        TimeoutExpiredError: Timed Out: 30
        Function: utilities.infra.get_node_audit_log_entries
        Last exception: error: http2: server sent GOAWAY and closed the connection

Evidence

Jenkins Build 7
Timeline: subprocess hung for 99s (06:38:42 → 06:40:21) before HTTP/2 GOAWAY error
@retry timeout (30s total) expired while subprocess was still blocked
Historical: Test passed in builds 3-4; first failure in build 7 (1 failure in 3 runs, 33% fail rate)
Related fixes: PR #2750 (CNV-72975) and PR #2368 (CNV-70167) fixed similar issues in test_deprecated_apis_in_audit_logs
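
The timeline figures above can be checked with a few lines of arithmetic. The timestamps and the 30-second budget are taken from this ticket; the variable names are illustrative only.

```python
from datetime import datetime

# Timestamps from the build 7 timeline.
start = datetime.strptime("06:38:42", "%H:%M:%S")
end = datetime.strptime("06:40:21", "%H:%M:%S")

hang_seconds = int((end - start).total_seconds())
retry_budget_seconds = 30  # total @retry timeout reported in the error
overrun_seconds = hang_seconds - retry_budget_seconds

print(f"hung for {hang_seconds}s, {overrun_seconds}s past the retry budget")
```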

Proposed Fix

File: utilities/infra.py:839-863

Replace subprocess.getoutput() with subprocess.run() and a per-attempt timeout:

        @retry(
            wait_timeout=TIMEOUT_30SEC,
            sleep=TIMEOUT_10SEC,
            exceptions_dict={RuntimeError: [], subprocess.TimeoutExpired: []},
        )
        def get_node_audit_log_entries(log, node, log_entry):
            # Patterns to match errors that should trigger a retry
            error_patterns_list = [
                r"^\s*error:",
                r"Unhandled Error.*couldn't get current server API group list.*i/o timeout",
                r".*read tcp.*connection reset by peer",
            ]
            error_patterns = re.compile("|".join(f"({pattern})" for pattern in error_patterns_list))

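            # getoutput() merged stderr into stdout; keep that behavior so the
            # error patterns above still see "error: ..." lines printed by oc.
            result = subprocess.run(
                f"{OC_ADM_LOGS_COMMAND} {node} {AUDIT_LOGS_PATH}/{log} | grep {shlex.quote(log_entry)}",
                shell=True,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                timeout=10,  # Per-attempt cap so a hung oc command cannot consume the whole 30s retry window
            )
            lines = result.stdout.splitlines()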

            has_errors = any(error_patterns.search(line) for line in lines)
            if has_errors:
                if any(line.strip().startswith("404 page not found") for line in lines):
                    LOGGER.warning(f"Skipping {log} check as it was rotated:\n{lines}")
                    return True, []
                LOGGER.warning(f"oc command failed for node {node}, log {log}:\n{lines}")
                raise RuntimeError
            return True, lines

Key Changes:

Add subprocess.TimeoutExpired to exceptions_dict for @retry to handle
Replace subprocess.getoutput() with subprocess.run(timeout=10)
A 10-second timeout per attempt, plus the 10-second sleep between attempts, leaves room for at least two attempts within the 30-second total timeout
Normal audit log fetches complete in <5 seconds, so 10s is a sufficient safety margin
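
As a small, standalone illustration of the standard-library behavior these changes rely on, the sketch below uses a sleeping Python child process as a stand-in for a hung oc command (an assumption made so the example runs anywhere). It also shows where stderr text ends up when stderr is redirected to stdout, which is how subprocess.getoutput() returns its output.

```python
import subprocess
import sys

# Stand-in for a hung `oc` command: a child that sleeps far longer than the timeout.
hung_command = [sys.executable, "-c", "import time; time.sleep(30)"]

try:
    subprocess.run(
        hung_command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=1,
    )
    timed_out = False
except subprocess.TimeoutExpired as exc:
    # run() kills the child before re-raising, so control returns after about 1 second.
    timed_out = True
    reported_timeout = exc.timeout

# A child that writes only to stderr: with stderr=STDOUT the text lands in result.stdout.
noisy_command = [
    sys.executable,
    "-c",
    "import sys; print('error: simulated failure', file=sys.stderr)",
]
result = subprocess.run(
    noisy_command,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
    timeout=10,
)
merged_lines = result.stdout.splitlines()
print(timed_out, merged_lines)
```

Because subprocess.TimeoutExpired is an ordinary exception, a retry wrapper can treat it like any other retryable failure once it is listed among the handled exceptions.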

Validation

Run test_cnv_pod_security_violation_audit_logs 10+ times on a similar cluster
Verify subprocess timeout triggers retry on slow/hung oc commands
Monitor logs for retry attempts when timeouts occur
Confirm test passes when oc command succeeds on retry
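
The retry-on-timeout behavior could also be checked without a cluster by mocking subprocess.run. The following is a minimal sketch: `fetch_with_retry` and the placeholder command string are hypothetical and only approximate the shape of the real function and its @retry decorator.

```python
import subprocess
from unittest import mock


def fetch_with_retry(command, attempts=3, per_attempt_timeout=10):
    """Hypothetical helper: retry a command when a single attempt times out."""
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                command,
                shell=True,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                timeout=per_attempt_timeout,
            )
            return attempt, result.stdout.splitlines()
        except subprocess.TimeoutExpired:
            continue  # this attempt hung; try again
    raise RuntimeError(f"command still timing out after {attempts} attempts")


# Simulate a hang on the first call and a successful fetch on the second.
placeholder_command = "placeholder-oc-command"
side_effects = [
    subprocess.TimeoutExpired(cmd=placeholder_command, timeout=10),
    subprocess.CompletedProcess(
        args=placeholder_command, returncode=0, stdout="entry-1\nentry-2\n"
    ),
]

with mock.patch("subprocess.run", side_effect=side_effects) as fake_run:
    attempt_used, lines = fetch_with_retry(placeholder_command)
    call_count = fake_run.call_count

print(attempt_used, lines, call_count)
```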

Additional Context

34 audit log files processed per test run
Most files process in seconds; the timeout prevents a single hung command from failing the entire test
Pattern follows fixes in CNV-72975 and CNV-70167 for related audit log tests
HTTP/2 GOAWAY errors indicate server-side connection closure; retry is appropriate
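
To illustrate why the GOAWAY failure is classified as retryable, the sketch below applies the error patterns from the proposed fix to the error line reported in this ticket. The sample audit line is made up for the example.

```python
import re

# Patterns copied from the proposed fix.
error_patterns_list = [
    r"^\s*error:",
    r"Unhandled Error.*couldn't get current server API group list.*i/o timeout",
    r".*read tcp.*connection reset by peer",
]
error_patterns = re.compile("|".join(f"({pattern})" for pattern in error_patterns_list))

goaway_line = "error: http2: server sent GOAWAY and closed the connection"
audit_line = '{"kind":"Event","verb":"create"}'  # made-up sample of a normal log line

is_retryable = bool(error_patterns.search(goaway_line))
is_clean = error_patterns.search(audit_line) is None
print(is_retryable, is_clean)
```

The GOAWAY line matches the first pattern because it starts with "error:", so it only triggers a retry if the command's stderr is part of the lines being scanned.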

Impact

Frequency: Intermittent (1/3 executions failed)
Jobs affected: test-pytest-cnv-4.21-iuo-ocs
CI impact: False failures when oc commands hang due to transient network/server issues
Classification: Flaky test - infrastructure-related timeout, not product defect

clones

CNV-72975 [stabilization]stabilize test_deprecated_apis_in_audit_logs 3

Closed
