Story
Resolution: Done
Normal
Quality / Stability / Reliability
CNV I/U Operators Sprint 282
Test
Path: tests/install_upgrade_operators/pod_security/test_pod_security_audit_log.py::test_cnv_pod_security_violation_audit_logs
Polarion: CNV-9115
Issue
Test fixture pod_security_violations_apis_calls fails during setup when retrieving node audit logs. The subprocess.getoutput() call in get_node_audit_log_entries() has no timeout and blocks for ~99 seconds when the oc command hangs, preventing the retry mechanism from working effectively.
Root Cause: subprocess.getoutput() blocking call consumes the entire 30-second @retry timeout before returning, leaving no time for retries.
Error:
TimeoutExpiredError: Timed Out: 30
Function: utilities.infra.get_node_audit_log_entries
Last exception: error: http2: server sent GOAWAY and closed the connection
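The blocking behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: a Python child process that sleeps stands in for a hung oc command (an assumption made purely for illustration), and run_with_limit is a hypothetical helper. It shows that subprocess.run() with a timeout raises subprocess.TimeoutExpired once the limit passes, whereas subprocess.getoutput() accepts no timeout argument and would wait for the child to finish.

```python
import subprocess
import sys
import time

# Stand-in for a hung `oc` command: a child process that sleeps for 5 seconds.
HUNG_COMMAND = [sys.executable, "-c", "import time; time.sleep(5)"]


def run_with_limit(command, limit_seconds):
    """Hypothetical helper: run command and return (timed_out, elapsed_seconds)."""
    start = time.monotonic()
    try:
        subprocess.run(command, capture_output=True, text=True, timeout=limit_seconds)
        timed_out = False
    except subprocess.TimeoutExpired:
        # run() kills the child process and re-raises once the limit is exceeded.
        timed_out = True
    return timed_out, time.monotonic() - start


timed_out, elapsed = run_with_limit(HUNG_COMMAND, limit_seconds=0.5)
print(f"timed_out={timed_out}, elapsed={elapsed:.1f}s")
```

With shell=True, the timeout kills the shell itself; whether the commands the shell launched also exit promptly is worth checking on a real cluster.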
Evidence
- Jenkins Build 7
- Timeline: subprocess hung for 99s (06:38:42 → 06:40:21) before HTTP/2 GOAWAY error
- @retry timeout (30s total) expired while subprocess was still blocked
- Historical: Test passed in builds 3-4, first failure in build 7 (33% fail rate)
- Related fixes: PR #2750 (CNV-72975) and PR #2368 (CNV-70167) fixed similar issues in test_deprecated_apis_in_audit_logs
Proposed Fix
File: utilities/infra.py:839-863
Replace subprocess.getoutput() with subprocess.run() with timeout:
# Relies on re, shlex and subprocess plus the module's existing retry, LOGGER and constants.
@retry(
    wait_timeout=TIMEOUT_30SEC,
    sleep=TIMEOUT_10SEC,
    exceptions_dict={RuntimeError: [], subprocess.TimeoutExpired: []},
)
def get_node_audit_log_entries(log, node, log_entry):
    # Patterns to match errors that should trigger a retry
    error_patterns_list = [
        r"^\s*error:",
        r"Unhandled Error.*couldn't get current server API group list.*i/o timeout",
        r".*read tcp.*connection reset by peer",
    ]
    error_patterns = re.compile("|".join(f"({pattern})" for pattern in error_patterns_list))
    result = subprocess.run(
        f"{OC_ADM_LOGS_COMMAND} {node} {AUDIT_LOGS_PATH}/{log} | grep {shlex.quote(log_entry)}",
        shell=True,
        stdout=subprocess.PIPE,
        # getoutput() merged stderr into its result; keep that so oc "error:" lines reach the checks below
        stderr=subprocess.STDOUT,
        text=True,
        timeout=10,  # Per-attempt limit so a hung oc command cannot consume the whole retry window
    )
    lines = result.stdout.splitlines()
    has_errors = any(error_patterns.search(line) for line in lines)
    if has_errors:
        # A rotated log returns a 404 page; treat it as "nothing to check" instead of retrying
        if any(line.strip().startswith("404 page not found") for line in lines):
            LOGGER.warning(f"Skipping {log} check as it was rotated:\n{lines}")
            return True, []
        LOGGER.warning(f"oc command failed for node {node}, log {log}:\n{lines}")
        raise RuntimeError
    return True, lines
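One behavioural difference is worth checking when moving away from subprocess.getoutput(): getoutput() returns stdout and stderr merged into one string, while subprocess.run() keeps the two streams separate unless stderr=subprocess.STDOUT is passed. Because the error patterns are matched against output lines, an error written to stderr is only seen if the streams are merged or stderr is inspected too. A minimal sketch of the difference, using a Python child process as a stand-in for oc (an assumption for illustration):

```python
import subprocess
import sys

# Child that writes one line to stdout and an oc-style error line to stderr.
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('audit entry'); print('error: simulated failure', file=sys.stderr)",
]

# Streams captured separately: the error line lands only in .stderr.
separate = subprocess.run(CHILD, capture_output=True, text=True, timeout=30)

# Streams merged into stdout, which is what subprocess.getoutput() returns.
merged = subprocess.run(
    CHILD, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, timeout=30
)


def has_error_line(text):
    """Return True if any line of text starts with 'error:'."""
    return any(line.startswith("error:") for line in text.splitlines())


error_in_separate_stdout = has_error_line(separate.stdout)
error_in_merged_stdout = has_error_line(merged.stdout)
print(error_in_separate_stdout, error_in_merged_stdout)
```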
Key Changes:
- Add subprocess.TimeoutExpired to exceptions_dict for @retry to handle
- Replace subprocess.getoutput() with subprocess.run(timeout=10)
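- 10-second timeout per attempt bounds each oc call, so a hung command still leaves room for about two attempts (the first plus one retry, counting the 10-second sleep between them) within the 30-second total timeout
- Normal audit log fetches complete in <5 seconds; 10s is a sufficient safety margin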
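The timing budget behind these changes can be checked with a small model. This is only a sketch under a stated assumption: the retry loop checks the elapsed time before each attempt and sleeps after every failed attempt. The actual @retry implementation may differ in detail, and attempts_within_budget is a hypothetical helper, not part of the project.

```python
def attempts_within_budget(wait_timeout, attempt_duration, sleep):
    """Count how many attempts can start before wait_timeout elapses.

    Assumes every attempt fails after attempt_duration seconds and is
    followed by a pause of sleep seconds.
    """
    elapsed = 0
    attempts = 0
    while elapsed < wait_timeout:
        attempts += 1
        elapsed += attempt_duration  # worst case: the attempt runs to its limit
        elapsed += sleep             # pause before the next attempt
    return attempts


# A call that blocks for 99 seconds uses up the whole 30-second window.
blocked = attempts_within_budget(wait_timeout=30, attempt_duration=99, sleep=10)

# A call bounded at 10 seconds leaves room for a second attempt.
bounded = attempts_within_budget(wait_timeout=30, attempt_duration=10, sleep=10)

print(blocked, bounded)
```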
Validation
- Run test_cnv_pod_security_violation_audit_logs 10+ times on similar cluster
- Verify subprocess timeout triggers retry on slow/hung oc commands
- Monitor logs for retry attempts when timeouts occur
- Confirm test passes when oc command succeeds on retry
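The retry-on-timeout check can also be exercised without a cluster. The sketch below is an illustration only: fetch_with_retry is a hypothetical stand-in for the decorated function, and unittest.mock replaces subprocess.run so that the first call simulates a hung command and the second succeeds.

```python
import subprocess
from unittest import mock


def fetch_with_retry(command, attempts=3, per_attempt_timeout=10):
    """Hypothetical helper: retry a command when an attempt times out."""
    last_error = None
    for _ in range(attempts):
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=per_attempt_timeout
            )
            return result.stdout.splitlines()
        except subprocess.TimeoutExpired as error:
            last_error = error  # a real retry loop would also sleep here
    raise last_error


# First call hangs (simulated by raising TimeoutExpired), second call succeeds.
fake_success = subprocess.CompletedProcess(
    args=["example-command"], returncode=0, stdout="entry-1\nentry-2\n", stderr=""
)
side_effects = [subprocess.TimeoutExpired(cmd="example-command", timeout=10), fake_success]

with mock.patch("subprocess.run", side_effect=side_effects) as fake_run:
    lines = fetch_with_retry(["example-command"])
    call_count = fake_run.call_count

print(lines, call_count)
```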
Additional Context
- 34 audit log files processed per test run
- Most files process in seconds; the per-attempt timeout prevents a single hung command from failing the entire test
- Pattern follows fixes in CNV-72975 and CNV-70167 for related audit log tests
- HTTP/2 GOAWAY errors indicate server-side connection closure; retry is appropriate
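As a quick check that the GOAWAY failure is classified as retryable, the error patterns from the proposed fix can be applied to the error line reported in this issue. This is a standalone sketch: should_retry is a hypothetical helper, and the "normal" audit entry is an invented example line, not real log output.

```python
import re

# Error patterns copied from the proposed fix.
ERROR_PATTERNS = [
    r"^\s*error:",
    r"Unhandled Error.*couldn't get current server API group list.*i/o timeout",
    r".*read tcp.*connection reset by peer",
]
ERROR_REGEX = re.compile("|".join(f"({pattern})" for pattern in ERROR_PATTERNS))


def should_retry(output_lines):
    """Return True if any output line matches a retryable error pattern."""
    return any(ERROR_REGEX.search(line) for line in output_lines)


# Error line taken from the failure reported in this issue.
goaway_output = ["error: http2: server sent GOAWAY and closed the connection"]

# Invented example of an ordinary audit log entry.
normal_output = ['{"kind":"Event","verb":"create"}']

print(should_retry(goaway_output), should_retry(normal_output))
```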
Impact
- Frequency: Intermittent (1/3 executions failed)
- Jobs affected: test-pytest-cnv-4.21-iuo-ocs
- CI impact: False failures when oc commands hang due to transient network/server issues
- Classification: Flaky test - infrastructure-related timeout, not product defect
- clones: CNV-72975 [stabilization] stabilize test_deprecated_apis_in_audit_logs 3 (Closed)