Uploaded image for project: 'OpenShift Virtualization'
  1. OpenShift Virtualization
  2. CNV-74779

[stabilization]stabilize test_deprecated_apis_in_audit_logs 4

XMLWordPrintable

    • CNV I/U Operators Sprint 282
    • None

      Test

      Path: tests/install_upgrade_operators/pod_security/test_pod_security_audit_log.py::test_cnv_pod_security_violation_audit_logs
      Polarion: CNV-9115

      Issue

      Test fixture pod_security_violations_apis_calls fails during setup when retrieving node audit logs. The subprocess.getoutput() call in get_node_audit_log_entries() has no timeout and blocks for ~99 seconds when the oc command hangs, preventing the retry mechanism from working effectively.

      Root Cause: subprocess.getoutput() blocking call consumes the entire 30-second @retry timeout before returning, leaving no time for retries.

      Error:

              TimeoutExpiredError: Timed Out: 30
              Function: utilities.infra.get_node_audit_log_entries
              Last exception: error: http2: server sent GOAWAY and closed the connection
              

      Evidence

      • Jenkins Build 7
      • Timeline: subprocess hung for 99s (06:38:42 → 06:40:21) before HTTP/2 GOAWAY error
      • @retry timeout (30s total) expired while subprocess was still blocked
      • Historical: Test passed in builds 3-4, first failure in build 7 (33% fail rate)
      • Related fixes: PR #2750 (CNV-72975) and PR #2368 (CNV-70167) fixed similar issues in test_deprecated_apis_in_audit_logs

      Proposed Fix

      File: utilities/infra.py:839-863

      Replace subprocess.getoutput() with subprocess.run() with timeout:

              @retry(
                  wait_timeout=TIMEOUT_30SEC,
                  sleep=TIMEOUT_10SEC,
                  exceptions_dict={RuntimeError: [], subprocess.TimeoutExpired: []},
              )
              def get_node_audit_log_entries(log, node, log_entry):
                  # Patterns to match errors that should trigger a retry
                  error_patterns_list = [
                      r"^\s*error:",
                      r"Unhandled Error.*couldn't get current server API group list.*i/o timeout",
                      r".*read tcp.*connection reset by peer",
                  ]
                  error_patterns = re.compile("|".join(f"({pattern})" for pattern in error_patterns_list))
      
                  result = subprocess.run(
                      f"{OC_ADM_LOGS_COMMAND} {node} {AUDIT_LOGS_PATH}/{log} | grep {shlex.quote(log_entry)}",
                      shell=True,
                      capture_output=True,
                      text=True,
                      timeout=10  # Allow 2 retries within 30s total timeout
                  )
                  lines = result.stdout.splitlines()
      
                  has_errors = any(error_patterns.search(line) for line in lines)
                  if has_errors:
                      if any(line.strip().startswith("404 page not found") for line in lines):
                          LOGGER.warning(f"Skipping {log} check as it was rotated:\n{lines}")
                          return True, []
                      LOGGER.warning(f"oc command failed for node {node}, log {log}:\n{lines}")
                      raise RuntimeError
                  return True, lines
              

      Key Changes:

      • Add subprocess.TimeoutExpired to exceptions_dict for @retry to handle
      • Replace subprocess.getoutput() with subprocess.run(timeout=10)
      • 10-second timeout per attempt allows 2 retries within 30-second total timeout
      • Normal audit log fetches complete in <5 seconds; 10s is sufficient safety margin

      Validation

      1. Run test_cnv_pod_security_violation_audit_logs 10+ times on similar cluster
      2. Verify subprocess timeout triggers retry on slow/hung oc commands
      3. Monitor logs for retry attempts when timeouts occur
      4. Confirm test passes when oc command succeeds on retry

      Additional Context

      • 34 audit log files processed per test run
      • Most files process in seconds; this prevents single hung command from failing entire test
      • Pattern follows fixes in CNV-72975 and CNV-70167 for related audit log tests
      • HTTP/2 GOAWAY errors indicate server-side connection closure; retry is appropriate

      Impact

      • Frequency: Intermittent (1/3 executions failed)
      • Jobs affected: test-pytest-cnv-4.21-iuo-ocs
      • CI impact: False failures when oc commands hang due to transient network/server issues
      • Classification: Flaky test - infrastructure-related timeout, not product defect

              rlobillo Ramón Lobillo
              rlobillo Ramón Lobillo
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated: