OpenShift Bugs / OCPBUGS-78192

Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods

    • Bug
    • Resolution: Unresolved
    • 4.22
    • Test Framework

      This is a clone of issue OCPBUGS-78191. The following is the description of the original issue:

      The disruption monitoring test "disruption/metrics-api connection/reused should be available throughout the test" fails intermittently in serial jobs due to the upstream Kubernetes [sig-node] NoExecuteTaintManager Multiple Pods [Serial] test.

      Root Cause

      The NoExecuteTaintManager test creates 2 test pods, lets the scheduler place them on worker nodes, then applies a NoExecute taint to whichever nodes those pods land on. All pods on those nodes that lack a matching toleration are evicted — including metrics-server replicas.

      With 3 worker nodes and metrics-server running 2 replicas (with anti-affinity) on 2 of them, there is a ~22% probability (2/9, assuming each test pod is equally likely to land on any worker) that the two test pods land on the two different metrics-server nodes. In that case both nodes are tainted and both replicas are evicted simultaneously. When this happens, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 Service Unavailable for ~25-30 seconds until replacement pods become ready.
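The 2/9 figure can be checked by enumerating placements. This is a minimal sketch that assumes each test pod is scheduled independently and with equal probability on any of the 3 workers (the real scheduler scores nodes, so the true odds may differ). The node names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Hypothetical node names; metrics-server replicas sit on two of the three workers.
workers = ["worker-a", "worker-b", "worker-c"]
metrics_nodes = {"worker-a", "worker-b"}

# Every (pod1_node, pod2_node) placement, assumed equally likely: 3 * 3 = 9.
placements = list(product(workers, repeat=2))

# Both replicas are evicted only when the tainted nodes cover both metrics-server nodes,
# i.e. the two test pods land on the two different metrics-server nodes.
both_evicted = [p for p in placements if metrics_nodes <= set(p)]

probability = Fraction(len(both_evicted), len(placements))
print(probability, f"{float(probability):.1%}")  # 2/9 22.2%
```

If both test pods land on the same node, or one lands on the worker without metrics-server, at most one replica is evicted and the other can keep serving.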

      Why the P99 Threshold Doesn't Help

      The disruption threshold is calculated as the P99 of historical disruption for similar jobs over the past 3 weeks, plus a 5s grace period. However, serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter the NoExecuteTaintManager test. This results in a very low baseline (~0-1s) with 5s of grace, which cannot absorb a 25-30s deterministic outage.
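The effect can be illustrated with synthetic numbers. This is a sketch only: the sample sizes and disruption values below are made up for illustration, and the nearest-rank percentile is an assumption, not necessarily the method the framework uses.

```python
def p99_nearest_rank(samples):
    """Nearest-rank P99: the value at position ceil(0.99 * n) in sorted order."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]

GRACE_SECONDS = 5  # grace period described in the issue

# Synthetic history (seconds of disruption per run), NOT real job data.
non_serial = [0] * 900 + [1] * 80  # 980 runs that never execute the taint test
serial_clean = [0] * 16            # serial runs where at most one replica was evicted
serial_hit = [27] * 4              # serial runs where both replicas were evicted

history = non_serial + serial_clean + serial_hit
threshold = p99_nearest_rank(history) + GRACE_SECONDS

print(threshold)                    # 6
print(max(serial_hit) > threshold)  # True: a 27s outage exceeds the threshold
```

In this synthetic sample the affected runs make up 0.4% of the history, so they fall above the 99th percentile and do not raise the threshold.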

      This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations.

      Fix

      Filter out disruption intervals that overlap with the NoExecuteTaintManager test execution window so that this expected disruption is not counted against the threshold. This follows the existing pattern used for TopologyAwareHintsDisabled events during the same test.
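The filtering idea can be sketched as follows. The Interval type, function names, and timestamps are hypothetical and written in Python for brevity; they do not reflect the test framework's actual types or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Hypothetical interval: start/end in seconds since the job started."""
    start: float
    end: float
    message: str = ""

def overlaps(a: Interval, b: Interval) -> bool:
    # Two half-open intervals overlap when each starts before the other ends.
    return a.start < b.end and b.start < a.end

def filter_expected_disruption(disruptions, test_windows):
    """Drop disruption intervals that overlap any expected-disruption test window."""
    return [d for d in disruptions
            if not any(overlaps(d, w) for w in test_windows)]

# Illustrative values only.
taint_test = Interval(100.0, 190.0, "NoExecuteTaintManager Multiple Pods")
disruptions = [
    Interval(20.0, 21.0, "metrics-api 503"),    # unrelated blip: still counted
    Interval(120.0, 147.0, "metrics-api 503"),  # caused by the taint test: filtered
]

counted = filter_expected_disruption(disruptions, [taint_test])
total_seconds = sum(d.end - d.start for d in counted)
print(total_seconds)  # 1.0
```

This sketch drops an overlapping interval entirely rather than trimming it to the test window, so an outage that begins during the test but ends shortly after it is also excluded. Whether the real fix should behave that way is a design choice this example does not settle.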

      Example Failures

              stbenjam Stephen Benjamin
              openshift-trt-privileged Technical Release Team Openshift