Bug
Resolution: Unresolved
Undefined
None
4.22
None
This is a clone of issue OCPBUGS-78192. The following is the description of the original issue:
—
This is a clone of issue OCPBUGS-78191. The following is the description of the original issue:
—
The disruption monitoring test "disruption/metrics-api connection/reused should be available throughout the test" fails intermittently in serial jobs. The cause is the upstream Kubernetes test "[sig-node] NoExecuteTaintManager Multiple Pods [Serial]".
Root Cause
The NoExecuteTaintManager test creates 2 test pods, lets the scheduler place them on worker nodes, then applies a NoExecute taint to whichever nodes those pods land on. All pods on those nodes that lack a matching toleration are evicted — including metrics-server replicas.
With 3 worker nodes and metrics-server running 2 replicas (with anti-affinity) on 2 of them, there is a ~22% probability (2/9, assuming each test pod is placed independently and uniformly across the workers) that the two test pods land on the two metrics-server nodes, one on each, causing both replicas to be evicted simultaneously. When this happens, the metrics API (an aggregated API proxied through kube-apiserver) returns 503 Service Unavailable for ~25-30 seconds until replacement pods become ready.
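The 2/9 figure can be checked by enumerating every placement of the two test pods. This is only a sketch: the node names are hypothetical, and it assumes each test pod lands independently and uniformly at random on one of the three workers, which the real scheduler does not guarantee.

```python
from fractions import Fraction
from itertools import product

# Assumption (not stated in the issue): each test pod is placed
# independently and uniformly at random on one of the worker nodes.
WORKERS = ["worker-0", "worker-1", "worker-2"]    # hypothetical node names
METRICS_SERVER_NODES = {"worker-0", "worker-1"}   # 2 replicas with anti-affinity


def both_replicas_evicted_probability(workers, metrics_nodes, test_pods=2):
    """Probability that the tainted nodes include every metrics-server node."""
    placements = list(product(workers, repeat=test_pods))
    # A placement evicts both replicas only if the set of nodes hosting
    # test pods is a superset of the set of metrics-server nodes.
    hits = sum(1 for placement in placements if metrics_nodes <= set(placement))
    return Fraction(hits, len(placements))


probability = both_replicas_evicted_probability(WORKERS, METRICS_SERVER_NODES)
print(probability)  # 2/9
```

Of the 9 equally likely placements, only the 2 that put one test pod on each metrics-server node taint both of them.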
Why the P99 Threshold Doesn't Help
The disruption threshold is calculated as the P99 of historical disruption for similar jobs over the past 3 weeks, plus a 5s grace period. However, serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter the NoExecuteTaintManager test. This results in a very low baseline (~0-1s) plus the 5s grace period, which cannot absorb the 25-30s outage that occurs whenever both replicas are evicted.
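The effect can be illustrated with made-up numbers. The sketch below is not the actual threshold code: the history (1000 runs, 30 of them serial, 7 of those hitting a 27s outage) is hypothetical, and the nearest-rank percentile is an assumption about how P99 is computed.

```python
GRACE_SECONDS = 5.0  # grace period described in the issue


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def disruption_threshold(history, pct=99, grace=GRACE_SECONDS):
    """Allowed disruption: historical percentile plus the grace period."""
    return nearest_rank_percentile(history, pct) + grace


# Hypothetical history of disruption seconds per run:
# 970 non-serial runs with no disruption, 23 serial runs that were not
# affected, and 7 serial runs (about 22% of 30) with a 27s outage.
history = [0.0] * 970 + [0.0] * 23 + [27.0] * 7

threshold = disruption_threshold(history)
print(threshold)         # 5.0
print(27.0 > threshold)  # True
```

Because the affected runs make up less than 1% of the sample, they sit above the 99th percentile and never raise the baseline, so an affected run always exceeds the threshold.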
This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations.
Fix
Filter out disruption intervals that overlap with the NoExecuteTaintManager test execution window so that this expected disruption is not counted against the threshold. This follows the existing pattern used for TopologyAwareHintsDisabled events during the same test.
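A minimal sketch of the filtering idea follows. It is not the actual monitor code: the `Interval` type, the `filter_expected_disruption` helper, and the use of plain seconds for timestamps are all hypothetical names and simplifications introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Half-open time span [start, end), in seconds since job start."""
    start: float
    end: float

    def overlaps(self, other: "Interval") -> bool:
        # Two spans overlap when each one starts before the other ends.
        return self.start < other.end and other.start < self.end


def filter_expected_disruption(disruptions, test_windows):
    """Drop disruption intervals that overlap any expected-disruption window."""
    return [
        d for d in disruptions
        if not any(d.overlaps(w) for w in test_windows)
    ]


# Hypothetical execution window of the NoExecuteTaintManager test.
taint_test = Interval(100, 160)
disruptions = [
    Interval(20, 21),    # before the test: still counted
    Interval(130, 158),  # inside the test window: ignored
    Interval(155, 170),  # starts inside, ends after: ignored
    Interval(300, 302),  # after the test: still counted
]

kept = filter_expected_disruption(disruptions, [taint_test])
print(kept)
```

Matching on overlap rather than full containment matters here, since the outage can outlast the test by the time replacement pods need to become ready.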
Example Failures
- blocks: OCPBUGS-78194 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods (New)
- clones: OCPBUGS-78192 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods (New)
- is blocked by: OCPBUGS-78192 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods (New)
- is cloned by: OCPBUGS-78194 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods (New)
- links to