Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.22
Component/s: Test Framework
Labels:
None

Activity Type:
None
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
None
Regression:
None

Target Backport Versions:
None
Target Version:

4.21.z
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

This is a clone of issue OCPBUGS-78191. The following is the description of the original issue:
---
The disruption monitoring test "disruption/metrics-api connection/reused should be available throughout the test" fails intermittently in serial jobs. The cause is the upstream Kubernetes test "[sig-node] NoExecuteTaintManager Multiple Pods [Serial]", which runs in the same jobs.

Root Cause

The NoExecuteTaintManager test creates 2 test pods, lets the scheduler place them on worker nodes, then applies a NoExecute taint to whichever nodes those pods land on. All pods on those nodes that lack a matching toleration are evicted — including metrics-server replicas.

With 3 worker nodes and metrics-server running 2 replicas (with anti-affinity) on 2 of them, there is a ~22% probability (2/9) that the two test pods land on the two metrics-server nodes, one on each. The 2/9 figure assumes each test pod is placed independently and uniformly across the 3 workers. When this happens, both replicas are evicted simultaneously, and the metrics API (an aggregated API proxied through kube-apiserver) returns 503 Service Unavailable for ~25-30 seconds until replacement pods become ready.
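The 2/9 figure can be checked by enumerating placements. This is a minimal sketch, assuming each test pod is placed independently and uniformly at random across the workers; the node names are hypothetical and real scheduler scoring may skew the odds.

```python
from fractions import Fraction
from itertools import product

# Hypothetical node names; the issue only states that there are 3 workers,
# 2 of which host a metrics-server replica.
workers = ["worker-a", "worker-b", "worker-c"]
metrics_server_nodes = {"worker-a", "worker-b"}

# Assumption: each of the 2 test pods is placed independently and uniformly,
# giving 3 * 3 = 9 equally likely placements.
placements = list(product(workers, repeat=2))

# Both replicas are evicted only when the tainted nodes (the nodes the test
# pods landed on) are exactly the two metrics-server nodes.
both_evicted = [p for p in placements if set(p) == metrics_server_nodes]

probability = Fraction(len(both_evicted), len(placements))
print(probability)  # prints 2/9
```

Only the placements (worker-a, worker-b) and (worker-b, worker-a) taint both metrics-server nodes, which gives 2 of 9 outcomes.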

Why the P99 Threshold Doesn't Help

The disruption threshold is calculated as the P99 of historical disruption for similar jobs over the past 3 weeks, plus a 5s grace period. However, serial jobs are a small fraction of total runs in the historical data, so the P99 is dominated by non-serial jobs that never encounter the NoExecuteTaintManager test. The result is a very low baseline (~0-1s) plus 5s of grace, which cannot absorb the 25-30s outage that follows whenever both replicas are evicted.
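The effect can be illustrated with invented numbers. The sketch below assumes a nearest-rank percentile and a made-up history of 1000 runs; the actual percentile method and data used by the threshold calculation are not stated in this issue.

```python
import math

def nearest_rank_percentile(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Invented history (seconds of disruption per run), for illustration only:
# 990 non-serial runs with 0-1s, plus 10 serial runs, 2 of which hit the outage.
non_serial = [0.0] * 900 + [1.0] * 90
serial = [0.0] * 8 + [27.0] * 2
history = non_serial + serial

GRACE_SECONDS = 5.0  # grace period described in the issue
p99 = nearest_rank_percentile(history, 99)
threshold = p99 + GRACE_SECONDS

outage = 27.0  # within the 25-30s range reported in the issue
print(f"P99={p99}s threshold={threshold}s exceeded={outage > threshold}")
```

Because the two outage samples make up only 0.2% of the history, they sit above the 99th percentile and do not raise the threshold.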

This is not a product bug — the NoExecuteTaintManager test is intentionally designed to evict pods without tolerations.

Fix

Filter out disruption intervals that overlap with the NoExecuteTaintManager test execution window so that this expected disruption is not counted against the threshold. This follows the existing pattern used for TopologyAwareHintsDisabled events during the same test.
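The filtering idea can be sketched as follows. This is not the code from the linked pull request: the `Interval` type, the function names, and the timestamps are hypothetical, and the sketch drops any disruption interval that overlaps the test window at all, which may differ from how the real fix treats partial overlaps.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    start: float  # seconds since job start (hypothetical unit)
    end: float

def overlaps(a, b):
    """True when the two half-open intervals share any time."""
    return a.start < b.end and b.start < a.end

def filter_expected_disruption(disruptions, test_windows):
    """Keep only disruption intervals that overlap no excluded test window."""
    return [
        d for d in disruptions
        if not any(overlaps(d, w) for w in test_windows)
    ]

# Hypothetical data: one NoExecuteTaintManager test window and three outages.
taint_test_windows = [Interval(600.0, 720.0)]
disruptions = [
    Interval(100.0, 101.0),  # unrelated, still counted
    Interval(650.0, 678.0),  # metrics-server eviction during the test
    Interval(900.0, 902.0),  # unrelated, still counted
]

counted = filter_expected_disruption(disruptions, taint_test_windows)
total_seconds = sum(d.end - d.start for d in counted)
print(counted, total_seconds)
```

With these inputs, the 28s outage inside the test window is excluded and only 3s of disruption is compared against the threshold.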

Example Failures

blocks

OCPBUGS-78193 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods

clones

OCPBUGS-78191 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods

MODIFIED

is blocked by

OCPBUGS-78191 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods

MODIFIED

is cloned by

OCPBUGS-78193 Disruption tests fail due to NoExecuteTaintManager serial test evicting metrics-server pods

links to

openshift/origin#30857: [release-4.21] OCPBUGS-78192: Exclude disruption during NoExecuteTaintManager serial tests

Assignee: Stephen Benjamin

Reporter: Technical Release Team Openshift

Need Info From: None

Votes: 0

Watchers: 1

Created: 2026/03/10 8:21 PM

Updated: 2026/03/11 3:42 PM
