OCP Technical Release Team / TRT-2454

apiserver disruption tests on SNO failing when enabling a Feature Gate.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.21

      Since 4.21.0-0.nightly-2025-11-26-014720, the SNO upgrade job has been failing the disruption tests below. Example runs:
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-sin[...]-openshift-release-analysis-aggregator/1993563443491246080
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-sin[...]-openshift-release-analysis-aggregator/1993600285062205440 

      Failed tests and messages:

      : [Monitor:apiserver-external-availability][sig-api-machinery] disruption/cache-openshift-api apiserver/openshift-apiserver connection/new should be available throughout the test

      { backend-disruption-name/cache-openshift-api-new-connections connection/new disruption/openshift-tests was unreachable during disruption: for at least 11m45s (maxAllowed=11m28s): P99 from historical data for similar jobs over past 3 weeks: 9m33.51s added an additional 20% of grace

      : [Monitor:apiserver-external-availability][sig-api-machinery] disruption/cache-openshift-api apiserver/openshift-apiserver connection/reused should be available throughout the test

      { backend-disruption-name/cache-openshift-api-reused-connections connection/reused disruption/openshift-tests was unreachable during disruption: for at least 11m45s (maxAllowed=11m27s): P99 from historical data for similar jobs over past 3 weeks: 9m32.68s added an additional 20% of grace
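
      For reference, the maxAllowed values in the messages are simply the historical P99 plus the stated 20% grace. A minimal sketch of that arithmetic (illustrative only, not the origin implementation):

{code:go}
package main

import (
    "fmt"
    "time"
)

// allowedDisruption reproduces the arithmetic described in the failure
// messages above: the historical P99 for similar jobs plus 20% grace.
func allowedDisruption(historicalP99 time.Duration) time.Duration {
    return historicalP99 + historicalP99/5 // +20% grace
}

func main() {
    newConnP99 := 9*time.Minute + 33*time.Second + 510*time.Millisecond    // 9m33.51s
    reusedConnP99 := 9*time.Minute + 32*time.Second + 680*time.Millisecond // 9m32.68s

    fmt.Println(allowedDisruption(newConnP99).Truncate(time.Second))    // 11m28s
    fmt.Println(allowedDisruption(reusedConnP99).Truncate(time.Second)) // 11m27s
}
{code}

      The observed disruption of 11m45s exceeds both limits, which is why both tests fail.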

      Comparing the intervals between 4.21.0-0.nightly-2025-11-26-014720 and 4.21.0-0.nightly-2025-11-25-103346 (a good payload), there are clearly more apiserver operator unavailable events in the 4.21.0-0.nightly-2025-11-26-014720 run.

      The ImageStreamImportMode feature gate is newly enabled in 4.21.0-0.nightly-2025-11-26-014720. That change triggers a new deployment rollout for the apiserver operator, and because this is an SNO cluster, the extra rollout makes the disruption results worse.

      This is a consequence of the Feature Gate enablement that happens during the update, which causes one additional apiserver restart. It is expected to go away once the update no longer crosses the Feature Gate enablement.
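
      For reference, one way to confirm on which payload the gate flipped is to inspect the cluster FeatureGate status (e.g. oc get featuregate cluster -o yaml). Below is a hedged sketch using the OpenShift config clientset; the status layout it assumes comes from current releases, not from this ticket:

{code:go}
package main

import (
    "context"
    "fmt"

    configclient "github.com/openshift/client-go/config/clientset/versioned"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig of the cluster under test (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := configclient.NewForConfigOrDie(cfg)

    // The cluster-scoped FeatureGate object is always named "cluster".
    fg, err := client.ConfigV1().FeatureGates().Get(context.TODO(), "cluster", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    for _, details := range fg.Status.FeatureGates {
        for _, gate := range details.Enabled {
            if string(gate.Name) == "ImageStreamImportMode" {
                fmt.Printf("ImageStreamImportMode is enabled for payload %s\n", details.Version)
            }
        }
    }
}
{code}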

      Per the Slack discussion:

      TRT is planning to just make the single-node check very forgiving.

      Single-node disruption monitoring barely makes any sense.

      We are going to extend the grace period to make the disruption tests on SNO clusters more forgiving.
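
      A minimal sketch of that direction, assuming the allowance stays "historical P99 plus grace" and simply applies a much larger grace factor on single-node topology (the function names and the factor are hypothetical, not the actual origin change):

{code:go}
package main

import (
    "fmt"
    "time"
)

// graceFactor returns the multiplier applied on top of the historical P99.
// 1.2 matches the 20% grace quoted in the failure messages; the single-node
// value is a placeholder for whatever "very forgiving" ends up meaning.
func graceFactor(singleNodeTopology bool) float64 {
    if singleNodeTopology {
        return 3.0 // hypothetical
    }
    return 1.2
}

func maxAllowedDisruption(historicalP99 time.Duration, singleNodeTopology bool) time.Duration {
    return time.Duration(float64(historicalP99) * graceFactor(singleNodeTopology))
}

func main() {
    p99 := 9*time.Minute + 33*time.Second
    fmt.Println(maxAllowedDisruption(p99, false).Truncate(time.Second)) // 11m27s
    fmt.Println(maxAllowedDisruption(p99, true).Truncate(time.Second))  // 28m39s
}
{code}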

       

              Assignee: Johnny Liu (jialiu@redhat.com)
              Reporter: Johnny Liu (jialiu@redhat.com)