- Bug
- Resolution: Unresolved
- Major
- None
- 4.21
Since 4.21.0-0.nightly-2025-11-26-014720, the SNO upgrade job has been failing the following disruption tests.
For example:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-sin[...]-openshift-release-analysis-aggregator/1993563443491246080
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-sin[...]-openshift-release-analysis-aggregator/1993600285062205440
Failing tests and failure messages:

[Monitor:apiserver-external-availability][sig-api-machinery] disruption/cache-openshift-api apiserver/openshift-apiserver connection/new should be available throughout the test

backend-disruption-name/cache-openshift-api-new-connections connection/new disruption/openshift-tests was unreachable during disruption: for at least 11m45s (maxAllowed=11m28s): P99 from historical data for similar jobs over past 3 weeks: 9m33.51s; added an additional 20% of grace

[Monitor:apiserver-external-availability][sig-api-machinery] disruption/cache-openshift-api apiserver/openshift-apiserver connection/reused should be available throughout the test

backend-disruption-name/cache-openshift-api-reused-connections connection/reused disruption/openshift-tests was unreachable during disruption: for at least 11m45s (maxAllowed=11m27s): P99 from historical data for similar jobs over past 3 weeks: 9m32.68s; added an additional 20% of grace
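For context, the maxAllowed values in these messages are consistent with taking the historical P99 and adding 20% grace (9m33.51s x 1.2 ≈ 11m28s, 9m32.68s x 1.2 ≈ 11m27s). A minimal Go sketch of that arithmetic; the function name and variables are illustrative, not the actual openshift-tests code:

```go
package main

import (
	"fmt"
	"time"
)

// maxAllowedDisruption sketches how the maxAllowed values above appear to be
// derived: the historical P99 disruption for similar jobs over the past three
// weeks, plus an additional 20% of grace.
func maxAllowedDisruption(historicalP99 time.Duration) time.Duration {
	return historicalP99 + historicalP99/5 // +20% grace
}

func main() {
	observed := 11*time.Minute + 45*time.Second // 11m45s of disruption seen in the job

	// P99 values taken from the failure messages above.
	for name, p99 := range map[string]time.Duration{
		"cache-openshift-api-new-connections":    9*time.Minute + 33*time.Second + 510*time.Millisecond, // 9m33.51s
		"cache-openshift-api-reused-connections": 9*time.Minute + 32*time.Second + 680*time.Millisecond, // 9m32.68s
	} {
		allowed := maxAllowedDisruption(p99)
		fmt.Printf("%s: maxAllowed=%v observed=%v failed=%v\n",
			name, allowed.Round(time.Second), observed, observed > allowed)
	}
}
```

With the observed 11m45s of disruption, both backends exceed their computed limits, which is why both tests fail.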
Comparing intervals between 4.21.0-0.nightly-2025-11-26-014720 and 4.21.0-0.nightly-2025-11-25-103346 (a good payload), there were clearly more apiserver operator unavailable events on 4.21.0-0.nightly-2025-11-26-014720.
The ImageStreamImportMode feature gate is enabled on 4.21.0-0.nightly-2025-11-26-014720; the newly introduced ImageStreamImportMode triggered a new deployment of the apiserver operator. Because this is an SNO cluster, that extra rollout makes the disruption testing worse.
This is a consequence of a feature gate enablement that happens during that update and causes one additional apiserver restart. It is expected to go away once the update no longer crosses the feature gate enablement.
Per the Slack discussion, TRT is planning to just make the single-node check very forgiving, since single-node disruption monitoring barely makes any sense.
We are going to extend the grace period so that the disruption tests on SNO clusters are more forgiving.
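A minimal sketch of what extending the grace could look like, assuming the check simply applies a larger multiplier when the cluster is single-node; the 100% grace value and the function shape are placeholders for illustration, not the actual TRT change:

```go
package main

import (
	"fmt"
	"time"
)

// allowedDisruption keeps the current P99+20% behaviour for HA topologies,
// but applies a much larger grace when the control plane is a single node,
// where every apiserver rollout unavoidably causes downtime.
// The 100% grace for single-node is a hypothetical placeholder.
func allowedDisruption(historicalP99 time.Duration, singleNode bool) time.Duration {
	grace := 0.20
	if singleNode {
		grace = 1.00
	}
	return historicalP99 + time.Duration(float64(historicalP99)*grace)
}

func main() {
	p99 := 9*time.Minute + 33*time.Second
	fmt.Println("HA maxAllowed: ", allowedDisruption(p99, false).Round(time.Second))
	fmt.Println("SNO maxAllowed:", allowedDisruption(p99, true).Round(time.Second))
}
```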