OCP Technical Release Team / TRT-2454

apiserver disruption tests on SNO failing when enabling a Feature Gate.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.21

      Since 4.21.0-0.nightly-2025-11-26-014720, the SNO upgrade job has been failing the disruption tests below. Example runs:
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-sin[...]-openshift-release-analysis-aggregator/1993563443491246080
      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-aws-ovn-sin[...]-openshift-release-analysis-aggregator/1993600285062205440 

      Failed tests and messages:

      : [Monitor:apiserver-external-availability][sig-api-machinery] disruption/cache-openshift-api apiserver/openshift-apiserver connection/new should be available throughout the test

      { backend-disruption-name/cache-openshift-api-new-connections connection/new disruption/openshift-tests was unreachable during disruption: for at least 11m45s (maxAllowed=11m28s): P99 from historical data for similar jobs over past 3 weeks: 9m33.51s added an additional 20% of grace

      : [Monitor:apiserver-external-availability][sig-api-machinery] disruption/cache-openshift-api apiserver/openshift-apiserver connection/reused should be available throughout the test

      { backend-disruption-name/cache-openshift-api-reused-connections connection/reused disruption/openshift-tests was unreachable during disruption: for at least 11m45s (maxAllowed=11m27s): P99 from historical data for similar jobs over past 3 weeks: 9m32.68s added an additional 20% of grace
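
      For reference, the maxAllowed values in the messages are simply the historical P99 plus the stated 20% grace. A minimal sketch of that arithmetic (illustrative only, not the origin implementation):

{code:go}
package main

import (
    "fmt"
    "time"
)

// allowedDisruption reproduces the arithmetic described in the failure
// messages above: the historical P99 for similar jobs plus 20% grace.
func allowedDisruption(historicalP99 time.Duration) time.Duration {
    return historicalP99 + historicalP99/5 // +20% grace
}

func main() {
    newConnP99 := 9*time.Minute + 33*time.Second + 510*time.Millisecond    // 9m33.51s
    reusedConnP99 := 9*time.Minute + 32*time.Second + 680*time.Millisecond // 9m32.68s

    fmt.Println(allowedDisruption(newConnP99).Truncate(time.Second))    // 11m28s
    fmt.Println(allowedDisruption(reusedConnP99).Truncate(time.Second)) // 11m27s
}
{code}

      The observed disruption of 11m45s exceeds both limits, which is why both tests fail.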

      Comparing the intervals between 4.21.0-0.nightly-2025-11-26-014720 and 4.21.0-0.nightly-2025-11-25-103346 (a good payload), there are clearly more apiserver operator unavailable events in the 4.21.0-0.nightly-2025-11-26-014720 run.

      The ImageStreamImportMode feature gate is newly enabled in 4.21.0-0.nightly-2025-11-26-014720. That change triggers a new deployment rollout for the apiserver operator, and because this is an SNO cluster, the extra rollout makes the disruption results worse.

      This is a consequence of the Feature Gate enablement that happens during the update, which causes one additional apiserver restart. It is expected to go away once the update no longer crosses the Feature Gate enablement.
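
      For reference, one way to confirm on which payload the gate flipped is to inspect the cluster FeatureGate status (e.g. oc get featuregate cluster -o yaml). Below is a hedged sketch using the OpenShift config clientset; the status layout it assumes comes from current releases, not from this ticket:

{code:go}
package main

import (
    "context"
    "fmt"

    configclient "github.com/openshift/client-go/config/clientset/versioned"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Load the kubeconfig of the cluster under test (~/.kube/config).
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    client := configclient.NewForConfigOrDie(cfg)

    // The cluster-scoped FeatureGate object is always named "cluster".
    fg, err := client.ConfigV1().FeatureGates().Get(context.TODO(), "cluster", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    for _, details := range fg.Status.FeatureGates {
        for _, gate := range details.Enabled {
            if string(gate.Name) == "ImageStreamImportMode" {
                fmt.Printf("ImageStreamImportMode is enabled for payload %s\n", details.Version)
            }
        }
    }
}
{code}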

      Per the Slack discussion:

      TRT is planning to just make the single-node check very forgiving.

      Single-node disruption monitoring barely makes any sense.

      We are going to extend the grace period to make the disruption tests on SNO clusters more forgiving.
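
      A minimal sketch of that direction, assuming the allowance stays "historical P99 plus grace" and simply applies a much larger grace factor on single-node topology (the function names and the factor are hypothetical, not the actual origin change):

{code:go}
package main

import (
    "fmt"
    "time"
)

// graceFactor returns the multiplier applied on top of the historical P99.
// 1.2 matches the 20% grace quoted in the failure messages; the single-node
// value is a placeholder for whatever "very forgiving" ends up meaning.
func graceFactor(singleNodeTopology bool) float64 {
    if singleNodeTopology {
        return 3.0 // hypothetical
    }
    return 1.2
}

func maxAllowedDisruption(historicalP99 time.Duration, singleNodeTopology bool) time.Duration {
    return time.Duration(float64(historicalP99) * graceFactor(singleNodeTopology))
}

func main() {
    p99 := 9*time.Minute + 33*time.Second
    fmt.Println(maxAllowedDisruption(p99, false).Truncate(time.Second)) // 11m27s
    fmt.Println(maxAllowedDisruption(p99, true).Truncate(time.Second))  // 28m39s
}
{code}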

       

              Assignee: Johnny Liu (jialiu@redhat.com)
              Reporter: Johnny Liu (jialiu@redhat.com)