Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-7405

[CVO] Race condition causes flakes in Admin ack gate test during upgrades

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Normal Normal
    • None
    • 4.11.z
    • None
    • Moderate
    • None
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      Various 4.10 -> 4.11 upgrade jobs are occasionally flaking with failures like:

      disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged
      
        {Feb 13 07:48:49.101: Error getting custer version, err=client rate limiter Wait returned an error: context canceled Failure Feb 13 07:48:49.101: Error getting custer version, err=client rate limiter Wait returned an error: context canceled
      
        github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.adminAckRequiredWithMessage({0x8fcd0e0?, 0xc001de5e80?}, 0xc002c94360?, {0xc00246f2b0, 0xcd})

       

      example occurrence | search.ci query

      Version-Release number of selected component (if applicable):

      registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2023-02-13-055847

      Actual results:

      Test fails:

      disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged
      

      Expected results:

      Test gives proper signal.

      Additional info:

      Preliminary investigation:

      1. During the upgrade, the admin gate check is executed every 10 minutes as a part of the "disruption testing"
      2. The framework signals that the upgrade is completed (="stop disruption") by closing the done channel on which the test cancels the context given to the check.
      3. When this cancellation arrives exactly while the once-per-10-minutes is in-flight, this may result in cancelling c.ConfigV1().ClusterVersions().Get(ctx, "version"...) which then returns an error and hard-fails the test (the getClusterVersion method is called multiple times in the check):

      func getClusterVersion(ctx context.Context, config *restclient.Config) *configv1.ClusterVersion {
      	c, err := configv1client.NewForConfig(config)
      	if err != nil {
      		framework.Failf("Error getting config, err=%v", err)
      	}
      	cv, err := c.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{})
      	if err != nil {
      		framework.Failf("Error getting custer version, err=%v", err)
      	}
      	return cv
      }
      

      Looking at the log from the example occurrence above:

      grep -i -e 'Gate ack' -e 'admin ack' -e 'completed upgrade' -e "Waiting for Upgradeable" -e "Disruption complete" -e "48:40.184" -e 'getting custer version' e2e.log
      902:STEP: Setting up admin ack test
      903:Feb 13 06:46:48.852: INFO: Admin ack test setup complete
      908:Feb 13 06:46:48.895: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates.
      1217:Feb 13 06:56:48.929: INFO: Gate ack-4.11-k not applicable to current version 4.10.52
      1218:Feb 13 06:56:48.929: INFO: Admin Ack verified
      1279:Feb 13 07:06:48.960: INFO: Gate ack-4.11-k not applicable to current version 4.10.52
      1280:Feb 13 07:06:48.960: INFO: Admin Ack verified
      1447:Feb 13 07:16:48.998: INFO: Gate ack-4.11-k not applicable to current version 4.10.52
      1448:Feb 13 07:16:48.998: INFO: Admin Ack verified
      1521:Feb 13 07:26:49.029: INFO: Gate ack-4.11-k not applicable to current version 4.10.52
      1522:Feb 13 07:26:49.029: INFO: Admin Ack verified
      1591:Feb 13 07:36:49.058: INFO: Gate ack-4.11-k not applicable to current version 4.10.52
      1592:Feb 13 07:36:49.058: INFO: Admin Ack verified
      1649:Feb 13 07:46:17.934: INFO: Completed upgrade to registry.build03.ci.openshift.org/ci-op-6r3md4kq/release@sha256:1a8b9b9e6a5a39e72b80c63e566fffe6dcdeb11c14199d28b3593c251b782216
      1655:Feb 13 07:46:49.091: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ...
      1662:Feb 13 07:47:58.052: INFO: All pools completed upgrade
      1828:Feb 13 07:48:40.184: INFO: Cluster operators:
      1866:STEP: Disruption complete; stopping async validations
      1895:Feb 13 07:48:40.184: INFO: Checking for object unstructured monitoring.coreos.com/v1, prometheusrules, openshift-kube-apiserver/kube-apiserver
      1929:Feb 13 07:48:49.101: FAIL: Error getting custer version, err=client rate limiter Wait returned an error: context canceled
      
      • Observe the 10m cadence between the checks (the "Admin Ack verified") messages
      • 7:46:17: The upgrade completes
      • 7:46:49: Another check starts at ("Waiting for Upgradeable..."). This check logs differently because the cluster is now at 4.11 and the ack-4.11-k gate now applies to it, triggering a more sophisticated code path that interacts with the cluster and waits until expected conditions hold
      • 7:48:40: The disruption suite starts terminating (the framework does not log time for this message, so I've included the nearest timestamped log-lines that bound it)
      • 7:48:49: The check that was still in-flight failed to get the CV from the cluster because the context is canceled.

              lmohanty@redhat.com Lalatendu Mohanty
              afri@afri.cz Petr Muller
              Evgeni Vakhonin Evgeni Vakhonin
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated: