-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.11.z
-
None
-
Moderate
-
None
-
False
-
Description of problem:
Various 4.10 -> 4.11 upgrade jobs are occasionally flaking with failures like:
disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged {Feb 13 07:48:49.101: Error getting custer version, err=client rate limiter Wait returned an error: context canceled Failure Feb 13 07:48:49.101: Error getting custer version, err=client rate limiter Wait returned an error: context canceled github.com/openshift/origin/test/extended/util/openshift/clusterversionoperator.adminAckRequiredWithMessage({0x8fcd0e0?, 0xc001de5e80?}, 0xc002c94360?, {0xc00246f2b0, 0xcd})
example occurrence | search.ci query
Version-Release number of selected component (if applicable):
registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2023-02-13-055847
Actual results:
Test fails:
disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged
Expected results:
Test gives proper signal.
Additional info:
Preliminary investigation:
1. During the upgrade, the admin gate check is executed every 10 minutes as a part of the "disruption testing"
2. The framework signals that the upgrade is completed (="stop disruption") by closing the done channel on which the test cancels the context given to the check.
3. When this cancellation arrives exactly while the once-per-10-minutes is in-flight, this may result in cancelling c.ConfigV1().ClusterVersions().Get(ctx, "version"...) which then returns an error and hard-fails the test (the getClusterVersion method is called multiple times in the check):
func getClusterVersion(ctx context.Context, config *restclient.Config) *configv1.ClusterVersion { c, err := configv1client.NewForConfig(config) if err != nil { framework.Failf("Error getting config, err=%v", err) } cv, err := c.ConfigV1().ClusterVersions().Get(ctx, "version", metav1.GetOptions{}) if err != nil { framework.Failf("Error getting custer version, err=%v", err) } return cv }
Looking at the log from the example occurrence above:
grep -i -e 'Gate ack' -e 'admin ack' -e 'completed upgrade' -e "Waiting for Upgradeable" -e "Disruption complete" -e "48:40.184" -e 'getting custer version' e2e.log 902:STEP: Setting up admin ack test 903:Feb 13 06:46:48.852: INFO: Admin ack test setup complete 908:Feb 13 06:46:48.895: INFO: Skipping admin ack test. Admin ack is not in this baseline or contains no gates. 1217:Feb 13 06:56:48.929: INFO: Gate ack-4.11-k not applicable to current version 4.10.52 1218:Feb 13 06:56:48.929: INFO: Admin Ack verified 1279:Feb 13 07:06:48.960: INFO: Gate ack-4.11-k not applicable to current version 4.10.52 1280:Feb 13 07:06:48.960: INFO: Admin Ack verified 1447:Feb 13 07:16:48.998: INFO: Gate ack-4.11-k not applicable to current version 4.10.52 1448:Feb 13 07:16:48.998: INFO: Admin Ack verified 1521:Feb 13 07:26:49.029: INFO: Gate ack-4.11-k not applicable to current version 4.10.52 1522:Feb 13 07:26:49.029: INFO: Admin Ack verified 1591:Feb 13 07:36:49.058: INFO: Gate ack-4.11-k not applicable to current version 4.10.52 1592:Feb 13 07:36:49.058: INFO: Admin Ack verified 1649:Feb 13 07:46:17.934: INFO: Completed upgrade to registry.build03.ci.openshift.org/ci-op-6r3md4kq/release@sha256:1a8b9b9e6a5a39e72b80c63e566fffe6dcdeb11c14199d28b3593c251b782216 1655:Feb 13 07:46:49.091: INFO: Waiting for Upgradeable to be AdminAckRequired for "Kubernetes 1.25 and therefore OpenShift 4.12 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6955381 for details and instructions." ... 1662:Feb 13 07:47:58.052: INFO: All pools completed upgrade 1828:Feb 13 07:48:40.184: INFO: Cluster operators: 1866:STEP: Disruption complete; stopping async validations 1895:Feb 13 07:48:40.184: INFO: Checking for object unstructured monitoring.coreos.com/v1, prometheusrules, openshift-kube-apiserver/kube-apiserver 1929:Feb 13 07:48:49.101: FAIL: Error getting custer version, err=client rate limiter Wait returned an error: context canceled
- Observe the 10m cadence between the checks (the "Admin Ack verified") messages
- 7:46:17: The upgrade completes
- 7:46:49: Another check starts at ("Waiting for Upgradeable..."). This check logs differently because the cluster is now at 4.11 and the ack-4.11-k gate now applies to it, triggering a more sophisticated code path that interacts with the cluster and waits until expected conditions hold
- 7:48:40: The disruption suite starts terminating (the framework does not log time for this message, so I've included the nearest timestamped log-lines that bound it)
- 7:48:49: The check that was still in-flight failed to get the CV from the cluster because the context is canceled.
- is triggered by
-
OTA-899 Add OCP 4.12-to-4.13 admin ack
- Closed
- relates to
-
OCPBUGS-7766 admin ack test sometimes fails because upgradable=false shows up too slowly after upgrade
- New