-
Bug
-
Resolution: Done-Errata
-
Major
-
4.12.z
-
Moderate
-
No
-
5
-
OTA 243, OTA 244, OTA 246, OTA 247, OTA 248
-
5
-
Rejected
-
False
-
-
Release Note Not Required
-
In Progress
Description of problem:
After the fix for OCPBUGSM-44759, we put timeouts on payload retrieval operations (verification and download); previously they were uncapped and under certain network circumstances could take hours to terminate. Testing the fix uncovered a problem where, after CVO passes the path with the timeouts, CVO starts logging errors for the core manifest reconciliation loop:
I0208 11:22:57.107819 1 sync_worker.go:993] Running sync for role "openshift-marketplace/marketplace-operator" (648 of 834)
I0208 11:22:57.107887 1 task_graph.go:474] Canceled worker 1 while waiting for work
I0208 11:22:57.107900 1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
I0208 11:22:57.107911 1 task_graph.go:474] Canceled worker 0 while waiting for work
I0208 11:22:57.107918 1 task_graph.go:523] Workers finished
I0208 11:22:57.107925 1 task_graph.go:546] Result of work: [update context deadline exceeded at 8 of 834 Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)]
I0208 11:22:57.107938 1 sync_worker.go:1169] Summarizing 1 errors
I0208 11:22:57.107947 1 sync_worker.go:1173] Update error 648 of 834: UpdatePayloadFailed Could not update role "openshift-marketplace/marketplace-operator" (648 of 834) (context.deadlineExceededError: context deadline exceeded)
E0208 11:22:57.107966 1 sync_worker.go:654] unable to synchronize image (waiting 3m39.457405047s): Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)
This is caused by locks. The SyncWorker.Update method acquires its lock for its whole duration. The payloadRetriever.RetrievePayload method is called inside SyncWorker.Update, on the following call chain:
SyncWorker.Update -> SyncWorker.loadUpdatedPayload -> SyncWorker.syncPayload -> payloadRetriever.RetrievePayload
RetrievePayload can take 2 or 4 minutes before it times out, so CVO holds the lock for this whole wait.
The manifest reconciliation loop is implemented in the apply method. The whole apply method is bounded by a timeout context set to 2 * the minimum reconcile interval, which works out to a value between 4 and 8 minutes. In reconciling mode, the manifest graph is split into multiple "tasks", and smaller sequences of these tasks are applied in parallel. Each sequence iterates over its tasks, and every iteration uses a consistentReporter to report status via its Update method, which also acquires the lock on the following call sequence:
SyncWorker.apply -> { for _, task := range tasks ... -> consistentReporter.Update -> statusWrapper.Report -> SyncWorker.updateApplyStatus ->
This leads to the following sequence:
1. apply is called with a timeout between 4 and 8 minutes
2. in parallel, SyncWorker.Update starts and acquires the lock
3. tasks under apply wait on the reporter to acquire lock
4. after 2 or 4 minutes, RetrievePayload under SyncWorker.Update times out and terminates; SyncWorker.Update terminates and releases the lock
5. tasks under apply report results after briefly acquiring the lock, start to do their thing
6. in parallel, SyncWorker.Update starts again and acquires the lock
7. further iterations over tasks under apply wait on the reporter to acquire lock
8. context passed to apply times out
9. Canceled worker 0 while waiting for work... errors
Version-Release number of selected component (if applicable):
4.13.0-0.ci.test-2023-02-06-062603 with https://github.com/openshift/cluster-version-operator/pull/896
How reproducible:
Always, in certain cluster configurations
Steps to Reproduce:
1. in a disconnected cluster, upgrade to an unreachable payload image with --force
2. observe the CVO log
Actual results:
CVO starts to fail reconciling manifests
Expected results:
no failures, cluster continues to try retrieving the image but no interference with manifest reconciliation
Additional info:
This problem was discovered by evakhoni@redhat.com while testing the fix for OCPBUGSM-44759: https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22
https://github.com/openshift/cluster-version-operator/pull/896 uncovers this issue but still gets CVO into a better shape - previously the RetrievePayload could be running for a much longer time (hours), preventing the CVO from working at all.
When the cluster gets into this buggy state, the solution is to abort the upgrade that fails to verify or download.
- is duplicated by
-
OCPBUGS-15871 Could not update clusterrolebinding "csi-snapshot-controller-runner-operator" blocks 4.12.23 arm64 upgrade to 4.13.4 arm64
- Closed
- is related to
-
OCPBUGS-22266 OpenShift 4.14 Upgrade with baselineCapabilties: None leaves cluster operators behind on lower versions
- ASSIGNED
- links to
-
RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update