OCPBUGS-7714

Holding lock during slow payload verification/download interferes with reconciling manifests



      Description of problem:

      After the fix for OCPBUGSM-44759, we put timeouts on payload retrieval operations (verification and download); previously they were uncapped and under certain network circumstances could take hours to terminate. Testing the fix uncovered a problem: once CVO enters the code path with these timeouts, it starts logging errors from the core manifest reconciliation loop:

      I0208 11:22:57.107819       1 sync_worker.go:993] Running sync for role "openshift-marketplace/marketplace-operator" (648 of 834)
      I0208 11:22:57.107887       1 task_graph.go:474] Canceled worker 1 while waiting for work
      I0208 11:22:57.107900       1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
      I0208 11:22:57.107911       1 task_graph.go:474] Canceled worker 0 while waiting for work                                                                                                                                                                                                                              
      I0208 11:22:57.107918       1 task_graph.go:523] Workers finished
      I0208 11:22:57.107925       1 task_graph.go:546] Result of work: [update context deadline exceeded at 8 of 834 Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)]
      I0208 11:22:57.107938       1 sync_worker.go:1169] Summarizing 1 errors
      I0208 11:22:57.107947       1 sync_worker.go:1173] Update error 648 of 834: UpdatePayloadFailed Could not update role "openshift-marketplace/marketplace-operator" (648 of 834) (context.deadlineExceededError: context deadline exceeded)
      E0208 11:22:57.107966       1 sync_worker.go:654] unable to synchronize image (waiting 3m39.457405047s): Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)
      

      This is caused by lock contention. The SyncWorker.Update method acquires its lock for its whole duration. The payloadRetriever.RetrievePayload method is called inside SyncWorker.Update, on the following call chain:

      SyncWorker.Update ->
        SyncWorker.loadUpdatedPayload ->
          SyncWorker.syncPayload ->
            payloadRetriever.RetrievePayload
      

      RetrievePayload can take 2 or 4 minutes before it times out, so CVO holds the lock for this whole wait.
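
      The shape of the problem can be shown with a short, runnable Go sketch. Everything here is a hypothetical simplification (the real types live in the cluster-version-operator repository, e.g. pkg/cvo/sync_worker.go); only the locking structure is meant to match:

      package main

      import (
          "context"
          "fmt"
          "sync"
          "time"
      )

      // SyncWorker is a heavily simplified stand-in for the real struct;
      // only the single mutex matters for this demonstration.
      type SyncWorker struct {
          lock sync.Mutex
      }

      // retrievePayload stands in for payloadRetriever.RetrievePayload against
      // an unreachable registry: it blocks until its context deadline fires.
      func retrievePayload(ctx context.Context) error {
          <-ctx.Done()
          return ctx.Err()
      }

      // Update mirrors the problematic shape: the lock is acquired for the
      // whole call, so it stays held across the slow retrieval (2 or 4
      // minutes in the real CVO; two seconds here to keep the demo short).
      func (w *SyncWorker) Update(ctx context.Context) error {
          w.lock.Lock()
          defer w.lock.Unlock()

          retrieveCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
          defer cancel()
          return retrievePayload(retrieveCtx) // lock held for this entire wait
      }

      func main() {
          w := &SyncWorker{}
          go w.Update(context.Background())

          time.Sleep(50 * time.Millisecond) // let Update win the lock first
          start := time.Now()
          w.lock.Lock() // a reporter-style caller blocks here for ~2 seconds
          fmt.Printf("lock acquired after %v\n", time.Since(start))
          w.lock.Unlock()
      }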

      The manifest reconciliation loop is implemented in the apply method. The whole apply method is bounded by a timeout context set to twice the minimum reconcile interval, so it resolves to a value between 4 and 8 minutes. In reconciling mode, the manifest graph is split into multiple "tasks", and smaller sequences of these tasks are applied in parallel. The tasks in each sequence are iterated over, and each iteration uses a consistentReporter to report status via its Update method, which also acquires the lock, on the following call sequence (sketched below):

      SyncWorker.apply ->
        { for _, task := range tasks ... } ->
          consistentReporter.Update ->
            statusWrapper.Report ->
              SyncWorker.updateApplyStatus
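
      Extending the same hypothetical model, the apply side looks roughly like this: every task iteration reports status through a method that needs the lock Update is holding while it waits inside RetrievePayload, so the whole loop stalls until Update returns.

      // updateApplyStatus models the end of the reporting chain above: it
      // needs the same lock that Update holds while parked in RetrievePayload.
      func (w *SyncWorker) updateApplyStatus(status string) {
          w.lock.Lock() // blocks for as long as Update owns the lock
          defer w.lock.Unlock()
          // ... record the status on the worker (elided) ...
      }

      // apply models the manifest reconciliation loop: it iterates over tasks
      // under the 4-8 minute deadline, reporting status on every iteration.
      func (w *SyncWorker) apply(ctx context.Context, tasks []string) error {
          for _, task := range tasks {
              if err := ctx.Err(); err != nil {
                  return err // the apply deadline fires while stuck on the lock
              }
              w.updateApplyStatus("syncing " + task)
              // ... actually push the manifest for this task (elided) ...
          }
          return nil
      }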
      

      This leads to the following sequence:

      1. apply is called with a timeout between 4 and 8 minutes
      2. in parallel, SyncWorker.Update starts and acquires the lock
      3. tasks under apply block, waiting for the reporter to acquire the lock
      4. after 2 or 4 minutes, RetrievePayload under SyncWorker.Update times out and terminates; SyncWorker.Update returns and releases the lock
      5. tasks under apply briefly acquire the lock to report results, then resume their work
      6. in parallel, SyncWorker.Update starts again and acquires the lock
      7. further task iterations under apply again block on the reporter waiting for the lock
      8. the context passed to apply times out
      9. the Canceled worker 0 while waiting for work... errors above appear in the log
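
      With these timings, a single SyncWorker.Update attempt can hold the lock for 2 to 4 minutes, so two back-to-back attempts are enough to exhaust the entire 4-8 minute apply window. Below is a minimal sketch of one common remediation pattern, again extending the hypothetical model above (not necessarily the fix adopted upstream): perform the slow retrieval outside the critical section, taking the lock only briefly to read inputs and to publish the result.

      // updateWithoutHoldingLock is a hypothetical reworking of Update: the
      // lock bounds only the fast read/publish steps, not the slow retrieval.
      func (w *SyncWorker) updateWithoutHoldingLock(ctx context.Context) error {
          w.lock.Lock()
          // ... copy whatever inputs the retrieval needs (elided) ...
          w.lock.Unlock()

          retrieveCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
          defer cancel()
          err := retrievePayload(retrieveCtx) // lock NOT held during this wait

          w.lock.Lock()
          defer w.lock.Unlock()
          // ... publish the retrieval outcome to the worker state (elided) ...
          return err
      }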

      Version-Release number of selected component (if applicable):

      4.13.0-0.ci.test-2023-02-06-062603 with https://github.com/openshift/cluster-version-operator/pull/896

      How reproducible:

      Always, in certain cluster configurations.

      Steps to Reproduce:

      1. in a disconnected cluster, upgrade to an unreachable payload image with --force
      2. observe the CVO log
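
      For instance (an illustrative invocation only; the pull spec below is a hypothetical placeholder, and forcing past verification should only be done for testing):

      $ oc adm upgrade --allow-explicit-upgrade --force \
            --to-image=registry.example.com/ocp-release@sha256:<digest>
      $ oc logs -f -n openshift-cluster-version deployment/cluster-version-operator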

      Actual results:

      CVO starts failing to reconcile manifests

      Expected results:

      No failures; the cluster continues trying to retrieve the image, but without interfering with manifest reconciliation

      Additional info:

      This problem was discovered by evakhoni@redhat.com while testing the fix for OCPBUGSM-44759: https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c22

      https://github.com/openshift/cluster-version-operator/pull/896 uncovers this issue but still leaves CVO in a better shape than before: previously, RetrievePayload could run for much longer (hours), preventing the CVO from working at all.

      When the cluster gets into this buggy state, the workaround is to abort the upgrade that is failing to verify or download.
