OpenShift Bugs / OCPBUGS-45095

[LSO] optimize requeue time when LV / LVS is deleted


    • Low
    • No
    • False
    • NA
    • Release Note Not Required
    • In Progress

      Description of problem:

      The LSO diskmaker controllers requeue the Reconcile loop with the fast requeue interval (5 seconds) when the LV or LVS has a deletionTimestamp:
      
      https://github.com/openshift/local-storage-operator/blob/4a83c5239a57ff5e2c4053d8dcbd95e5f855a75a/pkg/diskmaker/controllers/lv/reconcile.go#L347
      https://github.com/openshift/local-storage-operator/blob/4a83c5239a57ff5e2c4053d8dcbd95e5f855a75a/pkg/diskmaker/controllers/lvset/reconcile.go#L133 
      
      This was to solve a problem where the PV did not get deleted after:
      
      1) Create localvolumeset
      2) Create pvc/pod, pv is provisioned
      3) Delete localvolumeset
      4) Delete pod/pvc
      
      The problem was that the PV would transition to Released, which triggers the reconcile loop, and DeletePVs() would be called once to start the cleanup job. After that, no further events arrived to trigger reconcile again. DeletePVs() must be called at least twice: once to start the cleanup job, and again to remove the completed cleanup job and delete the PV. That is why we need a fast requeue while a cleanup job is in progress, so that PV deletion is not held up indefinitely.
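
The two-pass behavior described above can be modeled with a small, self-contained sketch. This is not the actual DeletePVs() implementation from local-storage-operator; the types `pv` and `jobState` and the function `deletePass` are hypothetical names, used only to show why a single reconcile pass cannot finish the deletion.

```go
package main

import "fmt"

// jobState is a hypothetical stand-in for the status of a PV cleanup job.
type jobState int

const (
	jobNone      jobState = iota // no cleanup job exists yet
	jobRunning                   // cleanup job was started and has not finished
	jobCompleted                 // cleanup job finished successfully
)

// pv is a simplified, hypothetical model of a Released PersistentVolume.
type pv struct {
	name    string
	job     jobState
	deleted bool
}

// deletePass models one reconcile pass over a Released PV.
// It returns true when at least one more pass is needed.
func deletePass(p *pv) bool {
	switch p.job {
	case jobNone:
		// First pass: start the cleanup job. The PV cannot be deleted yet.
		p.job = jobRunning
		return true
	case jobRunning:
		// The job is still running. In this model nothing triggers another
		// pass on its own, so the caller has to requeue.
		return true
	case jobCompleted:
		// Later pass: remove the completed job and delete the PV.
		p.job = jobNone
		p.deleted = true
		return false
	}
	return false
}

func main() {
	p := &pv{name: "local-pv-example"}

	needMore := deletePass(p) // pass 1 starts the cleanup job
	fmt.Println("after pass 1: requeue =", needMore, "deleted =", p.deleted)

	p.job = jobCompleted // simulate the job finishing between passes

	needMore = deletePass(p) // pass 2 removes the job and deletes the PV
	fmt.Println("after pass 2: requeue =", needMore, "deleted =", p.deleted)
}
```

In this model, the second pass only happens if something schedules it, which is the role the fast requeue plays while a cleanup job is in flight.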
      
      However, running the Reconcile loop every 5 seconds is too aggressive if, for example, there is still a Bound PV. In that case the fast requeue does not help; it just wastes CPU cycles. We will get an event to trigger reconcile if and when that PV changes to Released and needs to be cleaned.
      
      One way I see to optimize this is to use the fast requeue time (5 seconds) only if at least one PV is in the Released state, which implies a cleanup job may be in progress. Otherwise, use the default requeue time (60 seconds) and rely on events to trigger reconcile when the PV changes state.
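
A minimal sketch of the proposed selection logic, assuming the reconciler already has the list of PV phases for the LV or LVS being deleted. The names `pvPhase`, `requeueAfter`, `fastRequeue`, and `defaultRequeue` are hypothetical and are not taken from the operator's code; only the 5-second and 60-second values come from this report. A local string type is used in place of the Kubernetes API types to keep the example free of dependencies.

```go
package main

import (
	"fmt"
	"time"
)

// Requeue intervals as described in this report (names are hypothetical).
const (
	fastRequeue    = 5 * time.Second
	defaultRequeue = 60 * time.Second
)

// pvPhase is a local stand-in for the PersistentVolume phase string.
type pvPhase string

const (
	phaseAvailable pvPhase = "Available"
	phaseBound     pvPhase = "Bound"
	phaseReleased  pvPhase = "Released"
)

// requeueAfter returns the fast interval only when at least one PV is
// Released, meaning a cleanup job may be in progress. Otherwise it returns
// the default interval and leaves it to events to trigger reconcile.
func requeueAfter(phases []pvPhase) time.Duration {
	for _, ph := range phases {
		if ph == phaseReleased {
			return fastRequeue
		}
	}
	return defaultRequeue
}

func main() {
	// Only a Bound PV remains: no cleanup job can be running.
	fmt.Println(requeueAfter([]pvPhase{phaseBound})) // prints 1m0s

	// One PV has been Released: a cleanup job may be in progress.
	fmt.Println(requeueAfter([]pvPhase{phaseBound, phaseReleased})) // prints 5s
}
```

With no PVs at all, this sketch also falls back to the default interval, since there is nothing to clean up.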

      Version-Release number of selected component (if applicable):

          4.18.0

      How reproducible:

          

      Steps to Reproduce:

          1. Create LocalVolume
          2. Create PVC / Pod
          3. Delete LocalVolume
          4. Observe logs while PV is still Bound
          

      Actual results:

          Fast requeue (5 sec) loop while PV is Bound

      Expected results:

          Default requeue (60 sec) loop while PV is Bound

      Additional info:

          

              jdobson@redhat.com Jonathan Dobson
              jdobson@redhat.com Jonathan Dobson
              Wei Duan