  1. OpenShift API for Data Protection
  2. OADP-7419

IBU upgrade delayed due to OADP controller crash waiting for CRDs


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • None
    • OADP 1.4.7
    • oadp-operator
    • None
    • Quality / Stability / Reliability
    • 3
    • False
    • False
    • ToDo
    • Very Likely
    • 0
    • None
    • Unset
    • Unknown
    • None

      I've been analyzing the IBU (Image-Based Upgrade) process on SNO clusters upgrading from OCP 4.16 to 4.18 using OADP 1.4, and found an issue that causes a significant delay (~8 minutes) in the DPA reconciliation after the node reboots.

      During an IBU upgrade, after the node reboots into the new OS version, the LCA (Lifecycle Agent) waits for the cluster to stabilize before proceeding. One of its health checks validates that the DPA has status.conditions[type="Reconciled", status="True"].
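
      For illustration, the condition check described above can be sketched in a few lines. This is not the LCA's actual code; the function name and the trimmed-down DPA objects are hypothetical, and only the condition type and status come from the description above.

```python
def is_dpa_reconciled(dpa: dict) -> bool:
    """Return True if the object has a condition with type "Reconciled" and status "True"."""
    # status or conditions may be absent (or null) right after a reboot
    status = dpa.get("status") or {}
    conditions = status.get("conditions") or []
    return any(
        cond.get("type") == "Reconciled" and cond.get("status") == "True"
        for cond in conditions
    )


# Hypothetical, heavily trimmed DPA objects, for illustration only.
dpa_not_ready = {"status": {"conditions": []}}
dpa_ready = {"status": {"conditions": [{"type": "Reconciled", "status": "True"}]}}

print(is_dpa_reconciled(dpa_not_ready), is_dpa_reconciled(dpa_ready))
```

      Until the OADP controller completes a reconcile, a check of this shape keeps returning False, which is why the controller delay translates directly into upgrade delay.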

      What we observed:
      1. After reboot, the OADP controller starts and acquires the leader lease
      2. It immediately tries to sync caches for Route and SecurityContextConstraints CRDs, but they're not yet available (OpenShift API is still initializing)
      3. After 2 minutes of retries, the controller crashes with:

         ERROR: "failed to wait for dataprotectionapplication caches to sync: timed out waiting for cache to be synced for Kind *v1.SecurityContextConstraints"
      

      4. OpenShift restarts the container, but the previous lease is still held (270s duration for SNO per LeaderElectionSNOConfig)
      5. The new instance must wait ~5.5 minutes for the lease to expire before it can reconcile
      6. Total delay from reboot to DPA Reconciled=True: ~8 minutes

      Evidence from logs:

      # First instance starts and acquires lease (14:43:45)
      14:43:45Z successfully acquired lease openshift-adp/oadp.openshift.io
      
      # Repeated errors every 10s - CRDs not available
      14:43:45Z ERROR "no matches for kind \"SecurityContextConstraints\" in group \"security.openshift.io\""
      14:43:55Z ERROR "no matches for kind \"Route\" in group \"route.openshift.io\""
      ... (continues every 10s for 2 minutes)
      
      # Crash after 2 minute timeout (14:45:46)
      14:45:46Z ERROR "failed to wait for dataprotectionapplication caches to sync: timed out waiting for cache to be synced"
      
      # New instance starts but must wait for old lease to expire
      14:45:46Z attempting to acquire leader lease openshift-adp/oadp.openshift.io...
      
      # Finally acquires lease after ~6 minutes (14:51:53)
      14:51:53Z successfully acquired lease
      14:51:58Z DPA Reconciled=True
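
      To make the timeline easier to check, the intervals can be recomputed from the four timestamps in the log excerpt above. This is a small sketch; the event names are labels chosen here for readability, not identifiers from the operator.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# Timestamps copied from the log excerpt; the keys are illustrative labels.
events = {
    "lease_acquired": "14:43:45",
    "crash": "14:45:46",
    "lease_reacquired": "14:51:53",
    "dpa_reconciled": "14:51:58",
}


def seconds_between(start: str, end: str) -> int:
    """Whole seconds elapsed between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


cache_sync_wait = seconds_between(events["lease_acquired"], events["crash"])
lease_wait = seconds_between(events["crash"], events["lease_reacquired"])
total = seconds_between(events["lease_acquired"], events["dpa_reconciled"])

print(f"cache sync wait before crash: {cache_sync_wait}s")   # 121s
print(f"wait for lease after restart: {lease_wait}s")        # 367s
print(f"first lease to Reconciled=True: {total}s")           # 493s
```

      Of the roughly 8 minutes measured from the first lease acquisition, about 2 are spent waiting for the cache sync to time out and about 6 waiting for the stale lease.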
      

      Given that SNO clusters using IBU will experience node reboots where the OpenShift API takes time to fully initialize, are there any improvements being considered for this scenario? The crash due to missing CRDs followed by the lease wait time significantly impacts the upgrade duration. Thanks!

              spampatt@redhat.com Shubham Pampattiwar
              dmunneor1@redhat.com Daniel Munne Ortega