-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
OADP 1.4.7
-
None
-
Quality / Stability / Reliability
I've been analyzing the IBU (Image-Based Upgrade) process on SNO clusters upgrading from OCP 4.16 to 4.18 using OADP 1.4, and found an issue that causes a significant delay (~8 minutes) in the DPA reconciliation after the node reboots.
During an IBU, after the node reboots into the new OS version, the LCA (Lifecycle Agent) waits for the cluster to stabilize before proceeding. One of its health checks validates that the DPA (DataProtectionApplication) has a condition in status.conditions with type="Reconciled" and status="True".
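For reference, the condition the health check looks for can be sketched as follows. This is a minimal illustration only, not the LCA's actual code: the `Condition` struct is a hypothetical, simplified stand-in for the entries in status.conditions, and `isReconciled` is a name made up for this example.

```go
package main

import "fmt"

// Condition is a hypothetical, simplified stand-in for one entry in the
// DPA's status.conditions list. Only the two fields the check needs are kept.
type Condition struct {
	Type   string
	Status string
}

// isReconciled reports whether the list contains a condition with
// type "Reconciled" and status "True".
func isReconciled(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Reconciled" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	// Right after the reboot the DPA has not been reconciled yet.
	fmt.Println(isReconciled(nil))
	// Once the OADP controller reconciles, the check passes.
	fmt.Println(isReconciled([]Condition{{Type: "Reconciled", Status: "True"}}))
}
```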
What we observed:
1. After reboot, the OADP controller starts and acquires the leader lease
2. It immediately tries to sync caches for Route and SecurityContextConstraints CRDs, but they're not yet available (OpenShift API is still initializing)
3. After 2 minutes of retries, the controller crashes with:
ERROR: "failed to wait for dataprotectionapplication caches to sync: timed out waiting for cache to be synced for Kind *v1.SecurityContextConstraints"
4. OpenShift restarts the container, but the previous lease is still held (270s duration for SNO per LeaderElectionSNOConfig)
5. The new instance must wait for the old lease to expire before it can reconcile (~6 minutes in the logs below, from 14:45:46 to 14:51:53)
6. Total delay from reboot to DPA Reconciled=True: ~8 minutes
Evidence from logs:
# First instance starts and acquires lease (14:43:45)
14:43:45Z successfully acquired lease openshift-adp/oadp.openshift.io
# Repeated errors every 10s - CRDs not available
14:43:45Z ERROR "no matches for kind \"SecurityContextConstraints\" in group \"security.openshift.io\""
14:43:55Z ERROR "no matches for kind \"Route\" in group \"route.openshift.io\""
... (continues every 10s for 2 minutes)
# Crash after 2 minute timeout (14:45:46)
14:45:46Z ERROR "failed to wait for dataprotectionapplication caches to sync: timed out waiting for cache to be synced"
# New instance starts but must wait for old lease to expire
14:45:46Z attempting to acquire leader lease openshift-adp/oadp.openshift.io...
# Finally acquires lease after ~6 minutes (14:51:53)
14:51:53Z successfully acquired lease
14:51:58Z DPA Reconciled=True
Given that SNO clusters using IBU will experience node reboots where the OpenShift API takes time to fully initialize, are there any improvements being considered for this scenario? The crash due to missing CRDs followed by the lease wait time significantly impacts the upgrade duration. Thanks!