Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: OADP 1.4.7
Component/s: oadp-operator
Labels:
None

Activity Type:
Quality / Stability / Reliability
Workstream:

None
Story Points:
3
Blocked:
False
Blocked Reason:

None
Ready:
False
QEStatus:
ToDo

Risk Probability:
Very Likely
Risk Score:
0

Root Cause:
Unset
Failure Category:
Unknown

Regression:
None

I've been analyzing the IBU (Image-Based Upgrade) process on SNO clusters upgrading from OCP 4.16 to 4.18 with OADP 1.4, and found an issue that delays DPA (DataProtectionApplication) reconciliation by about 8 minutes after the node reboots.

During an IBU upgrade, after the node reboots into the new OS version, the LCA (Lifecycle Agent) waits for the cluster to stabilize before proceeding. One of its health checks validates that the DPA has status.conditions[type="Reconciled", status="True"].
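
To make the health check concrete, here is a minimal Go sketch of the condition test described above. The `Condition` type and the `isReconciled` helper are hypothetical names used only for illustration; this is not the LCA's actual code, just the same predicate expressed over a simplified condition list.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a Kubernetes
// status condition; only the two fields the health check looks at are kept.
type Condition struct {
	Type   string
	Status string
}

// isReconciled reports whether the list contains a condition with
// type "Reconciled" and status "True".
func isReconciled(conds []Condition) bool {
	for _, c := range conds {
		if c.Type == "Reconciled" && c.Status == "True" {
			return true
		}
	}
	return false
}

func main() {
	afterCrash := []Condition{{Type: "Reconciled", Status: "False"}}
	afterRetry := []Condition{{Type: "Reconciled", Status: "True"}}
	fmt.Println("after crash:", isReconciled(afterCrash))
	fmt.Println("after retry:", isReconciled(afterRetry))
}
```

Until this predicate holds, the upgrade cannot move past the stabilization step, so any delay in DPA reconciliation adds directly to the upgrade duration.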

What we observed:
1. After reboot, the OADP controller starts and acquires the leader lease
2. It immediately tries to sync caches for Route and SecurityContextConstraints CRDs, but they're not yet available (OpenShift API is still initializing)
3. After 2 minutes of retries, the controller crashes with:

   ERROR: "failed to wait for dataprotectionapplication caches to sync: timed out waiting for cache to be synced for Kind *v1.SecurityContextConstraints"

4. OpenShift restarts the container, but the previous lease is still held (270s duration for SNO per LeaderElectionSNOConfig)
5. The new instance must wait for the old lease to expire before it can reconcile (about 6 minutes in the logs below, 14:45:46 to 14:51:53)
6. Total delay from reboot to DPA Reconciled=True: ~8 minutes

Evidence from logs:

# First instance starts and acquires lease (14:43:45)
14:43:45Z successfully acquired lease openshift-adp/oadp.openshift.io

# Repeated errors every 10s - CRDs not available
14:43:45Z ERROR "no matches for kind \"SecurityContextConstraints\" in group \"security.openshift.io\""
14:43:55Z ERROR "no matches for kind \"Route\" in group \"route.openshift.io\""
... (continues every 10s for 2 minutes)

# Crash after 2 minute timeout (14:45:46)
14:45:46Z ERROR "failed to wait for dataprotectionapplication caches to sync: timed out waiting for cache to be synced"

# New instance starts but must wait for old lease to expire
14:45:46Z attempting to acquire leader lease openshift-adp/oadp.openshift.io...

# Finally acquires lease after ~6 minutes (14:51:53)
14:51:53Z successfully acquired lease
14:51:58Z DPA Reconciled=True

Given that SNO clusters using IBU will experience node reboots where the OpenShift API takes time to fully initialize, are there any improvements being considered for this scenario? The crash due to missing CRDs followed by the lease wait time significantly impacts the upgrade duration. Thanks!

is cloned by

OADP-7508 IBU upgrade delayed due to OADP controller crash waiting for CRDs

ON_QA

links to

[oadp-1.5] Fix IBU delay on SNO by waiting for CRDs and disabling leader election #2085

openshift/oadp-operator#2082: Fix IBU delay on SNO by waiting for CRDs and disabling leader election

openshift/oadp-operator#2086: [oadp-1.4] Fix IBU delay on SNO by waiting for CRDs and disabling leader election

Assignee: Shubham Pampattiwar

Reporter: Daniel Munne Ortega

Votes: 0

Watchers: 7

Created: 2026/02/04 3:52 PM

Updated: 2026/02/23 6:46 PM
