- Epic
- Resolution: Unresolved
- CRD Safety Revision Controller
- To Do
- Product / Portfolio Work
- 60% To Do, 40% In Progress, 0% Done
- M
Epic Goal
Implement the revision controller for the CAPI installer. This controller watches provider images and the ClusterAPI operator config, computes the desired set of manifests for the current cluster, and writes Revisions to the ClusterAPI status. Revisions are then consumed by the installer controller (separate epic), which applies them using Boxcutter.
Why is this important?
The current CAPI installer applies provider manifests directly during reconciliation with no persistent record of what was applied or what the desired state is. This makes upgrades, rollbacks, and partial failures hard to reason about. Revisions give us:
- A persistent, auditable record of what manifests are desired and what is currently applied
- Clean separation between "decide what to install" (revision controller) and "install it" (installer controller)
- Support for upgrade sequencing: the installer compares currentRevision and desiredRevision to determine whether an upgrade is needed
- A mechanism for orphan cleanup: when a revision changes, the installer can identify and remove resources from a previous revision that aren't in the current one
- A path to unmanagedCustomResourceDefinitions, where users can opt CRDs out of operator management
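The upgrade check and orphan cleanup described above can be sketched in Go as a string comparison and a set difference. This is a minimal illustration, not the installer controller's actual code: the `revisionContents` type, its fields, and the object identifier strings are hypothetical stand-ins for whatever the real API types carry.

```go
package main

import "sort"

// revisionContents is a hypothetical, simplified view of a revision:
// its name plus identifiers for the objects it contains.
type revisionContents struct {
	Name    string
	Objects []string
}

// upgradeNeeded reports whether the applied revision differs from the
// desired one, mirroring the currentRevision/desiredRevision comparison.
func upgradeNeeded(currentRevision, desiredRevision string) bool {
	return currentRevision != desiredRevision
}

// orphans returns the objects that appear in prev but not in next.
// The result is sorted so repeated calls give a stable order.
func orphans(prev, next revisionContents) []string {
	keep := make(map[string]struct{}, len(next.Objects))
	for _, o := range next.Objects {
		keep[o] = struct{}{}
	}
	var out []string
	for _, o := range prev.Objects {
		if _, ok := keep[o]; !ok {
			out = append(out, o)
		}
	}
	sort.Strings(out)
	return out
}
```

Under these assumptions, a previous revision holding `crd/a`, `deploy/b`, and `svc/c` compared against a new one holding `crd/a`, `svc/c`, and `deploy/d` yields `deploy/b` as the only orphan.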
Scenarios
1. Initial install: ClusterAPI singleton exists (created by CVO manifest) but has no revisions. The revision controller reads the Infrastructure singleton to determine the platform, filters provider profiles, computes a contentID, creates the first revision (revision: 1), and sets desiredRevision.
2. Upgrade: Provider images change (new release payload). The revision controller detects that the contentID differs from the latest revision, creates a new revision with an incremented revision number, and updates desiredRevision. The old revision remains until the installer controller confirms the new one is fully applied.
3. No-op reconcile: Provider images haven't changed. The latest revision's contentID matches the computed contentID. No action taken.
4. Unmanaged CRDs: A user adds a CRD name to spec.unmanagedCustomResourceDefinitions. The revision controller propagates this to the new revision. The installer controller skips those CRDs when applying.
5. Max revisions reached: If 16 revisions accumulate (which indicates a bug or environmental problem), the controller stops creating new revisions, sets a non-retryable Degraded condition, and does not requeue. Manual intervention is required.
6. Infrastructure not ready: If the Infrastructure singleton doesn't exist or PlatformStatus is nil, the controller sets a WaitingOnExternal condition and requeues.
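Scenarios 1, 2, 3, and 5 reduce to one decision over the existing revisions and the freshly computed contentID. The Go sketch below illustrates that decision under stated assumptions: `revisionRecord`, `Action`, and `decide` are hypothetical names, and checking for a no-op before checking the revision limit is a guess at the ordering, not something the scenarios specify.

```go
package main

// maxRevisions is the limit named in scenario 5.
const maxRevisions = 16

// revisionRecord is a hypothetical, trimmed-down revision entry.
type revisionRecord struct {
	Name      string
	Revision  int
	ContentID string
}

// Action is the outcome of a single reconcile decision.
type Action int

const (
	ActionNone       Action = iota // latest revision already matches
	ActionCreate                   // create a new revision
	ActionMaxReached               // limit hit: non-retryable, no requeue
)

// decide returns what to do and, for ActionCreate, the revision number
// to use. For any other action the returned number is 0.
func decide(existing []revisionRecord, computedContentID string) (Action, int) {
	if len(existing) == 0 {
		return ActionCreate, 1 // scenario 1: initial install
	}
	// Find the revision with the highest number.
	latest := existing[0]
	for _, r := range existing[1:] {
		if r.Revision > latest.Revision {
			latest = r
		}
	}
	if latest.ContentID == computedContentID {
		return ActionNone, 0 // scenario 3: no-op reconcile
	}
	if len(existing) >= maxRevisions {
		return ActionMaxReached, 0 // scenario 5
	}
	return ActionCreate, latest.Revision + 1 // scenario 2: upgrade
}
```

Keeping the decision a pure function of its inputs makes each scenario straightforward to cover with a table-driven unit test, separate from any API calls.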
Acceptance Criteria
- Revision controller watches the ClusterAPI singleton (name: cluster) and the Infrastructure singleton
- Infrastructure watch only triggers reconcile when PlatformStatus transitions from nil to non-nil
- Profiles filtered by platform: providers with no OCPPlatform match all clusters, providers with a specific platform match only that platform
- Components sorted by InstallOrder ascending, then no-platform before platform-specific, then by name as tiebreaker
- ContentID is SHA256 of the concatenated ContentID fields of all matching provider profiles (deterministic, order-dependent)
- Revision name format: <releaseVersion><contentID[:8]><revisionNumber>, truncated to 255 chars
- No new revision created if latest revision's contentID already matches (idempotent)
- Max 16 revisions enforced; exceeding this is a non-retryable error
- Updates RevisionControllerProgressing and RevisionControllerDegraded conditions on the cluster-api ClusterOperator
- Ephemeral errors set Progressing=True; persistent errors (>5min) set Degraded=True
- Non-retryable errors (e.g. max revisions) set Degraded=True, do not requeue
- CI running successfully with tests automated
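The filtering, ordering, ContentID, and naming criteria above can be sketched in Go as follows. This is an illustration under assumptions, not the controller's implementation: the `providerProfile` type and its field names are hypothetical, the digest is assumed to be hex-encoded, and `revisionName` concatenates the three parts exactly as the format above is written, with no separator.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strconv"
)

// providerProfile is a hypothetical, simplified provider profile.
type providerProfile struct {
	Name         string
	OCPPlatform  string // empty means the profile matches all clusters
	InstallOrder int
	ContentID    string
}

// filterByPlatform keeps profiles with no platform or a matching platform.
func filterByPlatform(profiles []providerProfile, platform string) []providerProfile {
	var out []providerProfile
	for _, p := range profiles {
		if p.OCPPlatform == "" || p.OCPPlatform == platform {
			out = append(out, p)
		}
	}
	return out
}

// sortProfiles orders by InstallOrder ascending, then no-platform before
// platform-specific, then by name as the tiebreaker.
func sortProfiles(profiles []providerProfile) {
	sort.SliceStable(profiles, func(i, j int) bool {
		a, b := profiles[i], profiles[j]
		if a.InstallOrder != b.InstallOrder {
			return a.InstallOrder < b.InstallOrder
		}
		if (a.OCPPlatform == "") != (b.OCPPlatform == "") {
			return a.OCPPlatform == ""
		}
		return a.Name < b.Name
	})
}

// combinedContentID hashes the profiles' ContentID fields in slice order,
// so the result is deterministic but order-dependent.
func combinedContentID(profiles []providerProfile) string {
	h := sha256.New()
	for _, p := range profiles {
		h.Write([]byte(p.ContentID))
	}
	return hex.EncodeToString(h.Sum(nil)) // hex encoding is an assumption
}

// revisionName joins release version, the first 8 characters of the
// content ID, and the revision number, truncated to 255 characters.
func revisionName(releaseVersion, id string, revisionNumber int) string {
	if len(id) > 8 {
		id = id[:8]
	}
	name := releaseVersion + id + strconv.Itoa(revisionNumber)
	if len(name) > 255 {
		name = name[:255]
	}
	return name
}
```

Because the hash depends on input order, sorting has to happen before `combinedContentID` is called; otherwise two reconciles over the same profiles could produce different IDs and trigger a spurious revision.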
Release Technical Enablement
- Enhancement: https://github.com/openshift/enhancements/pull/1918
- API: https://github.com/openshift/api/pull/2564
- TechPreview only, no user-facing behavior changes beyond the new ClusterAPI resource being present
Dependencies (internal and external)
Open Questions
- Revision pruning: the current implementation does not prune old revisions. Should the revision controller prune revisions older than currentRevision, or is that the installer controller's responsibility?
- Should unmanagedCustomResourceDefinitions from the spec be included in the contentID calculation (i.e. trigger a new revision when it changes)?
Done Checklist
- CI: CI is running, tests are automated and merged.
- is related to: OCPCLOUD-3317 CAPI Installer upgrade safety (In Progress)