Fix APIService reconciliation ordering to prevent transient unavailability during bootstrap
Type: Epic
Status: To Do
Resolution: Obsolete
The hosted cluster config operator reconciles each APIService before its backing Service and Endpoints in resources.go. This creates a race in which the kube-apiserver aggregation layer tries to reach APIService backends that do not exist yet, resulting in transient API unavailability (503s) for OpenShift API groups during cluster bootstrap or after a full re-reconciliation.
Problem
In control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go, the reconciliation order is:
1. APIService created: references a Service by name/namespace (line 508)
2. Service created: gets a ClusterIP assigned by Kubernetes (line 513)
3. Endpoints created: point to the control-plane Service's ClusterIP (line 522)
This ordering means there is a window where the APIService exists and is registered with the kube-apiserver aggregation layer, but the backing Service and/or Endpoints do not yet exist. During this window, the kube-apiserver marks the APIService as Available=False and returns 503 errors for all API groups served by that APIService.
Affected APIServices
- OpenShift API Server — 9 API groups (apps, authorization, build, image, quota, route, security, template, project) — resources.go lines 508-525
- OpenShift OAuth API Server — 2 API groups (oauth, user) — resources.go lines 528-545
- OLM PackageServer — 1 API group (packages.operators.coreos.com) — resources.go lines ~1940-1972
Root Cause Detail
The ReconcileAPIService function (oapi/reconcile.go:19-34) sets the APIService's .spec.service to reference the backing Service by name. It uses a manifest template object (e.g., manifests.OpenShiftAPIServerClusterService()) to get the name/namespace, but does not verify the actual Service exists in the guest cluster. Meanwhile, the Endpoints reconciliation does guard on the control-plane Service having a ClusterIP (resources.go:1504), but this guard is too late — the APIService is already registered and failing.
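The reconciler uses error accumulation (errs = append(errs, ...)) rather than early return, so all three steps run in the same reconcile pass. This keeps the race window small but does not eliminate it: the APIService is still created first, so the aggregation layer can observe it before the Service and Endpoints calls complete.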
Impact
- Components checking API availability during bootstrap (CVO, cluster operators, OLM) may see aggregated APIs as unavailable
- oc get --raw calls to affected API groups return 503
- Can cause cascading delays in cluster readiness as operators wait for API availability
- Most visible during initial cluster creation or after a full resource re-reconciliation
Proposed Solution
Reverse the reconciliation order so the backing infrastructure is in place before the APIService is registered:
1. Service created: ensure it exists and gets a ClusterIP
2. Endpoints created: point to the control-plane Service's ClusterIP, guarded on the ClusterIP being assigned
3. APIService created: register with kube-apiserver, which will immediately find a working backend
Additionally, guard APIService creation/update on the backing Service existing with a valid ClusterIP and Endpoints being in place. If the Service or Endpoints are not ready, skip the APIService reconciliation (returning an error to trigger a requeue) rather than creating an APIService pointing to a non-existent backend.
Finally, gate the HostedControlPlaneAvailable condition on the affected APIServices reporting Available=True. This follows the same pattern used for ControlPlaneComponents in hostedcontrolplane_controller.go, where controlPlaneComponentsAvailable checks that all components have their Available condition set before the HCP is marked as available. A similar check should verify that the OpenShift API Server, OAuth API Server, and OLM PackageServer APIServices are available, preventing the HCP from being marked ready while aggregated APIs are still unreachable.
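A minimal sketch of such a gate is shown below. The function name apiServicesAvailable and its signature are assumptions made for illustration, as are the APIService names in the example; the real check would read the Available condition from APIService objects in the guest cluster and would cover every affected API group.

```go
package main

import (
	"fmt"
	"strings"
)

// apiServicesAvailable reports whether every expected APIService is
// available. The available map holds the observed state keyed by APIService
// name (true when the Available condition is True); names missing from the
// map count as unavailable, so an empty observation never passes the gate.
// The returned message names the APIServices that block availability.
func apiServicesAvailable(expected []string, available map[string]bool) (bool, string) {
	var missing []string
	for _, name := range expected {
		if !available[name] {
			missing = append(missing, name)
		}
	}
	if len(missing) > 0 {
		return false, "APIServices not available: " + strings.Join(missing, ", ")
	}
	return true, ""
}

func main() {
	// Illustrative subset of names; not the full list of affected groups.
	expected := []string{
		"v1.apps.openshift.io",
		"v1.oauth.openshift.io",
		"v1.packages.operators.coreos.com",
	}
	observed := map[string]bool{
		"v1.apps.openshift.io":  true,
		"v1.oauth.openshift.io": true,
	}
	ok, msg := apiServicesAvailable(expected, observed)
	fmt.Println(ok, msg)
}
```

Driving the check from an explicit list of expected names, rather than from whatever APIServices happen to exist, keeps the gate closed when an APIService has not been created yet.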
Epic Acceptance Criteria
- For all three sets of APIServices (OpenShift API Server, OAuth API Server, OLM PackageServer), the backing Service and Endpoints are reconciled before the APIService is created or updated
- APIService creation is guarded: if the backing Service does not exist or does not have a valid ClusterIP, the APIService is not created and the reconciler requeues
- If the Endpoints are not yet populated, the APIService is not created and the reconciler requeues
- The HostedControlPlaneAvailable condition is gated on the affected APIServices reporting Available=True, following the pattern of controlPlaneComponentsAvailable in hostedcontrolplane_controller.go (lines 709-748). The HCP should not be marked available while aggregated OpenShift APIs are unreachable
- No regression in steady-state behavior: once all resources exist, reconciliation continues to keep them in sync as before
- Unit tests cover the ordering guarantee, the guard conditions, and the APIService availability gate for all three reconciliation paths
- The fix applies consistently to all three affected reconciliation paths (OpenShift API Server, OAuth API Server, OLM PackageServer)
Scope
In Scope
- Reordering reconciliation in resources.go for OpenShift API Server, OAuth API Server, and OLM PackageServer
- Adding guard conditions to APIService reconciliation functions
- Adding an APIService availability check to the HostedControlPlaneAvailable condition logic in hostedcontrolplane_controller.go, similar to the existing controlPlaneComponentsAvailable check
- Unit tests for the new ordering, guard behavior, and availability gate
Out of Scope
- Changes to the APIService spec itself or the backing Service configuration
- Changes to how the control-plane Service ClusterIP is obtained
- Broader reconciliation ordering changes outside the APIService/Service/Endpoints triple
Affected Code
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go — main reconciliation ordering (lines 508-545, 1431-1465, 1499-1525, ~1940-1972)
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/oapi/reconcile.go — ReconcileAPIService, ReconcileEndpoints, ReconcileClusterService
- control-plane-operator/hostedclusterconfigoperator/controllers/resources/olm/packageserver.go — OLM PackageServer reconciliation
- control-plane-operator/controllers/hostedcontrolplane/hostedcontrolplane_controller.go — HostedControlPlaneAvailable condition logic (lines 705-758), new APIService availability gate alongside existing controlPlaneComponentsAvailable check