Bug
Resolution: Unresolved
4.20.z
Problem
When a new node joins a hosted cluster, there can sometimes be a ~15-minute delay before the global-pull-secret-syncer DaemonSet pod is scheduled on that node. The DaemonSet pod should be scheduled within seconds of the node becoming Ready.
Customer-Observed Symptoms
default 1h22m Normal NodeReady node/arohcp4-spark-e32-az3-qg699-bqrws Node status is now: NodeReady
kube-system 1h8m Normal Scheduled pod/global-pull-secret-syncer-x6qwv Successfully assigned to arohcp4-spark-e32-az3-qg699-bqrws
There is a 14-minute gap between the node becoming Ready and the syncer pod being scheduled.
Potential Root Cause
The globalps controller in the Hosted Cluster Config Operator (HCCO) has a race between the Node CREATE event on the hosted cluster and CAPI setting Machine.Status.NodeRef on the management cluster. Five factors combine to cause the delay:
- Race condition: When a Node is created on the hosted cluster, the globalps controller immediately receives a CREATE event and runs reconciliation. However, the CAPI Machine controller on the management cluster has not yet set Machine.Status.NodeRef for the corresponding Machine. The labelNodesForGlobalPullSecret() function relies on Machine.Status.NodeRef to map Machines to Nodes. Since NodeRef is nil, the new node is skipped and does not receive the hypershift.openshift.io/nodepool-globalps-enabled: "true" label.
- No Machine watch: The controller does not watch Machine objects on the management cluster. When CAPI later sets Machine.Status.NodeRef, no event is fired to the globalps controller.
- Node UPDATE events are filtered out: The controller's predicate explicitly returns false for node updates (setup.go line 145), so even cache resync events do not trigger a re-reconcile for nodes.
- No RequeueAfter: The Reconcile() function returns ctrl.Result{} with no RequeueAfter, so there is no automatic retry after the missed labeling.
- Eventual trigger is opportunistic: The node finally gets labeled only when an unrelated event triggers a reconcile (e.g., a Secret change in kube-system). The delay depends entirely on when that next event occurs.
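The interaction of these factors can be modeled in a minimal, self-contained sketch. The types and the `reconcile` function below are illustrative stand-ins, not the actual HCCO code; they only mirror the shape of `labelNodesForGlobalPullSecret()`: labeling is driven solely by `Machine.Status.NodeRef`, nothing retries after a skip, and only the next (unrelated) reconcile picks the node up.

```go
package main

import "fmt"

// Hypothetical miniature model of the globalps labeling race.
type machine struct {
	name    string
	nodeRef string // empty until CAPI sets Machine.Status.NodeRef
}

type node struct {
	name    string
	labeled bool // stands in for the nodepool-globalps-enabled label
}

// reconcile mirrors the labeling pass: a node can only be labeled if its
// Machine already has a NodeRef pointing at it. It returns without any
// retry, modeling the missing RequeueAfter.
func reconcile(machines []machine, nodes map[string]*node) {
	for _, m := range machines {
		if m.nodeRef == "" {
			continue // NodeRef not set yet: node is silently skipped
		}
		if n, ok := nodes[m.nodeRef]; ok {
			n.labeled = true
		}
	}
}

func main() {
	nodes := map[string]*node{"worker-1": {name: "worker-1"}}
	machines := []machine{{name: "machine-1"}} // NodeRef still unset

	// 1. Node CREATE event fires immediately; reconcile runs too early.
	reconcile(machines, nodes)
	fmt.Println("after CREATE event:", nodes["worker-1"].labeled) // false

	// 2. CAPI sets NodeRef later, but with no Machine watch and node
	//    UPDATE events filtered out, no reconcile is triggered.
	machines[0].nodeRef = "worker-1"
	fmt.Println("after NodeRef set:", nodes["worker-1"].labeled) // still false

	// 3. Only an unrelated event (e.g. a Secret change in kube-system)
	//    re-runs reconcile and finally labels the node.
	reconcile(machines, nodes)
	fmt.Println("after unrelated event:", nodes["worker-1"].labeled) // true
}
```

The model makes the non-determinism concrete: between steps 2 and 3 the node is Ready but unlabeled, and the length of that window depends entirely on when the next unrelated event arrives.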
Impact
- New nodes do not have the global pull secret syncer running for several minutes after becoming Ready
- Workloads scheduled on those nodes during this window may fail to pull images from private registries that require the merged pull secret
- The delay is non-deterministic and depends on unrelated cluster activity