Bug
Resolution: Unresolved
4.20.z
Problem
When a new node joins a hosted cluster, there can sometimes be a ~15-minute delay before the global-pull-secret-syncer DaemonSet pod is scheduled on that node. The DaemonSet pod should be scheduled within seconds of the node becoming Ready.
Customer-Observed Symptoms
default 1h22m Normal NodeReady node/arohcp4-spark-e32-az3-qg699-bqrws Node status is now: NodeReady
kube-system 1h8m Normal Scheduled pod/global-pull-secret-syncer-x6qwv Successfully assigned to arohcp4-spark-e32-az3-qg699-bqrws
There is a 14-minute gap between the node becoming Ready and the syncer pod being scheduled.
Potential Root Cause
The globalps controller in the Hosted Cluster Config Operator (HCCO) has a race between the Node CREATE event on the hosted cluster and CAPI setting Machine.Status.NodeRef on the management cluster. Five factors combine to cause the delay:
- Race condition: When a Node is created on the hosted cluster, the globalps controller immediately receives a CREATE event and runs reconciliation. However, the CAPI Machine controller on the management cluster has not yet set Machine.Status.NodeRef for the corresponding Machine. The labelNodesForGlobalPullSecret() function relies on Machine.Status.NodeRef to map Machines to Nodes. Since NodeRef is nil, the new node is skipped and does not receive the hypershift.openshift.io/nodepool-globalps-enabled: "true" label.
- No Machine watch: The controller does not watch Machine objects on the management cluster. When CAPI later sets Machine.Status.NodeRef, no event is fired to the globalps controller.
- Node UPDATE events are filtered out: The controller's predicate explicitly returns false for node updates (setup.go line 145), so even cache resync events do not trigger a re-reconcile for nodes.
- No RequeueAfter: The Reconcile() function returns ctrl.Result{} with no RequeueAfter, so there is no automatic retry after the missed labeling.
- Eventual trigger is opportunistic: The node finally gets labeled only when an unrelated event triggers a reconcile (e.g., a Secret change in kube-system). The delay depends entirely on when that next event occurs.
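The interaction of these factors can be modeled in a minimal, self-contained sketch. The types and the `reconcile` function below are illustrative stand-ins, not the actual HCCO code; they only mirror the shape of `labelNodesForGlobalPullSecret()`: labeling is driven solely by `Machine.Status.NodeRef`, nothing retries after a skip, and only the next (unrelated) reconcile picks the node up.

```go
package main

import "fmt"

// Hypothetical miniature model of the globalps labeling race.
type machine struct {
	name    string
	nodeRef string // empty until CAPI sets Machine.Status.NodeRef
}

type node struct {
	name    string
	labeled bool // stands in for the nodepool-globalps-enabled label
}

// reconcile mirrors the labeling pass: a node can only be labeled if its
// Machine already has a NodeRef pointing at it. It returns without any
// retry, modeling the missing RequeueAfter.
func reconcile(machines []machine, nodes map[string]*node) {
	for _, m := range machines {
		if m.nodeRef == "" {
			continue // NodeRef not set yet: node is silently skipped
		}
		if n, ok := nodes[m.nodeRef]; ok {
			n.labeled = true
		}
	}
}

func main() {
	nodes := map[string]*node{"worker-1": {name: "worker-1"}}
	machines := []machine{{name: "machine-1"}} // NodeRef still unset

	// 1. Node CREATE event fires immediately; reconcile runs too early.
	reconcile(machines, nodes)
	fmt.Println("after CREATE event:", nodes["worker-1"].labeled) // false

	// 2. CAPI sets NodeRef later, but with no Machine watch and node
	//    UPDATE events filtered out, no reconcile is triggered.
	machines[0].nodeRef = "worker-1"
	fmt.Println("after NodeRef set:", nodes["worker-1"].labeled) // still false

	// 3. Only an unrelated event (e.g. a Secret change in kube-system)
	//    re-runs reconcile and finally labels the node.
	reconcile(machines, nodes)
	fmt.Println("after unrelated event:", nodes["worker-1"].labeled) // true
}
```

The model makes the non-determinism concrete: between steps 2 and 3 the node is Ready but unlabeled, and the length of that window depends entirely on when the next unrelated event arrives.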
Impact
- New nodes do not have the global pull secret syncer running for several minutes after becoming Ready
- Workloads scheduled on those nodes during this window may fail to pull images from private registries that require the merged pull secret
- The delay is non-deterministic and depends on unrelated cluster activity