Bug
Resolution: Unresolved
Major
4.19.z, 4.20.z, 4.21
Rejected
OpenShift SPLAT - Sprint 280
Description of problem:
Kubelet provider ID is not set on nodes in Platform External (UPI with external CCM) installations, causing Cloud Controller Manager to fail to initialize nodes. This results in nodes remaining tainted with "node.cloudprovider.kubernetes.io/uninitialized", preventing cluster operators from scheduling pods and causing installation failures. The root cause is that the MachineConfig uses systemd drop-in files with Environment= directives to set KUBELET_PROVIDERID, but systemd does not expand variables from Environment= directives in ExecStart= command lines - only variables from EnvironmentFile= are expanded.
Version-Release number of selected component (if applicable):
Affects OpenShift 4.19, 4.20, 4.21 (observed in nightly releases)
Component: Machine Config Operator / Platform External installation
Upstream component: kubernetes/cloud-provider-aws (CCM regression in commit c111ea6c122c from Feb 23, 2025)
How reproducible:
Intermittent (timing race condition):
- If /etc/kubernetes/kubelet-env exists from a previous MCO run, nodes work
- Fresh nodes without a pre-existing kubelet-env fail with an empty provider ID
Steps to Reproduce:
Reproducible in CI periodic jobs:
- periodic-ci-openshift-release-master-nightly-4.19-e2e-external-aws-ccm
- periodic-ci-openshift-release-master-nightly-4.20-e2e-external-aws-ccm
- periodic-ci-openshift-release-master-nightly-4.21-e2e-external-aws-ccm
Actual results:
- Kubelet runs with an empty --provider-id= argument
- Nodes remain tainted with node.cloudprovider.kubernetes.io/uninitialized
- CCM fails to initialize nodes
- Cluster operators cannot schedule pods on uninitialized nodes
- Installation times out

Example from live cluster debugging:

$ cat /etc/systemd/system/kubelet.service.d/20-providerid.conf
[Service]
Environment=KUBELET_PROVIDERID=aws:///us-east-1a/i-0037cc9e4e43b84b7

$ ps aux | grep kubelet
--provider-id=    # EMPTY!
Expected results:
- Kubelet should run with --provider-id=aws:///zone/instance-id
- CCM should successfully initialize nodes and remove the uninitialized taint
- Cluster operators should be able to schedule pods on all nodes
- Installation should complete successfully
Additional info:
**Root Cause:**
Systemd does not expand variables from Environment= directives in ExecStart= lines. Only variables from EnvironmentFile= are expanded in command execution.
UPI kubelet.service contains:
EnvironmentFile=-/etc/kubernetes/kubelet-env
ExecStart=/usr/local/bin/kubenswrapper /usr/bin/kubelet --provider-id=${KUBELET_PROVIDERID}
The kubelet.service already reads /etc/kubernetes/kubelet-env, but the MachineConfig was writing to a drop-in file instead.
**Fix:**
Write KUBELET_PROVIDERID to /etc/kubernetes/kubelet-env instead of using systemd drop-in files.
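A minimal sketch of the fix, assuming the availability zone and instance ID have already been derived from instance metadata (the values below are illustrative, taken from the transcript above; the demo writes to a temp file, while the real script targets /etc/kubernetes/kubelet-env):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values; the real script derives these from EC2 instance
# metadata (IMDS) for the node being configured.
AZ="us-east-1a"
INSTANCE_ID="i-0037cc9e4e43b84b7"

# Demo target; the actual fix writes /etc/kubernetes/kubelet-env, the
# file kubelet.service already reads via EnvironmentFile=.
ENV_FILE="$(mktemp)"

# Because the value lands in the EnvironmentFile, systemd expands
# ${KUBELET_PROVIDERID} in the ExecStart= line as expected.
cat > "$ENV_FILE" <<EOF
KUBELET_PROVIDERID=aws:///${AZ}/${INSTANCE_ID}
EOF

cat "$ENV_FILE"
```

With this in place, no drop-in file is needed and the unmodified UPI kubelet.service picks the variable up on start.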
**Fix PR:** https://github.com/openshift/release/pull/71807
**Related Upstream Issue:**
CCM regression in kubernetes/cloud-provider-aws commit c111ea6c122c (PR kubernetes/cloud-provider-aws#1109, Feb 23, 2025) - InstanceMetadata() now bypasses node informer cache, exposing timing issues where nodes don't have provider ID set.
**Affected Code:**
ci-operator/step-registry/platform-external/pre/conf/manifests/platform-external-pre-conf-manifests-commands.sh
**Live Cluster Evidence:**
In failed rehearsal jobs, 2 of 3 master nodes had an empty provider ID, while 1 master and all 3 workers had the correct provider ID (consistent with the race condition: the workers likely had a pre-existing kubelet-env from the MCO).
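The failure state on a node can be checked by comparing the drop-in against the env file. This is a sketch that simulates the broken state against temporary paths (the real paths are /etc/systemd/system/kubelet.service.d/20-providerid.conf and /etc/kubernetes/kubelet-env):

```shell
# Simulate the failing state: drop-in present, env file absent.
workdir="$(mktemp -d)"
dropin="$workdir/20-providerid.conf"
envfile="$workdir/kubelet-env"

printf '[Service]\nEnvironment=KUBELET_PROVIDERID=aws:///us-east-1a/i-0037cc9e4e43b84b7\n' > "$dropin"

# Diagnosis: the drop-in carries the value, but per this ticket only
# variables from the EnvironmentFile= are expanded in ExecStart=, so a
# missing env file means kubelet runs with an empty --provider-id.
if [ -f "$envfile" ]; then
  echo "kubelet-env present: $(cat "$envfile")"
else
  echo "kubelet-env MISSING: provider ID will be empty"
fi
```

On a healthy node both files exist and agree on the KUBELET_PROVIDERID value; on the failed masters only the drop-in was present.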