OpenShift Bugs
OCPBUGS-66040

Platform type External CI jobs failing on install with CCM


    • Rejected
    • OpenShift SPLAT - Sprint 280

      Description of problem:

          Kubelet's provider ID is not set on nodes in Platform External (UPI with external CCM) installations, causing the Cloud Controller Manager to fail to initialize nodes. As a result, nodes remain tainted with "node.cloudprovider.kubernetes.io/uninitialized", cluster operators cannot schedule pods, and installation fails.

          The root cause is that the MachineConfig uses systemd drop-in files with Environment= directives to set KUBELET_PROVIDERID, but systemd does not expand variables from Environment= directives in ExecStart= command lines; only variables from EnvironmentFile= are expanded.
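      Per the root-cause analysis above, the broken and working patterns can be contrasted in a short sketch (the drop-in path matches the one shown later in this report; the provider ID value is a placeholder):

```ini
# Broken pattern: a drop-in sets the variable with Environment=, which
# (per this report's analysis) is not expanded inside the ExecStart= line
# of kubelet.service.
# /etc/systemd/system/kubelet.service.d/20-providerid.conf
[Service]
Environment=KUBELET_PROVIDERID=aws:///us-east-1a/i-0123456789abcdef0

# Working pattern: the variable lives in the environment file that
# kubelet.service already loads ("-" marks the file as optional), so
# ${KUBELET_PROVIDERID} is expanded in ExecStart=.
# /etc/kubernetes/kubelet-env
KUBELET_PROVIDERID=aws:///us-east-1a/i-0123456789abcdef0
```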

      Version-Release number of selected component (if applicable):

          Affects OpenShift 4.19, 4.20, 4.21 (observed in nightly releases)
        Component: Machine Config Operator / Platform External installation
        Upstream component: kubernetes/cloud-provider-aws (CCM regression in commit c111ea6c122c from Feb 23, 2025)
      

      How reproducible:

          Intermittent; this is a timing race condition:
        - If /etc/kubernetes/kubelet-env exists from a previous MCO run, nodes work
        - Fresh nodes without a pre-existing kubelet-env fail with an empty provider ID
      

      Steps to Reproduce:

        Reproducible in CI periodic jobs:
        - periodic-ci-openshift-release-master-nightly-4.19-e2e-external-aws-ccm
        - periodic-ci-openshift-release-master-nightly-4.20-e2e-external-aws-ccm
        - periodic-ci-openshift-release-master-nightly-4.21-e2e-external-aws-ccm

      Actual results:

        - Kubelet runs with empty --provider-id= argument
        - Nodes remain tainted with node.cloudprovider.kubernetes.io/uninitialized
        - CCM fails to initialize nodes
        - Cluster operators cannot schedule pods on uninitialized nodes
        - Installation times out
      
        Example from live cluster debugging:

        $ cat /etc/systemd/system/kubelet.service.d/20-providerid.conf
        [Service]
        Environment=KUBELET_PROVIDERID=aws:///us-east-1a/i-0037cc9e4e43b84b7

        $ ps aux | grep kubelet
        ... --provider-id=   # EMPTY!
      

      Expected results:

        - Kubelet should run with --provider-id=aws:///zone/instance-id
        - CCM should successfully initialize nodes and remove uninitialized taint
        - Cluster operators should be able to schedule pods on all nodes
        - Installation should complete successfully
      

      Additional info:

        **Root Cause:**
      
        Systemd does not expand variables from Environment= directives in ExecStart= lines. Only variables from EnvironmentFile= are expanded in command execution.  
      
      UPI kubelet.service contains:
        EnvironmentFile=-/etc/kubernetes/kubelet-env
        ExecStart=/usr/local/bin/kubenswrapper /usr/bin/kubelet --provider-id=${KUBELET_PROVIDERID}  
      
      The kubelet.service already reads /etc/kubernetes/kubelet-env, but the MachineConfig was writing to a drop-in file instead.  
      
      **Fix:**
        Write KUBELET_PROVIDERID to /etc/kubernetes/kubelet-env instead of using systemd drop-in files.  
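        The fix can be sketched as a small shell script. The availability zone and instance ID below are placeholder assumptions (a real node would read them from EC2 instance metadata), and the target path defaults to ./kubelet-env only so the sketch can run off-node; on a node it would be /etc/kubernetes/kubelet-env, which kubelet.service already loads via EnvironmentFile=-/etc/kubernetes/kubelet-env:

```shell
#!/bin/sh
set -eu

# Placeholders: a real implementation would query EC2 instance metadata.
AZ="${AZ:-us-east-1a}"
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"

# Target file; defaults to a relative path so the sketch runs off-node.
KUBELET_ENV="${KUBELET_ENV:-./kubelet-env}"

# Append rather than overwrite in case other variables are already present,
# but skip the write if a provider ID entry already exists (idempotent).
if ! grep -q '^KUBELET_PROVIDERID=' "${KUBELET_ENV}" 2>/dev/null; then
  echo "KUBELET_PROVIDERID=aws:///${AZ}/${INSTANCE_ID}" >> "${KUBELET_ENV}"
fi

cat "${KUBELET_ENV}"
```

        Because kubelet.service expands variables from EnvironmentFile= in its ExecStart= line, no drop-in file is needed at all with this approach.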
      
      **Fix PR:** https://github.com/openshift/release/pull/71807  
      
      **Related Upstream Issue:**
        CCM regression in kubernetes/cloud-provider-aws commit c111ea6c122c (PR kubernetes/cloud-provider-aws#1109, Feb 23, 2025) - InstanceMetadata() now bypasses node informer cache, exposing timing issues where nodes don't have provider ID set.  
      
      **Affected Code:**
        ci-operator/step-registry/platform-external/pre/conf/manifests/platform-external-pre-conf-manifests-commands.sh

        **Live Cluster Evidence:**
        In failed rehearsal jobs, 2 out of 3 master nodes had an empty provider ID while 1 master and all 3 workers had the correct provider ID (race condition; the workers likely had a pre-existing kubelet-env from MCO).
      

              rhn-support-mrbraga Marco Braga
              Zhaohua Sun

              Votes: 0
              Watchers: 4