OpenShift Bugs
OCPBUGS-66040

Platform type External CI jobs failing on install with CCM


    • Rejected
    • OpenShift SPLAT - Sprint 280

      Description of problem:

          Kubelet's provider ID is not set on nodes in Platform External (UPI with external CCM) installations, causing the Cloud Controller Manager to fail to initialize nodes. As a result, nodes remain tainted with "node.cloudprovider.kubernetes.io/uninitialized", cluster operators cannot schedule pods, and installation fails.

          The root cause is that the MachineConfig uses systemd drop-in files with Environment= directives to set KUBELET_PROVIDERID, but systemd does not expand variables from Environment= directives in ExecStart= command lines; only variables from EnvironmentFile= are expanded.
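      Per the root-cause analysis above, the broken and working patterns can be contrasted in a short sketch (the drop-in path matches the one shown later in this report; the provider ID value is a placeholder):

```ini
# Broken pattern: a drop-in sets the variable with Environment=, which
# (per this report's analysis) is not expanded inside the ExecStart= line
# of kubelet.service.
# /etc/systemd/system/kubelet.service.d/20-providerid.conf
[Service]
Environment=KUBELET_PROVIDERID=aws:///us-east-1a/i-0123456789abcdef0

# Working pattern: the variable lives in the environment file that
# kubelet.service already loads ("-" marks the file as optional), so
# ${KUBELET_PROVIDERID} is expanded in ExecStart=.
# /etc/kubernetes/kubelet-env
KUBELET_PROVIDERID=aws:///us-east-1a/i-0123456789abcdef0
```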

      Version-Release number of selected component (if applicable):

          Affects OpenShift 4.19, 4.20, 4.21 (observed in nightly releases)
        Component: Machine Config Operator / Platform External installation
        Upstream component: kubernetes/cloud-provider-aws (CCM regression in commit c111ea6c122c from Feb 23, 2025)
      

      How reproducible:

          Intermittent; this is a timing race condition:
        - If /etc/kubernetes/kubelet-env exists from a previous MCO run, nodes work
        - Fresh nodes without a pre-existing kubelet-env fail with an empty provider ID
      

      Steps to Reproduce:

        Reproducible in CI periodic jobs:
        - periodic-ci-openshift-release-master-nightly-4.19-e2e-external-aws-ccm
        - periodic-ci-openshift-release-master-nightly-4.20-e2e-external-aws-ccm
        - periodic-ci-openshift-release-master-nightly-4.21-e2e-external-aws-ccm

      Actual results:

        - Kubelet runs with empty --provider-id= argument
        - Nodes remain tainted with node.cloudprovider.kubernetes.io/uninitialized
        - CCM fails to initialize nodes
        - Cluster operators cannot schedule pods on uninitialized nodes
        - Installation times out
      
        Example from live cluster debugging:

        $ cat /etc/systemd/system/kubelet.service.d/20-providerid.conf
        [Service]
        Environment=KUBELET_PROVIDERID=aws:///us-east-1a/i-0037cc9e4e43b84b7

        $ ps aux | grep kubelet
        ... --provider-id=   # EMPTY!
      

      Expected results:

        - Kubelet should run with --provider-id=aws:///zone/instance-id
        - CCM should successfully initialize nodes and remove uninitialized taint
        - Cluster operators should be able to schedule pods on all nodes
        - Installation should complete successfully
      

      Additional info:

        **Root Cause:**
      
        Systemd does not expand variables from Environment= directives in ExecStart= lines. Only variables from EnvironmentFile= are expanded in command execution.  
      
      UPI kubelet.service contains:
        EnvironmentFile=-/etc/kubernetes/kubelet-env
        ExecStart=/usr/local/bin/kubenswrapper /usr/bin/kubelet --provider-id=${KUBELET_PROVIDERID}  
      
      The kubelet.service already reads /etc/kubernetes/kubelet-env, but the MachineConfig was writing to a drop-in file instead.  
      
      **Fix:**
        Write KUBELET_PROVIDERID to /etc/kubernetes/kubelet-env instead of using systemd drop-in files.  
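        The fix can be sketched as a small shell script. The availability zone and instance ID below are placeholder assumptions (a real node would read them from EC2 instance metadata), and the target path defaults to ./kubelet-env only so the sketch can run off-node; on a node it would be /etc/kubernetes/kubelet-env, which kubelet.service already loads via EnvironmentFile=-/etc/kubernetes/kubelet-env:

```shell
#!/bin/sh
set -eu

# Placeholders: a real implementation would query EC2 instance metadata.
AZ="${AZ:-us-east-1a}"
INSTANCE_ID="${INSTANCE_ID:-i-0123456789abcdef0}"

# Target file; defaults to a relative path so the sketch runs off-node.
KUBELET_ENV="${KUBELET_ENV:-./kubelet-env}"

# Append rather than overwrite in case other variables are already present,
# but skip the write if a provider ID entry already exists (idempotent).
if ! grep -q '^KUBELET_PROVIDERID=' "${KUBELET_ENV}" 2>/dev/null; then
  echo "KUBELET_PROVIDERID=aws:///${AZ}/${INSTANCE_ID}" >> "${KUBELET_ENV}"
fi

cat "${KUBELET_ENV}"
```

        Because kubelet.service expands variables from EnvironmentFile= in its ExecStart= line, no drop-in file is needed at all with this approach.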
      
      **Fix PR:** https://github.com/openshift/release/pull/71807  
      
      **Related Upstream Issue:**
        CCM regression in kubernetes/cloud-provider-aws commit c111ea6c122c (PR kubernetes/cloud-provider-aws#1109, Feb 23, 2025) - InstanceMetadata() now bypasses node informer cache, exposing timing issues where nodes don't have provider ID set.  
      
      **Affected Code:**
        ci-operator/step-registry/platform-external/pre/conf/manifests/platform-external-pre-conf-manifests-commands.sh

        **Live Cluster Evidence:**
        In failed rehearsal jobs, 2 out of 3 master nodes had an empty provider ID while 1 master and all 3 workers had the correct provider ID (race condition; the workers likely had a pre-existing kubelet-env from MCO).
      

              rhn-support-mrbraga Marco Braga
              Zhaohua Sun

              Votes: 0
              Watchers: 4