OpenShift Bugs: OCPBUGS-55383

MAPI machine has uninitialized taints when using custom dhcp on AWS (and CAPI machine stuck in Provisioned)


    • Critical
    • Yes
    • CLOUD Sprint 270, CLOUD Sprint 271
    • 2
    • Rejected
    • False
    • Cause: The script used to fetch the provider ID from AWS previously wrote a systemd drop-in unit for kubelet as part of its output. Drop-in units should not be written dynamically, because systemd does not guarantee that they are loaded by the time it starts the intended service.

      Consequence: Sometimes the drop-in was not loaded in time, and kubelet started without a provider ID.

      Fix: Update the script to write to a well-known environment file that is always configured in the kubelet service.

      Result: The provider ID is consistently set on kubelet startup.
    • Bug Fix
    • In Progress
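
The Cause and Fix fields above describe replacing a dynamically written systemd drop-in with an environment file. The following is a minimal sketch of that idea, not the actual script: the function names, the `KUBELET_PROVIDERID` variable name, and the file path passed in are assumptions made for illustration. The `aws:///<zone>/<instance-id>` format matches the PROVIDERID column in the listings in this issue. A real script would read the zone and instance ID from instance metadata; here they are plain arguments.

```shell
# Build a provider ID in the aws:///<zone>/<instance-id> form.
provider_id() {
  local zone="$1" instance_id="$2"
  printf 'aws:///%s/%s' "$zone" "$instance_id"
}

# Write the provider ID to an environment file (path is hypothetical).
# The file is written to a temporary name first and then renamed, so a
# reader never sees a half-written file.
write_provider_env() {
  local env_file="$1" zone="$2" instance_id="$3"
  local tmp
  tmp="$(mktemp "${env_file}.XXXXXX")" || return 1
  printf 'KUBELET_PROVIDERID=%s\n' "$(provider_id "$zone" "$instance_id")" > "$tmp" || return 1
  mv -f "$tmp" "$env_file"
}
```

With this shape, the kubelet unit can carry a static `EnvironmentFile=` line that points at the file. systemd reads that file when it starts the service, so the script's output does not depend on unit files being reloaded first.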

      This is a clone of issue OCPBUGS-50905. The following is the description of the original issue:

      Description of problem:

          When the VPC uses a custom DHCP options set on AWS, a newly scaled MAPI machine reaches Running but its node keeps the uninitialized taint; a newly scaled CAPI machine gets stuck in Provisioned with its CSRs pending.

      Version-Release number of selected component (if applicable):

          4.19.0-0.nightly-2025-02-14-215306

      How reproducible:

          Seemingly always for MAPI machines when scaling a machineset, and frequently for CAPI machines.

      Steps to Reproduce:

          1. Install a 4.19 AWS cluster. We used the automated template ipi-on-aws/versioned-installer-techpreview-ci.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.19.0-0.nightly-2025-02-14-215306   True        False         100m    Cluster version is 4.19.0-0.nightly-2025-02-14-215306
      
          2. Create a custom DHCP options set, then switch the VPC to use it in the AWS console.
      
          3. Scale a worker machineset. The new machine reaches Running, but its node has the uninitialized taint.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset huliu-aws217a-lts7q-worker-us-east-2a --replicas=2
      machineset.machine.openshift.io/huliu-aws217a-lts7q-worker-us-east-2a scaled
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine                          
      NAME                                          PHASE     TYPE         REGION      ZONE         AGE
      huliu-aws217a-lts7q-master-0                  Running   m6i.xlarge   us-east-2   us-east-2a   141m
      huliu-aws217a-lts7q-master-1                  Running   m6i.xlarge   us-east-2   us-east-2b   141m
      huliu-aws217a-lts7q-master-2                  Running   m6i.xlarge   us-east-2   us-east-2c   141m
      huliu-aws217a-lts7q-worker-us-east-2a-cm5c9   Running   m6i.xlarge   us-east-2   us-east-2a   137m
      huliu-aws217a-lts7q-worker-us-east-2a-wz82p   Running   m6i.xlarge   us-east-2   us-east-2a   16m
      huliu-aws217a-lts7q-worker-us-east-2b-w2gg5   Running   m6i.xlarge   us-east-2   us-east-2b   137m
      huliu-aws217a-lts7q-worker-us-east-2c-rfm65   Running   m6i.xlarge   us-east-2   us-east-2c   137m
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                                        STATUS   ROLES                  AGE    VERSION
      ip-10-0-16-147.example.com                  Ready    worker                 24m    v1.32.1
      ip-10-0-2-172.us-east-2.compute.internal    Ready    control-plane,master   151m   v1.32.1
      ip-10-0-25-84.us-east-2.compute.internal    Ready    worker                 141m   v1.32.1
      ip-10-0-35-16.us-east-2.compute.internal    Ready    control-plane,master   149m   v1.32.1
      ip-10-0-38-54.us-east-2.compute.internal    Ready    worker                 141m   v1.32.1
      ip-10-0-73-150.us-east-2.compute.internal   Ready    worker                 145m   v1.32.1
      ip-10-0-73-232.us-east-2.compute.internal   Ready    control-plane,master   151m   v1.32.1
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get node ip-10-0-16-147.example.com  -oyaml |grep -A5 taints
        taints:
        - effect: NoSchedule
          key: node.cloudprovider.kubernetes.io/uninitialized
          value: "true"
      status:
        addresses:
           
          4. Create a CAPI machine. The machine reaches Running and its node has no taints. (I had also seen a CAPI machine get stuck in Provisioned on creation earlier.) After scaling to 2 replicas, the new CAPI machine is stuck in Provisioned and its CSRs stay Pending.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine.c
      NAME            CLUSTER               NODENAME                     PROVIDERID                              PHASE         AGE   VERSION
      capi-ms-swmg9   huliu-aws217a-lts7q                                aws:///us-east-2b/i-0dc359eb8b745f27f   Provisioned   21m   
      capi-ms-v5vjl   huliu-aws217a-lts7q   ip-10-0-47-232.example.com   aws:///us-east-2b/i-0c7d74925b9cae64f   Running       31m   
      liuhuali@Lius-MacBook-Pro huali-test % oc get csr
      NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
      csr-4f6ls   17m     kubernetes.io/kube-apiserver-client           system:node:ip-10-0-45-27.example.com                                       24h                 Approved,Issued
      csr-5hrxs   65m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
      csr-79pcj   64m     kubernetes.io/kube-apiserver-client           system:node:ip-10-0-16-147.example.com                                      24h                 Approved,Issued
      csr-7xgz8   27m     kubernetes.io/kube-apiserver-client           system:node:ip-10-0-47-232.example.com                                      24h                 Approved,Issued
      csr-9fqhh   64m     kubernetes.io/kube-apiserver-client           system:node:ip-10-0-16-147.example.com                                      24h                 Approved,Issued
      csr-b9lvw   28m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
      csr-bh4bj   27m     kubernetes.io/kube-apiserver-client           system:node:ip-10-0-47-232.example.com                                      24h                 Approved,Issued
      csr-grl8d   18m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
      csr-nfzws   64m     kubernetes.io/kubelet-serving                 system:node:ip-10-0-16-147.example.com                                      <none>              Approved,Issued
      csr-p6zw7   27m     kubernetes.io/kubelet-serving                 system:node:ip-10-0-47-232.example.com                                      <none>              Approved,Issued
      csr-rbn7k   17m     kubernetes.io/kube-apiserver-client           system:node:ip-10-0-45-27.example.com                                       24h                 Approved,Issued
      csr-rrwdx   2m51s   kubernetes.io/kubelet-serving                 system:node:ip-10-0-45-27.example.com                                       <none>              Pending
      csr-wxrpf   17m     kubernetes.io/kubelet-serving                 system:node:ip-10-0-45-27.example.com                                       <none>              Pending
      
          5. Scale another worker machineset to 2. The new machine reaches Running and its node has the uninitialized taint. However, a machine from a newly created worker machineset reaches Running and its node does not have the taint.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine -o wide                  
      NAME                                           PHASE     TYPE         REGION      ZONE         AGE     NODE                                        PROVIDERID                              STATE
      huliu-aws217a-lts7q-master-0                   Running   m6i.xlarge   us-east-2   us-east-2a   4h15m   ip-10-0-2-172.us-east-2.compute.internal    aws:///us-east-2a/i-029a670b32f1a0ab8   running
      huliu-aws217a-lts7q-master-1                   Running   m6i.xlarge   us-east-2   us-east-2b   4h15m   ip-10-0-35-16.us-east-2.compute.internal    aws:///us-east-2b/i-03b0d624a0d77296e   running
      huliu-aws217a-lts7q-master-2                   Running   m6i.xlarge   us-east-2   us-east-2c   4h15m   ip-10-0-73-232.us-east-2.compute.internal   aws:///us-east-2c/i-00815b8b0af77f7d2   running
      huliu-aws217a-lts7q-worker-us-east-2a-cm5c9    Running   m6i.xlarge   us-east-2   us-east-2a   4h11m   ip-10-0-25-84.us-east-2.compute.internal    aws:///us-east-2a/i-011a38e96cda97dc7   running
      huliu-aws217a-lts7q-worker-us-east-2a-wz82p    Running   m6i.xlarge   us-east-2   us-east-2a   130m    ip-10-0-16-147.example.com                  aws:///us-east-2a/i-06069479b9e0f14a1   running
      huliu-aws217a-lts7q-worker-us-east-2aa-w589d   Running   m6i.xlarge   us-east-2   us-east-2a   21m     ip-10-0-18-230.example.com                  aws:///us-east-2a/i-0842dcd42e4a170fa   running
      huliu-aws217a-lts7q-worker-us-east-2b-w2gg5    Running   m6i.xlarge   us-east-2   us-east-2b   4h11m   ip-10-0-38-54.us-east-2.compute.internal    aws:///us-east-2b/i-0ece9bc2cd89b2e6e   running
      huliu-aws217a-lts7q-worker-us-east-2c-nl2bt    Running   m6i.xlarge   us-east-2   us-east-2c   55m     ip-10-0-91-174.example.com                  aws:///us-east-2c/i-0ac4328744648f204   running
      huliu-aws217a-lts7q-worker-us-east-2c-rfm65    Running   m6i.xlarge   us-east-2   us-east-2c   4h11m   ip-10-0-73-150.us-east-2.compute.internal   aws:///us-east-2c/i-0179fef8bc05db99c   running
      liuhuali@Lius-MacBook-Pro huali-test % 
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get node ip-10-0-91-174.example.com  -oyaml |grep -A5 taints
        taints:
        - effect: NoSchedule
          key: node.cloudprovider.kubernetes.io/uninitialized
          value: "true"
      status:
        addresses:
      liuhuali@Lius-MacBook-Pro huali-test % oc get node ip-10-0-18-230.example.com  -oyaml |grep -A5 taints
      liuhuali@Lius-MacBook-Pro huali-test % 
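
The per-node `grep -A5 taints` checks in steps 3 and 5 can be run across the whole cluster in one pass. This is a sketch only: `node_taints` and `uninitialized_nodes` are hypothetical helper names, and the first one assumes a logged-in `oc` session.

```shell
# Print "name<TAB>taint keys" for every node (requires cluster access).
node_taints() {
  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
}

# Read "name<TAB>taint keys" lines on stdin and print the names of nodes
# that still carry the cloud provider's uninitialized taint.
uninitialized_nodes() {
  awk -F'\t' 'index($2, "node.cloudprovider.kubernetes.io/uninitialized") { print $1 }'
}
```

Usage: `node_taints | uninitialized_nodes`. Keeping the filter separate from the `oc` call means it can also be run against saved output.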
      
      

      Actual results:

          The MAPI machine reaches Running but its node has the uninitialized taint; the CAPI machine is stuck in Provisioned with its CSRs pending.
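
The pending CSRs from step 4 can be picked out of `oc get csr` output with a small filter. This is a sketch that assumes CONDITION is the last whitespace-separated column and REQUESTOR the fourth, as in the listing above; `pending_csrs` is a hypothetical helper name.

```shell
# Read `oc get csr` output on stdin and print the name and requestor of
# every CSR whose CONDITION (last column) is exactly "Pending".
pending_csrs() {
  awk '$NF == "Pending" { print $1, $4 }'
}
```

Usage: `oc get csr --no-headers | pending_csrs`.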

      Expected results:

          Machines reach Running and their nodes do not have the uninitialized taint.
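
Since the release note ties the problem to kubelet starting without a provider ID, one way to confirm the expected state is to look for nodes whose `spec.providerID` is empty. This is a sketch under that assumption (the issue itself does not show the field for affected nodes); both helper names are hypothetical, and the first one needs a logged-in `oc` session.

```shell
# Print "name<TAB>providerID" for every node (requires cluster access).
node_provider_ids() {
  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
}

# Read "name<TAB>providerID" lines on stdin and print the names of nodes
# whose provider ID field is empty.
nodes_without_provider_id() {
  awk -F'\t' '$1 != "" && $2 == "" { print $1 }'
}
```

Usage: `node_provider_ids | nodes_without_provider_id`. An empty result is consistent with the expected behavior.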

      Additional info:

          Discussion on slack: https://redhat-internal.slack.com/archives/GE2HQ9QP4/p1739437423874929

              joelspeed Joel Speed
              openshift-crt-jira-prow OpenShift Prow Bot
              Huali Liu Huali Liu