OpenShift Bugs / OCPBUGS-25887

[IBMCloud] cluster install fails nodes stuck in node.cloudprovider.kubernetes.io/uninitialized

    • Severity: Critical

      Description of problem:

      Cluster install fails on IBMCloud; the nodes are stuck with the taint node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule.

      Version-Release number of selected component (if applicable):

      from 4.16.0-0.nightly-2023-12-22-210021
      
      last PASS version: 4.16.0-0.nightly-2023-12-20-061023

      How reproducible:

      Always 

      Steps to Reproduce:

          1. Install a cluster on IBMCloud. We use the automated Flexy template: aos-4_16/ipi-on-ibmcloud/versioned-installer
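      Once the install stalls, the failure mode can be confirmed by checking for the cloud-provider "uninitialized" taint, which the CCM normally removes after it initializes each node. A minimal sketch — it parses the Taints block captured in the transcript below rather than querying a live cluster; the `oc` invocation in the comment is the hypothetical live-cluster equivalent:

```shell
#!/bin/sh
# Taints copied from the `oc describe node` output in this report.
# On a live cluster you would instead run (hypothetical invocation):
#   oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints[*].key}{"\n"}{end}'
taints='node-role.kubernetes.io/master:NoSchedule
node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
node.kubernetes.io/not-ready:NoSchedule'

# The node is still uninitialized if the cloud-provider taint is present.
if printf '%s\n' "$taints" | grep -q '^node\.cloudprovider\.kubernetes\.io/uninitialized'; then
  echo "node still carries the uninitialized taint"
fi
```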
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       True          92m     Unable to apply 4.16.0-0.nightly-2023-12-25-200355: an unknown error has occurred: MultipleErrors
      liuhuali@Lius-MacBook-Pro huali-test % oc get co
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                                                                                                               
      baremetal                                                                                                                    
      cloud-controller-manager                   4.16.0-0.nightly-2023-12-25-200355   True        False         False      89m     
      cloud-credential                                                                                                             
      cluster-autoscaler                                                                                                           
      config-operator                                                                                                              
      console                                                                                                                      
      control-plane-machine-set                                                                                                    
      csi-snapshot-controller                                                                                                      
      dns                                                                                                                          
      etcd                                                                                                                         
      image-registry                                                                                                               
      ingress                                                                                                                      
      insights                                                                                                                     
      kube-apiserver                                                                                                               
      kube-controller-manager                                                                                                      
      kube-scheduler                                                                                                               
      kube-storage-version-migrator                                                                                                
      machine-api                                                                                                                  
      machine-approver                                                                                                             
      machine-config                                                                                                               
      marketplace                                                                                                                  
      monitoring                                                                                                                   
      network                                                                                                                      
      node-tuning                                                                                                                  
      openshift-apiserver                                                                                                          
      openshift-controller-manager                                                                                                 
      openshift-samples                                                                                                            
      operator-lifecycle-manager                                                                                                   
      operator-lifecycle-manager-catalog                                                                                           
      operator-lifecycle-manager-packageserver                                                                                     
      service-ca                                                                                                                   
      storage                                                                                                                       
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                        STATUS     ROLES                  AGE   VERSION
      huliu-ibma-qbg48-master-0   NotReady   control-plane,master   89m   v1.29.0+b0d609f
      huliu-ibma-qbg48-master-1   NotReady   control-plane,master   89m   v1.29.0+b0d609f
      huliu-ibma-qbg48-master-2   NotReady   control-plane,master   89m   v1.29.0+b0d609f
      liuhuali@Lius-MacBook-Pro huali-test % oc describe node huliu-ibma-qbg48-master-0
      Name:               huliu-ibma-qbg48-master-0
      Roles:              control-plane,master
      Labels:             beta.kubernetes.io/arch=amd64
                          beta.kubernetes.io/os=linux
                          kubernetes.io/arch=amd64
                          kubernetes.io/hostname=huliu-ibma-qbg48-master-0
                          kubernetes.io/os=linux
                          node-role.kubernetes.io/control-plane=
                          node-role.kubernetes.io/master=
                          node.openshift.io/os_id=rhcos
      Annotations:        volumes.kubernetes.io/controller-managed-attach-detach: true
      CreationTimestamp:  Wed, 27 Dec 2023 18:02:21 +0800
      Taints:             node-role.kubernetes.io/master:NoSchedule
                          node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                          node.kubernetes.io/not-ready:NoSchedule
      Unschedulable:      false
      Lease:
        HolderIdentity:  huliu-ibma-qbg48-master-0
        AcquireTime:     <unset>
        RenewTime:       Wed, 27 Dec 2023 19:32:24 +0800
      Conditions:
        Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
        ----             ------  -----------------                 ------------------                ------                       -------
        MemoryPressure   False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
        DiskPressure     False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
        PIDPressure      False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
        Ready            False   Wed, 27 Dec 2023 19:32:21 +0800   Wed, 27 Dec 2023 18:02:21 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
      Addresses:
      Capacity:
        cpu:                4
        ephemeral-storage:  104266732Ki
        hugepages-1Gi:      0
        hugepages-2Mi:      0
        memory:             16391716Ki
        pods:               250
      Allocatable:
        cpu:                3500m
        ephemeral-storage:  95018478229
        hugepages-1Gi:      0
        hugepages-2Mi:      0
        memory:             15240740Ki
        pods:               250
      System Info:
        Machine ID:                 0ae21a012be844f18c5871f6eaefb85b
        System UUID:                0ae21a01-2be8-44f1-8c58-71f6eaefb85b
        Boot ID:                    fbe619e2-8ff5-4cdb-b6a4-cd6830ccc568
        Kernel Version:             5.14.0-284.45.1.el9_2.x86_64
        OS Image:                   Red Hat Enterprise Linux CoreOS 416.92.202312250319-0 (Plow)
        Operating System:           linux
        Architecture:               amd64
        Container Runtime Version:  cri-o://1.28.2-9.rhaos4.15.git6d902a3.el9
        Kubelet Version:            v1.29.0+b0d609f
        Kube-Proxy Version:         v1.29.0+b0d609f
      Non-terminated Pods:          (0 in total)
        Namespace                   Name    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
        ---------                   ----    ------------  ----------  ---------------  -------------  ---
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource           Requests  Limits
        --------           --------  ------
        cpu                0 (0%)    0 (0%)
        memory             0 (0%)    0 (0%)
        ephemeral-storage  0 (0%)    0 (0%)
        hugepages-1Gi      0 (0%)    0 (0%)
        hugepages-2Mi      0 (0%)    0 (0%)
      Events:
        Type    Reason                   Age                From             Message
        ----    ------                   ----               ----             -------
        Normal  NodeHasNoDiskPressure    90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasNoDiskPressure
        Normal  NodeHasSufficientPID     90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientPID
        Normal  NodeHasSufficientMemory  90m (x7 over 90m)  kubelet          Node huliu-ibma-qbg48-master-0 status is now: NodeHasSufficientMemory
        Normal  RegisteredNode           90m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
        Normal  RegisteredNode           73m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
        Normal  RegisteredNode           53m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
        Normal  RegisteredNode           32m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller
        Normal  RegisteredNode           12m                node-controller  Node huliu-ibma-qbg48-master-0 event: Registered Node huliu-ibma-qbg48-master-0 in Controller 
      liuhuali@Lius-MacBook-Pro huali-test % oc get pod -n openshift-cloud-controller-manager
      NAME                                            READY   STATUS             RESTARTS         AGE
      ibm-cloud-controller-manager-787645668b-djqnr   0/1     CrashLoopBackOff   22 (2m29s ago)   90m
      ibm-cloud-controller-manager-787645668b-pgkh2   0/1     Error              15 (5m8s ago)    52m
      liuhuali@Lius-MacBook-Pro huali-test % oc describe pod ibm-cloud-controller-manager-787645668b-pgkh2 -n openshift-cloud-controller-manager
      Name:                 ibm-cloud-controller-manager-787645668b-pgkh2
      Namespace:            openshift-cloud-controller-manager
      Priority:             2000000000
      Priority Class Name:  system-cluster-critical
      Node:                 huliu-ibma-qbg48-master-2/
      Start Time:           Wed, 27 Dec 2023 18:41:23 +0800
      Labels:               infrastructure.openshift.io/cloud-controller-manager=IBMCloud
                            k8s-app=ibm-cloud-controller-manager
                            pod-template-hash=787645668b
      Annotations:          operator.openshift.io/config-hash: 82a75c6ff86a490b0dac9c8c9b91f1987da0e646a42d72c33c54cbde3c29395b
      Status:               Running
      IP:                   
      IPs:                  <none>
      Controlled By:        ReplicaSet/ibm-cloud-controller-manager-787645668b
      Containers:
        cloud-controller-manager:
          Container ID:  cri-o://c56e246f64c770146c30b7a894f6a4d974159551dbb9d1ea31c238e516a0f854
          Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218
          Image ID:      e494d0d4b28e31170a4a2792bb90701c7f1e81c78c03e3686c5f0e601801937e
          Port:          10258/TCP
          Host Port:     10258/TCP
          Command:
            /bin/bash
            -c
            #!/bin/bash
            set -o allexport
            if [[ -f /etc/kubernetes/apiserver-url.env ]]; then
              source /etc/kubernetes/apiserver-url.env
            fi
            exec /bin/ibm-cloud-controller-manager \
            --bind-address=$(POD_IP_ADDRESS) \
            --use-service-account-credentials=true \
            --configure-cloud-routes=false \
            --cloud-provider=ibm \
            --cloud-config=/etc/ibm/cloud.conf \
            --profiling=false \
            --leader-elect=true \
            --leader-elect-lease-duration=137s \
            --leader-elect-renew-deadline=107s \
            --leader-elect-retry-period=26s \
            --leader-elect-resource-namespace=openshift-cloud-controller-manager \
            --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384 \
            --v=2
            
          State:          Waiting
            Reason:       CrashLoopBackOff
          Last State:     Terminated
            Reason:       Error
            Exit Code:    1
            Started:      Wed, 27 Dec 2023 19:33:23 +0800
            Finished:     Wed, 27 Dec 2023 19:33:23 +0800
          Ready:          False
          Restart Count:  15
          Requests:
            cpu:     75m
            memory:  60Mi
          Liveness:  http-get https://:10258/healthz delay=300s timeout=160s period=10s #success=1 #failure=3
          Environment:
            POD_IP_ADDRESS:           (v1:status.podIP)
            VPCCTL_CLOUD_CONFIG:     /etc/ibm/cloud.conf
            VPCCTL_PUBLIC_ENDPOINT:  false
          Mounts:
            /etc/ibm from cloud-conf (rw)
            /etc/kubernetes from host-etc-kube (ro)
            /etc/pki/ca-trust/extracted/pem from trusted-ca (ro)
            /etc/vpc from ibm-cloud-credentials (rw)
            /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-cbd4b (ro)
      Conditions:
        Type                        Status
        PodReadyToStartContainers   True 
        Initialized                 True 
        Ready                       False 
        ContainersReady             False 
        PodScheduled                True 
      Volumes:
        trusted-ca:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      ccm-trusted-ca
          Optional:  false
        host-etc-kube:
          Type:          HostPath (bare host directory volume)
          Path:          /etc/kubernetes
          HostPathType:  Directory
        cloud-conf:
          Type:      ConfigMap (a volume populated by a ConfigMap)
          Name:      cloud-conf
          Optional:  false
        ibm-cloud-credentials:
          Type:        Secret (a volume populated by a Secret)
          SecretName:  ibm-cloud-credentials
          Optional:    false
        kube-api-access-cbd4b:
          Type:                    Projected (a volume that contains injected data from multiple sources)
          TokenExpirationSeconds:  3607
          ConfigMapName:           kube-root-ca.crt
          ConfigMapOptional:       <nil>
          DownwardAPI:             true
          ConfigMapName:           openshift-service-ca.crt
          ConfigMapOptional:       <nil>
      QoS Class:                   Burstable
      Node-Selectors:              node-role.kubernetes.io/master=
      Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                                   node.cloudprovider.kubernetes.io/uninitialized:NoSchedule op=Exists
                                   node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                                   node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                                   node.kubernetes.io/not-ready:NoSchedule op=Exists
                                   node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
      Events:
        Type     Reason     Age                    From               Message
        ----     ------     ----                   ----               -------
        Normal   Scheduled  52m                    default-scheduler  Successfully assigned openshift-cloud-controller-manager/ibm-cloud-controller-manager-787645668b-pgkh2 to huliu-ibma-qbg48-master-2
        Normal   Pulling    52m                    kubelet            Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218"
        Normal   Pulled     52m                    kubelet            Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" in 3.431s (3.431s including waiting)
        Normal   Created    50m (x5 over 52m)      kubelet            Created container cloud-controller-manager
        Normal   Started    50m (x5 over 52m)      kubelet            Started container cloud-controller-manager
        Normal   Pulled     50m (x4 over 52m)      kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76aedf175591ff1675c891e5c80d02ee7425a6b3a98c34427765f402ca050218" already present on machine
        Warning  BackOff    2m19s (x240 over 52m)  kubelet            Back-off restarting failed container cloud-controller-manager in pod ibm-cloud-controller-manager-787645668b-pgkh2_openshift-cloud-controller-manager(d7f93ecf-cd14-450e-a986-028559a775b3)
      liuhuali@Lius-MacBook-Pro huali-test % 
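      Note the pod above has no IP (`IP:` is blank, `IPs: <none>`), while its container command passes `--bind-address=$(POD_IP_ADDRESS)`. A minimal sketch of why that combination crash-loops — a hypothetical reconstruction of the argument expansion, not the actual CCM binary:

```shell
#!/bin/sh
# POD_IP_ADDRESS is populated from the downward API (v1:status.podIP).
# While the node carries the uninitialized taint the pod never receives an
# IP, so the variable is empty.
POD_IP_ADDRESS=""

# In the pod spec, $(POD_IP_ADDRESS) is Kubernetes env substitution; the
# resulting argv entry handed to ibm-cloud-controller-manager is:
arg="--bind-address=${POD_IP_ADDRESS}"
echo "$arg"   # prints "--bind-address=" (flag with no value)
```

An empty bind address gives the manager nothing to bind its secure port (10258) to, consistent with the immediate exit-code-1 crash loop shown in the pod description.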

      Actual results:

          cluster install failed on IBMCloud

      Expected results:

          cluster install succeeds on IBMCloud

      Additional info:

          


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Critical: OpenShift Container Platform 4.16.0 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:0041

            Zhaohua Sun added a comment -

            Verified

            Cluster installation is successful, clusterversion 4.16.0-0.nightly-2024-01-18-153434


            Joel Speed added a comment -

            The change in kubelet behaviour is a change in the 1.29 Kubernetes release. There is no intention to backport it, as that would go against our supportability contract for the 1.29 release, so fixing this for 4.16 only is acceptable.


            Christopher Schaefer added a comment -

            As far as backporting this, I currently do not see that it is necessary.
            However, if whatever change occurred (within kubelet, or some other component) that no longer updates the cluster node's IP automatically gets backported by Kubernetes or Red Hat to 4.15 or earlier releases, then we would need to backport this change to the earlier IBM Cloud CCM releases as well.

            While I don't expect that to happen, the CI builds would be an indication of that happening and requiring this change. Otherwise, those builds appear to be fine as is.

            OpenShift Jira Bot added a comment -

            Hi jeffbnowicki,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update"), and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the bug to Verified.

            OpenShift Jira Bot added a comment -

            Looks like this bug is far enough along in the workflow that a code fix is ready. Customers and support need to know the backport plan. Please complete the "Target Backport Versions" field to indicate which version(s) will receive the fix.

            Christopher Schaefer added a comment -

            After talking with our development team, it appears we can simply drop the `--bind-address` argument from the CCM deployment config:

            https://github.com/cjschaef/cluster-cloud-controller-manager-operator/commit/659f5dbe68995b8b3fd9984c35c19065e5fa1a51
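            A sketch of that change against the container command shown earlier in this report — a paraphrase of the linked commit, not its literal diff:

```diff
 exec /bin/ibm-cloud-controller-manager \
-  --bind-address=$(POD_IP_ADDRESS) \
   --use-service-account-credentials=true \
   --configure-cloud-routes=false \
   --cloud-provider=ibm \
```

            With the flag removed, the manager uses its built-in default bind address, so startup no longer depends on kubelet having populated the pod IP.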

            # oc --kubeconfig cluster-deploys/eu-de-ocpbugs-25887-4/auth/kubeconfig get co
            NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      13h     
            baremetal                                  4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            cloud-controller-manager                   4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            cloud-credential                           4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            cluster-autoscaler                         4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            config-operator                            4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            console                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            control-plane-machine-set                  4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            csi-snapshot-controller                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            dns                                        4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            etcd                                       4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            image-registry                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            ingress                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            insights                                   4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            kube-apiserver                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            kube-controller-manager                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            kube-scheduler                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            kube-storage-version-migrator              4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            machine-api                                4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            machine-approver                           4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            machine-config                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            marketplace                                4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            monitoring                                 4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            network                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            node-tuning                                4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            openshift-apiserver                        4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            openshift-controller-manager               4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            openshift-samples                          4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            operator-lifecycle-manager                 4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            operator-lifecycle-manager-catalog         4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            operator-lifecycle-manager-packageserver   4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            service-ca                                 4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h     
            storage                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            
            That resulted in a successful 4.16 deployment, but I will run some OCP Conformance testing against the cluster to try to confirm things appear functional, before I open a PR.

            Christopher Schaefer added a comment - After talking with our development team, it appears we can simply drop the `--bind-address` argument from the CCM deployment config: https://github.com/cjschaef/cluster-cloud-controller-manager-operator/commit/659f5dbe68995b8b3fd9984c35c19065e5fa1a51

            # oc --kubeconfig cluster-deploys/eu-de-ocpbugs-25887-4/auth/kubeconfig get co
            NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
            authentication                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      13h
            baremetal                                  4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            cloud-controller-manager                   4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            cloud-credential                           4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            cluster-autoscaler                         4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            config-operator                            4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            console                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            control-plane-machine-set                  4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            csi-snapshot-controller                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            dns                                        4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            etcd                                       4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            image-registry                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            ingress                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            insights                                   4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            kube-apiserver                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            kube-controller-manager                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            kube-scheduler                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            kube-storage-version-migrator              4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            machine-api                                4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            machine-approver                           4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            machine-config                             4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            marketplace                                4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            monitoring                                 4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            network                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            node-tuning                                4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            openshift-apiserver                        4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            openshift-controller-manager               4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            openshift-samples                          4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            operator-lifecycle-manager                 4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            operator-lifecycle-manager-catalog         4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            operator-lifecycle-manager-packageserver   4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            service-ca                                 4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h
            storage                                    4.16.0-0.nightly-2024-01-05-154400   True        False         False      16h

            That resulted in a successful 4.16 deployment, but I will run some OCP Conformance testing against the cluster to try to confirm things appear functional before I open a PR.
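
            For reference, the shape of that change is roughly the following (a hypothetical excerpt, not the exact manifest from the linked commit; the surrounding field names and the `$(NODE_IP)` templating are assumptions):

            ```yaml
            # Hypothetical excerpt of the IBM CCM container spec in the
            # operator's deployment asset.
            containers:
            - name: cloud-controller-manager
              args:
              - --cloud-provider=ibm
              # Dropped: binding to the node IP stalls while kubelet leaves
              # the node uninitialized (no address to resolve):
              #   - --bind-address=$(NODE_IP)
              # With the flag removed, the CCM falls back to its default
              # bind address.
            ```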

            Christopher Schaefer added a comment - edited

            This behavior is expected from `kube-rbac-proxy` before the CP is healthy.

            So the core issue is that the CCM expects the pod to have an IP address, taken from the host node.

            But kubelet no longer appears to populate this in 1.29/4.16.

            So, our CCM will need to change to use loopback (127.0.0.1), rather than attempting to bind to the node IP.

            https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/51fb8a64880a81d94c83643bc73cd8b3a9986dff/pkg/cloud/ibm/assets/deployment.yaml#L80
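
            A minimal sketch of that change (hypothetical; only the bind flag is shown, and the previous node-IP templating is an assumption about the linked manifest):

            ```yaml
            # Hypothetical CCM container args in the linked deployment.yaml.
            args:
            # Before: bound to the node IP, which kubelet no longer
            # populates in time:
            #   - --bind-address=$(NODE_IP)
            # After: bind to loopback so startup does not depend on
            # node addresses:
            - --bind-address=127.0.0.1
            ```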

            I will have to check with our development team on this, in order to get this fixed for 4.16.


            Looking at the `kube-rbac-proxy` container running on one of the CP nodes (part of CCCM), it fails to find a TLS cert:

            I0108 23:23:42.099443       1 kube-rbac-proxy.go:399] Reading certificate files
            E0108 23:23:42.099515       1 run.go:74] "command failed" err="failed to initialize certificate reloader: error loading certificates: error loading certificate: open /etc/tls/private/tls.crt: no such file or directory"

            And I can see that the mounted directory does not contain any files (no certs).

             
            {
              "destination": "/etc/tls/private",
              "type": "bind",
              "source": "/var/lib/kubelet/pods/9199b502-0de3-4a0f-bd85-204feb0a7152/volumes/kubernetes.io~secret/cloud-controller-manager-operator-tls",
              "options": [
                "ro",
                "rbind",
                "rprivate",
                "bind"
              ]
            },
            sudo ls /var/lib/kubelet/pods/9199b502-0de3-4a0f-bd85-204feb0a7152/volumes/kubernetes.io~secret/cloud-controller-manager-operator-tls

             

            I'm unsure if this is expected or not. May need to perform some comparisons with 4.15 further to find out more.
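
            One way to confirm the source of the empty mount is to check whether the Secret itself carries any data. A self-contained sketch with a stand-in Secret object (the namespace in the comment is an assumption):

            ```shell
            # Stand-in for the real object; on a live cluster the equivalent
            # would be something like (namespace is an assumption):
            #   oc -n openshift-cloud-controller-manager get secret \
            #     cloud-controller-manager-operator-tls -o json
            cat > /tmp/tls-secret.json <<'EOF'
            {"metadata": {"name": "cloud-controller-manager-operator-tls"}, "type": "kubernetes.io/tls"}
            EOF

            # kubelet only writes tls.crt/tls.key into the pod's secret mount
            # if they are present under .data, so an absent/empty .data would
            # explain the empty directory:
            jq -r '.data // {} | keys | length' /tmp/tls-secret.json   # prints 0 here
            ```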


            PowerVS is experiencing the same issue, where CP nodes have no `status.addresses`, causing the CCM to fail (as it is dependent on node IP).

            https://issues.redhat.com/browse/OCPBUGS-26494
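
            The dependency can be illustrated with a stand-in node object (hypothetical data shaped like a node the cloud provider has not yet initialized; on a live cluster the equivalent is `oc get node <name> -o json`):

            ```shell
            # A node still carrying the uninitialized taint, with no
            # addresses reported yet.
            cat > /tmp/node.json <<'EOF'
            {
              "spec": {
                "taints": [
                  {"key": "node.cloudprovider.kubernetes.io/uninitialized",
                   "value": "true", "effect": "NoSchedule"}
                ]
              },
              "status": {"addresses": []}
            }
            EOF

            # Anything that templates the node's InternalIP gets nothing back,
            # which is why a CCM that depends on the node IP cannot come up:
            jq -r '.status.addresses[] | select(.type=="InternalIP") | .address' /tmp/node.json
            jq -r '.spec.taints[].key' /tmp/node.json
            ```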


              jeffbnowicki Jeff Nowicki
              huliu@redhat.com Huali Liu
              Zhaohua Sun Zhaohua Sun