  1. OpenShift Bugs
  2. OCPBUGS-36330

Machine stuck in Provisioned when the cluster is upgraded from 4.1 to 4.15

    • Critical
    • No
    • MCO Sprint 255, MCO Sprint 256
    • 2
    • False
    • This fixes a node scale-up issue that occurred on OCP clusters originally installed with v4.1 or v4.2. Because bootimage update functionality is not yet available, nodes booted from 4.1 and 4.2 bootimages got stuck during provisioning: machine-config-daemon-firstboot.service failed because the machine-config-daemon binary on the node was incompatible. With this fix, a matching RHEL 8 build of the machine-config-daemon binary is copied onto the node during node firstboot.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-28974. The following is the description of the original issue:

      Description of problem:

      Machine stuck in Provisioned when the cluster is upgraded from 4.1 to 4.15    

      Version-Release number of selected component (if applicable):

      Upgrade from 4.1 to 4.15
      4.1.41-x86_64, 4.2.36-x86_64, 4.3.40-x86_64, 4.4.33-x86_64, 4.5.41-x86_64, 4.6.62-x86_64, 4.7.60-x86_64, 4.8.57-x86_64, 4.9.59-x86_64, 4.10.67-x86_64, 4.11 nightly, 4.12 nightly, 4.13 nightly, 4.14 nightly, 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest    

      How reproducible:

      Apparently always. The issue was found in our Prow CI, and I also reproduced it.

      Steps to Reproduce:

      1. Create an AWS IPI 4.1 cluster, then upgrade it one minor version at a time to 4.14
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.11.0-0.nightly-2024-01-19-110702   True        True          26m     Working towards 4.12.0-0.nightly-2024-02-04-062856: 654 of 830 done (78% complete), waiting on authentication, openshift-apiserver, openshift-controller-manager
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-0.nightly-2024-02-04-062856   True        False         5m12s   Cluster version is 4.12.0-0.nightly-2024-02-04-062856
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-0.nightly-2024-02-04-062856   True        True          61m     Working towards 4.13.0-0.nightly-2024-02-04-042638: 713 of 841 done (84% complete), waiting up to 40 minutes on machine-config
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.0-0.nightly-2024-02-04-042638   True        False         10m     Cluster version is 4.13.0-0.nightly-2024-02-04-042638
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.0-0.nightly-2024-02-04-042638   True        True          17m     Working towards 4.14.0-0.nightly-2024-02-02-173828: 233 of 860 done (27% complete), waiting on control-plane-machine-set, machine-api
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2024-02-02-173828   True        False         18m     Cluster version is 4.14.0-0.nightly-2024-02-02-173828     
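
When stepping through this many minor versions, it can help to check mechanically that each hop has settled before starting the next one. The following is a minimal sketch, not part of the original report: the function name `settled_version` is hypothetical, and it assumes the default table layout of `oc get clusterversion` shown above (VERSION in column 2, AVAILABLE in column 3, PROGRESSING in column 4).

```shell
#!/bin/sh
# Hypothetical helper: print the cluster version only when the upgrade has
# settled (AVAILABLE=True and PROGRESSING=False); print nothing otherwise.
# Reads the table output of `oc get clusterversion` on stdin.
settled_version() {
  # NR > 1 skips the header row; columns follow the layout shown above.
  awk 'NR > 1 && $3 == "True" && $4 == "False" { print $2 }'
}

# Possible use against a live cluster (polling interval is an assumption):
#   until [ -n "$(oc get clusterversion | settled_version)" ]; do sleep 60; done
```

Note that right after an upgrade is requested, PROGRESSING may still read False for a short time, so a real loop would also want to compare the printed version with the intended target.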
      
      2. Once the cluster is upgraded to 4.14, check that machine scale-up succeeds
      liuhuali@Lius-MacBook-Pro huali-test %  oc create -f ms1.yaml 
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa created
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
      NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a    1         1         1       1           14h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa   0         0                             3s
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f    2         2         2       2           14h
      liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=1
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                  PHASE     TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa-mt9kh   Running   m6a.xlarge   us-east-1   us-east-1a   15m
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running   m6a.xlarge   us-east-1   us-east-1f   15h
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                           STATUS   ROLES    AGE     VERSION
      ip-10-0-128-51.ec2.internal    Ready    master   15h     v1.27.10+28ed2d7
      ip-10-0-143-198.ec2.internal   Ready    worker   14h     v1.27.10+28ed2d7
      ip-10-0-143-64.ec2.internal    Ready    worker   14h     v1.27.10+28ed2d7
      ip-10-0-143-80.ec2.internal    Ready    master   15h     v1.27.10+28ed2d7
      ip-10-0-144-123.ec2.internal   Ready    master   15h     v1.27.10+28ed2d7
      ip-10-0-147-94.ec2.internal    Ready    worker   14h     v1.27.10+28ed2d7
      ip-10-0-158-61.ec2.internal    Ready    worker   3m40s   v1.27.10+28ed2d7
      liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=0
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled
      liuhuali@Lius-MacBook-Pro huali-test % oc get node                                                                   
      NAME                           STATUS   ROLES    AGE   VERSION
      ip-10-0-128-51.ec2.internal    Ready    master   15h   v1.27.10+28ed2d7
      ip-10-0-143-198.ec2.internal   Ready    worker   15h   v1.27.10+28ed2d7
      ip-10-0-143-64.ec2.internal    Ready    worker   15h   v1.27.10+28ed2d7
      ip-10-0-143-80.ec2.internal    Ready    master   15h   v1.27.10+28ed2d7
      ip-10-0-144-123.ec2.internal   Ready    master   15h   v1.27.10+28ed2d7
      ip-10-0-147-94.ec2.internal    Ready    worker   15h   v1.27.10+28ed2d7
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine                                                                
      NAME                                                 PHASE     TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                  Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                  Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                  Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt   Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa 
      machineset.machine.openshift.io "ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa" deleted
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2024-02-02-173828   True        False         43m     Cluster version is 4.14.0-0.nightly-2024-02-02-173828     
      
      3. Upgrade to 4.15
      The upgrade to the 4.15 nightly got stuck on operator-lifecycle-manager-packageserver, which is a known bug (https://issues.redhat.com/browse/OCPBUGS-28744), so I built an image that includes the fix PR (job build openshift/operator-framework-olm#679 succeeded) and upgraded to that image. The upgrade then completed successfully.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2024-02-02-173828   True        True          7s      Working towards 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest: 10 of 875 done (1% complete)
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         23m     Cluster version is 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest
      liuhuali@Lius-MacBook-Pro huali-test % oc get co
      NAME                                       VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      baremetal                                  4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      11h     
      cloud-controller-manager                   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      8h      
      cloud-credential                           4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      cluster-autoscaler                         4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      config-operator                            4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      13h     
      console                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      3h19m   
      control-plane-machine-set                  4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      5h      
      csi-snapshot-controller                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      7h10m   
      dns                                        4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      etcd                                       4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      image-registry                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      33m     
      ingress                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      insights                                   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      kube-apiserver                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      kube-controller-manager                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      kube-scheduler                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      kube-storage-version-migrator              4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      34m     
      machine-api                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      machine-approver                           4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      13h     
      machine-config                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      10h     
      marketplace                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      10h     
      monitoring                                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      network                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      node-tuning                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      56m     
      openshift-apiserver                        4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      openshift-controller-manager               4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      4h56m   
      openshift-samples                          4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      58m     
      operator-lifecycle-manager                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      operator-lifecycle-manager-catalog         4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      operator-lifecycle-manager-packageserver   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      57m     
      service-ca                                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      storage                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                 PHASE     TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                  Running   m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                  Running   m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                  Running   m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt   Running   m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k   Running   m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb   Running   m6a.xlarge   us-east-1   us-east-1f   16h 
      
      4. Check machine scale-up again: the new machine is stuck in Provisioned, and no CSR is pending
      
      liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 created
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
      NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a    1         1         1       1           16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1   0         0                             6s
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f    2         2         2       2           16h
      liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 --replicas=1
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 scaled
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                  PHASE          TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running        m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running        m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running        m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running        m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877   Provisioning   m6a.xlarge   us-east-1   us-east-1a   4s
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running        m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running        m6a.xlarge   us-east-1   us-east-1f   16h
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                  PHASE         TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running       m6a.xlarge   us-east-1   us-east-1a   18h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running       m6a.xlarge   us-east-1   us-east-1a   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877   Provisioned   m6a.xlarge   us-east-1   us-east-1a   97m
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f1-4ln47   Provisioned   m6a.xlarge   us-east-1   us-east-1f   50m
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                           STATUS   ROLES    AGE   VERSION
      ip-10-0-128-51.ec2.internal    Ready    master   18h   v1.28.6+a373c1b
      ip-10-0-143-198.ec2.internal   Ready    worker   18h   v1.28.6+a373c1b
      ip-10-0-143-64.ec2.internal    Ready    worker   18h   v1.28.6+a373c1b
      ip-10-0-143-80.ec2.internal    Ready    master   18h   v1.28.6+a373c1b
      ip-10-0-144-123.ec2.internal   Ready    master   18h   v1.28.6+a373c1b
      ip-10-0-147-94.ec2.internal    Ready    worker   18h   v1.28.6+a373c1b
      liuhuali@Lius-MacBook-Pro huali-test % oc get csr
      NAME        AGE   SIGNERNAME                                    REQUESTOR                                  REQUESTEDDURATION   CONDITION
      csr-596n7   21m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-147-94.ec2.internal    <none>              Approved,Issued
      csr-7nr9m   42m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-147-94.ec2.internal    <none>              Approved,Issued
      csr-bc9n7   16m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-128-51.ec2.internal    <none>              Approved,Issued
      csr-dmk27   18m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-128-51.ec2.internal    <none>              Approved,Issued
      csr-ggkgd   64m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-143-198.ec2.internal   <none>              Approved,Issued
      csr-rs9cz   70m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-143-80.ec2.internal    <none>              Approved,Issued
      liuhuali@Lius-MacBook-Pro huali-test %     
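
The check in step 4 (machines that stay in Provisioned while no CSR is waiting for approval) can be sketched as two small filters over the same command output. This is an illustrative sketch only: the function names `stuck_machines` and `pending_csrs` are hypothetical, and the column positions are assumed from the tables above (PHASE is the second column of `oc get machine`, CONDITION is the last column of `oc get csr`).

```shell
#!/bin/sh
# Hypothetical helper: print the names of machines whose PHASE is exactly
# "Provisioned" (machines still in "Provisioning" are not matched).
# Reads the table output of `oc get machine` on stdin.
stuck_machines() {
  awk 'NR > 1 && $2 == "Provisioned" { print $1 }'
}

# Hypothetical helper: print the names of CSRs whose CONDITION is "Pending".
# Reads the table output of `oc get csr` on stdin.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Possible use against a live cluster, assuming the current project is the
# one that holds the machines, as in the transcript above:
#   oc get machine | stuck_machines
#   oc get csr | pending_csrs
```

In the situation reported here, the first filter would list the two new worker machines and the second would print nothing, which matches the symptom of a node that never gets far enough to request a certificate.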

      Actual results:

       The machine is stuck in Provisioned.

      Expected results:

        The machine should reach Running.

      Additional info:

      Must gather: https://drive.google.com/file/d/1TrZ_mb-cHKmrNMsuFl9qTdYo_eNPuF_l/view?usp=sharing 
      I can see the provisioned machine on AWS console: https://drive.google.com/file/d/1-OcsmvfzU4JBeGh5cil8P2Hoe5DQsmqF/view?usp=sharing
      System log of ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877: https://drive.google.com/file/d/1spVT_o0S4eqeQxE5ivttbAazCCuSzj1e/view?usp=sharing 
      Some logs from the instance: https://drive.google.com/file/d/1zjxPxm61h4L6WVHYv-w7nRsSz5Fku26w/view?usp=sharing 
          

            team-mco Team MCO
            openshift-crt-jira-prow OpenShift Prow Bot
            Sergio Regidor de la Rosa Sergio Regidor de la Rosa