OpenShift Bugs / OCPBUGS-28974

Machine stuck in Provisioned when the cluster is upgraded from 4.1 to 4.15

    • Critical
    • No
    • MCO Sprint 252, MCO Sprint 255
    • 2
    • Rejected
    • False
      * Previously, after upgrading from {product-title} 4.1 or 4.2 to version 4.15, some machines could get stuck during provisioning and never become available. This was because the `machine-config-daemon-firstboot` service was failing due to having an incompatible `machine-config-daemon` binary on those nodes. With this release, the correct `machine-config-daemon` binary is copied to nodes before booting. (link:https://issues.redhat.com/browse/OCPBUGS-28974[*OCPBUGS-28974*])
    • Bug Fix
    • Done
    • Provision

      Description of problem:

      Machine stuck in Provisioned when the cluster is upgraded from 4.1 to 4.15    

      Version-Release number of selected component (if applicable):

      Upgrade from 4.1 to 4.15
      4.1.41-x86_64, 4.2.36-x86_64, 4.3.40-x86_64, 4.4.33-x86_64, 4.5.41-x86_64, 4.6.62-x86_64, 4.7.60-x86_64, 4.8.57-x86_64, 4.9.59-x86_64, 4.10.67-x86_64, 4.11 nightly, 4.12 nightly, 4.13 nightly, 4.14 nightly, 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest    

      How reproducible:

      Apparently always. The issue was found in our Prow CI, and I also reproduced it.

      Steps to Reproduce:

      1. Create an AWS IPI 4.1 cluster, then upgrade it one minor version at a time to 4.14.
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.11.0-0.nightly-2024-01-19-110702   True        True          26m     Working towards 4.12.0-0.nightly-2024-02-04-062856: 654 of 830 done (78% complete), waiting on authentication, openshift-apiserver, openshift-controller-manager
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-0.nightly-2024-02-04-062856   True        False         5m12s   Cluster version is 4.12.0-0.nightly-2024-02-04-062856
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.0-0.nightly-2024-02-04-062856   True        True          61m     Working towards 4.13.0-0.nightly-2024-02-04-042638: 713 of 841 done (84% complete), waiting up to 40 minutes on machine-config
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.0-0.nightly-2024-02-04-042638   True        False         10m     Cluster version is 4.13.0-0.nightly-2024-02-04-042638
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.13.0-0.nightly-2024-02-04-042638   True        True          17m     Working towards 4.14.0-0.nightly-2024-02-02-173828: 233 of 860 done (27% complete), waiting on control-plane-machine-set, machine-api
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2024-02-02-173828   True        False         18m     Cluster version is 4.14.0-0.nightly-2024-02-02-173828     
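
A minimal sketch of how the wait between upgrade hops could be scripted. `upgrade_settled` is a hypothetical helper (not part of `oc`); it assumes the default table layout of `oc get clusterversion` shown in the transcript above and only parses that text.

```shell
# upgrade_settled TARGET   (hypothetical helper, not an oc subcommand)
# Reads `oc get clusterversion` table output on stdin.
# Succeeds only if the "version" row shows VERSION == TARGET,
# AVAILABLE == True and PROGRESSING == False.
upgrade_settled() {
    awk -v target="$1" '
        $1 == "version" {
            found = 1
            ok = ($2 == target && $3 == "True" && $4 == "False")
        }
        # Exit status 0 only when the row was seen and all three checks passed.
        END { exit !(found && ok) }
    '
}
```

One way to use it between hops is a polling loop such as `until oc get clusterversion | upgrade_settled "$target"; do sleep 60; done`, where `$target` is the version string of the hop being waited on.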
      
      2. Once the cluster is upgraded to 4.14, check that scaling up a machineset succeeds.
      liuhuali@Lius-MacBook-Pro huali-test %  oc create -f ms1.yaml 
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa created
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
      NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a    1         1         1       1           14h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa   0         0                             3s
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f    2         2         2       2           14h
      liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=1
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                  PHASE     TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa-mt9kh   Running   m6a.xlarge   us-east-1   us-east-1a   15m
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running   m6a.xlarge   us-east-1   us-east-1f   15h
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                           STATUS   ROLES    AGE     VERSION
      ip-10-0-128-51.ec2.internal    Ready    master   15h     v1.27.10+28ed2d7
      ip-10-0-143-198.ec2.internal   Ready    worker   14h     v1.27.10+28ed2d7
      ip-10-0-143-64.ec2.internal    Ready    worker   14h     v1.27.10+28ed2d7
      ip-10-0-143-80.ec2.internal    Ready    master   15h     v1.27.10+28ed2d7
      ip-10-0-144-123.ec2.internal   Ready    master   15h     v1.27.10+28ed2d7
      ip-10-0-147-94.ec2.internal    Ready    worker   14h     v1.27.10+28ed2d7
      ip-10-0-158-61.ec2.internal    Ready    worker   3m40s   v1.27.10+28ed2d7
      liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa --replicas=0
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa scaled
      liuhuali@Lius-MacBook-Pro huali-test % oc get node                                                                   
      NAME                           STATUS   ROLES    AGE   VERSION
      ip-10-0-128-51.ec2.internal    Ready    master   15h   v1.27.10+28ed2d7
      ip-10-0-143-198.ec2.internal   Ready    worker   15h   v1.27.10+28ed2d7
      ip-10-0-143-64.ec2.internal    Ready    worker   15h   v1.27.10+28ed2d7
      ip-10-0-143-80.ec2.internal    Ready    master   15h   v1.27.10+28ed2d7
      ip-10-0-144-123.ec2.internal   Ready    master   15h   v1.27.10+28ed2d7
      ip-10-0-147-94.ec2.internal    Ready    worker   15h   v1.27.10+28ed2d7
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine                                                                
      NAME                                                 PHASE     TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                  Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                  Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                  Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt   Running   m6a.xlarge   us-east-1   us-east-1a   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb   Running   m6a.xlarge   us-east-1   us-east-1f   15h
      liuhuali@Lius-MacBook-Pro huali-test % oc delete machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa 
      machineset.machine.openshift.io "ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1aa" deleted
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2024-02-02-173828   True        False         43m     Cluster version is 4.14.0-0.nightly-2024-02-02-173828     
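
The scale check in this step comes down to one more Ready worker appearing in `oc get node` after the scale-up and disappearing again after the scale-down. A rough sketch of that count follows; `count_ready_workers` is a hypothetical helper that assumes the default `oc get node` column order (NAME, STATUS, ROLES, ...).

```shell
# count_ready_workers   (hypothetical helper)
# Reads `oc get node` table output on stdin and prints the number of nodes
# whose STATUS is exactly "Ready" and whose ROLES include "worker".
# Nodes reported as e.g. "Ready,SchedulingDisabled" are deliberately not counted.
count_ready_workers() {
    awk 'NR > 1 && $2 == "Ready" && $3 ~ /(^|,)worker(,|$)/ { n++ }
         END { print n + 0 }'
}
```

Run as `oc get node | count_ready_workers` before and after `oc scale machineset ... --replicas=1`. With the transcript above, the count would go from 3 to 4 and back to 3.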
      
      3. Upgrade to 4.15.
      The upgrade to a 4.15 nightly got stuck on operator-lifecycle-manager-packageserver, which is a known bug (https://issues.redhat.com/browse/OCPBUGS-28744), so I built an image with the fix PR (job build openshift/operator-framework-olm#679 succeeded) and upgraded to that image. The upgrade succeeded.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.14.0-0.nightly-2024-02-02-173828   True        True          7s      Working towards 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest: 10 of 875 done (1% complete)
      liuhuali@Lius-MacBook-Pro huali-test % oc get clusterversion
      NAME      VERSION                                                   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         23m     Cluster version is 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest
      liuhuali@Lius-MacBook-Pro huali-test % oc get co
      NAME                                       VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      baremetal                                  4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      11h     
      cloud-controller-manager                   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      8h      
      cloud-credential                           4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      cluster-autoscaler                         4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      config-operator                            4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      13h     
      console                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      3h19m   
      control-plane-machine-set                  4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      5h      
      csi-snapshot-controller                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      7h10m   
      dns                                        4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      etcd                                       4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      image-registry                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      33m     
      ingress                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      insights                                   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      kube-apiserver                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      kube-controller-manager                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      kube-scheduler                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      14h     
      kube-storage-version-migrator              4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      34m     
      machine-api                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      machine-approver                           4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      13h     
      machine-config                             4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      10h     
      marketplace                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      10h     
      monitoring                                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      network                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      node-tuning                                4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      56m     
      openshift-apiserver                        4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      openshift-controller-manager               4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      4h56m   
      openshift-samples                          4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      58m     
      operator-lifecycle-manager                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      operator-lifecycle-manager-catalog         4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      operator-lifecycle-manager-packageserver   4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      57m     
      service-ca                                 4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      16h     
      storage                                    4.15.0-0.ci.test-2024-02-05-022753-ci-ln-7mxfqgt-latest   True        False         False      9h      
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                 PHASE     TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                  Running   m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                  Running   m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                  Running   m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt   Running   m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k   Running   m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb   Running   m6a.xlarge   us-east-1   us-east-1f   16h 
      
      4. Scale up a machineset again. The new machine is stuck in Provisioned, and no CSR is pending.
      
      liuhuali@Lius-MacBook-Pro huali-test % oc create -f ms1.yaml 
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 created
      liuhuali@Lius-MacBook-Pro huali-test % oc get machineset
      NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a    1         1         1       1           16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1   0         0                             6s
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f    2         2         2       2           16h
      liuhuali@Lius-MacBook-Pro huali-test % oc scale machineset ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 --replicas=1
      machineset.machine.openshift.io/ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1 scaled
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                  PHASE          TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running        m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running        m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running        m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running        m6a.xlarge   us-east-1   us-east-1a   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877   Provisioning   m6a.xlarge   us-east-1   us-east-1a   4s
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running        m6a.xlarge   us-east-1   us-east-1f   16h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running        m6a.xlarge   us-east-1   us-east-1f   16h
      liuhuali@Lius-MacBook-Pro huali-test % oc get machine
      NAME                                                  PHASE         TYPE         REGION      ZONE         AGE
      ci-op-trzci0vq-8a8c4-dq95h-master-0                   Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-master-1                   Running       m6a.xlarge   us-east-1   us-east-1a   18h
      ci-op-trzci0vq-8a8c4-dq95h-master-2                   Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a-pqnqt    Running       m6a.xlarge   us-east-1   us-east-1a   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877   Provisioned   m6a.xlarge   us-east-1   us-east-1a   97m
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-h2f9k    Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f-lgmjb    Running       m6a.xlarge   us-east-1   us-east-1f   18h
      ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1f1-4ln47   Provisioned   m6a.xlarge   us-east-1   us-east-1f   50m
      liuhuali@Lius-MacBook-Pro huali-test % oc get node
      NAME                           STATUS   ROLES    AGE   VERSION
      ip-10-0-128-51.ec2.internal    Ready    master   18h   v1.28.6+a373c1b
      ip-10-0-143-198.ec2.internal   Ready    worker   18h   v1.28.6+a373c1b
      ip-10-0-143-64.ec2.internal    Ready    worker   18h   v1.28.6+a373c1b
      ip-10-0-143-80.ec2.internal    Ready    master   18h   v1.28.6+a373c1b
      ip-10-0-144-123.ec2.internal   Ready    master   18h   v1.28.6+a373c1b
      ip-10-0-147-94.ec2.internal    Ready    worker   18h   v1.28.6+a373c1b
      liuhuali@Lius-MacBook-Pro huali-test % oc get csr
      NAME        AGE   SIGNERNAME                                    REQUESTOR                                  REQUESTEDDURATION   CONDITION
      csr-596n7   21m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-147-94.ec2.internal    <none>              Approved,Issued
      csr-7nr9m   42m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-147-94.ec2.internal    <none>              Approved,Issued
      csr-bc9n7   16m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-128-51.ec2.internal    <none>              Approved,Issued
      csr-dmk27   18m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-128-51.ec2.internal    <none>              Approved,Issued
      csr-ggkgd   64m   kubernetes.io/kube-apiserver-client-kubelet   system:node:ip-10-0-143-198.ec2.internal   <none>              Approved,Issued
      csr-rs9cz   70m   kubernetes.io/kubelet-serving                 system:node:ip-10-0-143-80.ec2.internal    <none>              Approved,Issued
      liuhuali@Lius-MacBook-Pro huali-test %     
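
The two observations in this step (machines sitting in Provisioned, and no CSR waiting for approval) can be pulled out of the same command output with small filters. This is only a sketch: `stuck_machines` and `pending_csrs` are hypothetical helper names, and both assume the default table layouts of `oc get machine` and `oc get csr` shown above.

```shell
# stuck_machines   (hypothetical helper)
# Reads `oc get machine` output on stdin; prints "NAME AGE" for every
# machine whose PHASE column is exactly "Provisioned".
stuck_machines() {
    awk 'NR > 1 && $2 == "Provisioned" { print $1, $NF }'
}

# pending_csrs   (hypothetical helper)
# Reads `oc get csr` output on stdin; prints the name of every CSR whose
# CONDITION (last column) is "Pending". Prints nothing if none are pending.
pending_csrs() {
    awk 'NR > 1 && $NF == "Pending" { print $1 }'
}
```

Used as `oc get machine | stuck_machines` and `oc get csr | pending_csrs`. A non-empty first result together with an empty second one matches the symptom reported here: the instance exists, but the node never requests a certificate to join the cluster.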

      Actual results:

       Machine stuck in Provisioned   

      Expected results:

        The machine should reach the Running phase.

      Additional info:

      Must-gather: https://drive.google.com/file/d/1TrZ_mb-cHKmrNMsuFl9qTdYo_eNPuF_l/view?usp=sharing
      The provisioned machine is visible in the AWS console: https://drive.google.com/file/d/1-OcsmvfzU4JBeGh5cil8P2Hoe5DQsmqF/view?usp=sharing
      System log of ci-op-trzci0vq-8a8c4-dq95h-worker-us-east-1a1-5g877: https://drive.google.com/file/d/1spVT_o0S4eqeQxE5ivttbAazCCuSzj1e/view?usp=sharing
      Some logs from the instance: https://drive.google.com/file/d/1zjxPxm61h4L6WVHYv-w7nRsSz5Fku26w/view?usp=sharing
          

            rhn-engineering-skumari Sinny Kumari
            huliu@redhat.com Huali Liu
            Sergio Regidor de la Rosa Sergio Regidor de la Rosa