OpenShift Bugs / OCPBUGS-3784

[Upgrade Issue] 4.10 to 4.11 upgrade fails due to "failed to drain node xxxx"


    • Bug
    • Resolution: Done-Errata
    • 4.11
    • Storage

      Description of problem:

      The upgrade fails on the Azure platform; details are given below.
      
      Linking the original reproduction steps: 
      https://issues.redhat.com/browse/OCPQE-12728

       

      Version-Release number of selected component (if applicable):

      4.10

      How reproducible:

      Reproduced in 2 of 3 attempts (2/3)

      Steps to Reproduce:

      1. Create a cluster with the details listed below.
      2. Run the upgrade producer job, which runs the pre-check test cases before the upgrade.
      3. Run the upgrade runner job to upgrade the cluster (4.10.41-x86_64 => 4.11.0-0.nightly-2022-11-15-184013). The cluster upgrade fails because a node fails to upgrade.

      Cluster creation details:
      Flexy job id: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/156138/console
      Template: private-templates/functionality-testing/aos-4_10/ipi-on-azure/versioned-installer-fully_private_cluster-NAT-ci
      Profile: 69_IPI on Azure & fully private
      Payload: 4.10.41-x86_64
      Launcher variables:
         vm_type_masters: Standard_F16s
         vm_type_workers: Standard_F8s
      
      rohitpatil@ropatil-mac Downloads % oc get nodes
      NAME                                      STATUS   ROLES    AGE    VERSION
      ropatil1611az-gs6zj-master-0              Ready    master   140m   v1.23.12+7566c4d
      ropatil1611az-gs6zj-master-1              Ready    master   140m   v1.23.12+7566c4d
      ropatil1611az-gs6zj-master-2              Ready    master   140m   v1.23.12+7566c4d
      ropatil1611az-gs6zj-worker-westus-j4ngg   Ready    worker   130m   v1.23.12+7566c4d
      ropatil1611az-gs6zj-worker-westus-mk98k   Ready    worker   130m   v1.23.12+7566c4d
      ropatil1611az-gs6zj-worker-westus-ssrst   Ready    worker   130m   v1.23.12+7566c4d
      
      rohitpatil@ropatil-mac Downloads % oc get sc
      NAME                        PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
      managed-csi                 disk.csi.azure.com         Delete          WaitForFirstConsumer   true                   139m
      managed-premium (default)   kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   139m
      
      rohitpatil@ropatil-mac Downloads % oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.10.41   True        False         135m    Cluster version is 4.10.41
      
      Upgrade producer: runs all the upgrade pre-check test cases
      https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/ginkgo-test/128901/console
      
      11-16 19:45:32.347  8 pass, 2 skip (4m7s)
      11-16 19:45:32.347  The Case Execution Summary:
      11-16 19:45:32.347   PASS OCP-22615 Author:xzha prepare to check the OLM status 
      11-16 19:45:32.347   PASS OCP-22618 Author:xzha prepare to check the marketplace status 
      11-16 19:45:32.347   PASS OCP-48669 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] block volumes resize off-line 
      11-16 19:45:32.347   PASS OCP-49496 Author:ropatil Prepare [CSIMigration] PVCs created with in-tree storageclass,mountOptions are processed by CSI Driver after CSI migration is enabled 
      11-16 19:45:32.347   PASS OCP-49678 Author:ropatil Prepare [CSIMigration] PVCs created with in-tree storageclass, block volume are processed by CSI Driver after CSI migration is enabled 
      11-16 19:45:32.347   PASS OCP-50362 Author:jmekkatt Prepare Upgrade checks when cluster has bad admission webhooks [Serial] 
      11-16 19:45:32.347   PASS OCP-50425 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] [Filesystem] volumes resize off-line 
      11-16 19:45:32.347   SKIP OCP-50427 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] [Filesystem] volumes resize on-line 
      11-16 19:45:32.347   SKIP OCP-50428 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] block volumes resize on-line 
      11-16 19:45:32.347   PASS OCP-55213 Author:gkarager Upgrade should succeed when custom SCC is created with readOnlyRootFilesystem set to true 
      rohitpatil@ropatil-mac Downloads % oc get pvc,pod -n migration-upgrade-49678 -o wide
      NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE     VOLUMEMODE
      persistentvolumeclaim/mypvc-49678   Bound    pvc-cf8f4da1-5211-406e-986b-55c19439ee1a   2Gi        RWO            mysc-49678     8m57s   Block

      NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                      NOMINATED NODE   READINESS GATES
      pod/mydep-49678-6c64845f74-kvbp4   1/1     Running   0          8m53s   10.129.2.22   ropatil1611az-gs6zj-worker-westus-j4ngg   <none>           <none>

      Upgrade runner:  https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-runner/24324/console

      After upgrade:

      rohitpatil@ropatil-mac Downloads % oc get clusterversion
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.11.0-0.nightly-2022-11-15-184013   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-11-15-184013 

       

      rohitpatil@ropatil-mac Downloads % oc get nodes
      NAME                                      STATUS                     ROLES    AGE     VERSION
      ropatil1611az-gs6zj-master-0              Ready                      master   4h30m   v1.24.6+5658434
      ropatil1611az-gs6zj-master-1              Ready                      master   4h30m   v1.24.6+5658434
      ropatil1611az-gs6zj-master-2              Ready                      master   4h30m   v1.24.6+5658434
      ropatil1611az-gs6zj-worker-westus-j4ngg   Ready,SchedulingDisabled   worker   4h20m   v1.23.12+7566c4d
      ropatil1611az-gs6zj-worker-westus-mk98k   Ready                      worker   4h20m   v1.24.6+5658434
      ropatil1611az-gs6zj-worker-westus-ssrst   Ready                      worker   4h20m   v1.23.12+7566c4d 
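
To pick out the affected workers from a listing like the one above without reading it by eye, the output can be filtered with awk. This is a minimal sketch, not part of `oc`: the function names `cordoned_nodes` and `nodes_not_at_version` are hypothetical helpers, and they assume the default `oc get nodes` column layout (NAME first, STATUS second, VERSION last).

```shell
# Hypothetical helper (not an oc subcommand): reads `oc get nodes` output on
# stdin and prints the name and kubelet version of every cordoned node.
# Assumes STATUS is the second column and VERSION is the last column.
cordoned_nodes() {
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1, $NF }'
}

# Hypothetical helper: prints every node whose kubelet version (last column)
# differs from the expected version passed as the first argument.
nodes_not_at_version() {
  awk -v want="$1" 'NR > 1 && $NF != want { print $1, $NF }'
}

# Intended usage against a live cluster:
#   oc get nodes | cordoned_nodes
#   oc get nodes | nodes_not_at_version 'v1.24.6+5658434'
```

Run against the listing above, `cordoned_nodes` would report only `ropatil1611az-gs6zj-worker-westus-j4ngg`, while `nodes_not_at_version 'v1.24.6+5658434'` would also report `ropatil1611az-gs6zj-worker-westus-ssrst`, which has not yet been upgraded.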
      
      rohitpatil@ropatil-mac Downloads % oc get pvc,pod -n migration-upgrade-49678 -o wide
      NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE    VOLUMEMODE
      persistentvolumeclaim/mypvc-49678   Bound    pvc-cf8f4da1-5211-406e-986b-55c19439ee1a   2Gi        RWO            mysc-49678     123m   Block

      NAME                               READY   STATUS        RESTARTS   AGE    IP            NODE                                      NOMINATED NODE   READINESS GATES
      pod/mydep-49678-6c64845f74-kvbp4   0/1     Terminating   0          123m   10.129.2.22   ropatil1611az-gs6zj-worker-westus-j4ngg   <none>           <none>
      pod/mydep-49678-6c64845f74-szxkd   1/1     Running       0          63m    10.131.0.25   ropatil1611az-gs6zj-worker-westus-mk98k   <none>           <none>
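
To tie a Terminating pod to the node it is holding up, the `-o wide` output above can be filtered in the same way. This is a rough sketch under stated assumptions: `terminating_pods` is a hypothetical helper name, STATUS is taken to be the third column, and NODE is taken to be the third column from the end (followed only by NOMINATED NODE and READINESS GATES, each a single token such as `<none>`).

```shell
# Hypothetical helper (not an oc subcommand): reads `oc get pod -o wide` or
# `oc get pvc,pod -o wide` output on stdin and prints "<pod> <node>" for each
# pod whose STATUS column is Terminating.
# NODE is read as the third field from the end so that a multi-word RESTARTS
# value does not shift it.
terminating_pods() {
  awk '$3 == "Terminating" { print $1, $(NF - 2) }'
}

# Intended usage against a live cluster:
#   oc get pvc,pod -n migration-upgrade-49678 -o wide | terminating_pods
```

For the output above this would report `pod/mydep-49678-6c64845f74-kvbp4` on `ropatil1611az-gs6zj-worker-westus-j4ngg`, the same worker that is shown as `Ready,SchedulingDisabled`.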
      

       

      Actual results:

      The cluster upgrade fails: one node remains in the SchedulingDisabled state because a pod on it stays in the Terminating state.

      Expected results:

      The cluster upgrade should complete successfully.

      Additional info:

      1. The issue has been observed on this template/profile.
      2. The issue is seen only for pods that use a volume with Block volume mode.
      3. Additional log analysis and the reason for the node upgrade failure can be found here: https://issues.redhat.com/browse/OCPQE-12728

       

              fbertina@redhat.com Fabio Bertinatto
              ropatil@redhat.com Rohit Patil