Description of problem:
The cluster upgrade fails on the Azure platform with the details below. Original reproduction steps: https://issues.redhat.com/browse/OCPQE-12728
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Reproduced in 2 out of 3 attempts (2/3).
Steps to Reproduce:
1. Create a cluster with the details below.
2. Run the upgrade producer job, which runs the precheck test cases before the upgrade.
3. Run the upgrade runner job to upgrade the cluster (4.10.41-x86_64 => 4.11.0-0.nightly-2022-11-15-184013). The upgrade fails because a node fails to upgrade.
Create cluster details:
Flexy job id: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/156138/console
Template: private-templates/functionality-testing/aos-4_10/ipi-on-azure/versioned-installer-fully_private_cluster-NAT-ci
Profile: 69_IPI on Azure & fully private
Payload: 4.10.41-x86_64
launcher variables:
vm_type_masters: Standard_F16s
vm_type_workers: Standard_F8s

rohitpatil@ropatil-mac Downloads % oc get nodes
NAME                                      STATUS   ROLES    AGE    VERSION
ropatil1611az-gs6zj-master-0              Ready    master   140m   v1.23.12+7566c4d
ropatil1611az-gs6zj-master-1              Ready    master   140m   v1.23.12+7566c4d
ropatil1611az-gs6zj-master-2              Ready    master   140m   v1.23.12+7566c4d
ropatil1611az-gs6zj-worker-westus-j4ngg   Ready    worker   130m   v1.23.12+7566c4d
ropatil1611az-gs6zj-worker-westus-mk98k   Ready    worker   130m   v1.23.12+7566c4d
ropatil1611az-gs6zj-worker-westus-ssrst   Ready    worker   130m   v1.23.12+7566c4d

rohitpatil@ropatil-mac Downloads % oc get sc
NAME                        PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
managed-csi                 disk.csi.azure.com         Delete          WaitForFirstConsumer   true                   139m
managed-premium (default)   kubernetes.io/azure-disk   Delete          WaitForFirstConsumer   true                   139m

rohitpatil@ropatil-mac Downloads % oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.41   True        False         135m    Cluster version is 4.10.41

Upgrade producer: To run all the upgrade precheck tc
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/ginkgo-test/128901/console
11-16 19:45:32.347 8 pass, 2 skip (4m7s)
11-16 19:45:32.347 The Case Execution Summary:
11-16 19:45:32.347 PASS OCP-22615 Author:xzha prepare to check the OLM status
11-16 19:45:32.347 PASS OCP-22618 Author:xzha prepare to check the marketplace status
11-16 19:45:32.347 PASS OCP-48669 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] block volumes resize off-line
11-16 19:45:32.347 PASS OCP-49496 Author:ropatil Prepare [CSIMigration] PVCs created with in-tree storageclass,mountOptions are processed by CSI Driver after CSI migration is enabled
11-16 19:45:32.347 PASS OCP-49678 Author:ropatil Prepare [CSIMigration] PVCs created with in-tree storageclass, block volume are processed by CSI Driver after CSI migration is enabled
11-16 19:45:32.347 PASS OCP-50362 Author:jmekkatt Prepare Upgrade checks when cluster has bad admission webhooks [Serial]
11-16 19:45:32.347 PASS OCP-50425 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] [Filesystem] volumes resize off-line
11-16 19:45:32.347 SKIP OCP-50427 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] [Filesystem] volumes resize on-line
11-16 19:45:32.347 SKIP OCP-50428 Author:ropatil Prepare [CSI-Migration] [Dynamic PV] block volumes resize on-line
11-16 19:45:32.347 PASS OCP-55213 Author:gkarager Upgrade should succeed when custom SCC is created with readOnlyRootFilesystem set to true
rohitpatil@ropatil-mac Downloads % oc get pvc,pod -n migration-upgrade-49678 -o wide
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE     VOLUMEMODE
persistentvolumeclaim/mypvc-49678   Bound    pvc-cf8f4da1-5211-406e-986b-55c19439ee1a   2Gi        RWO            mysc-49678     8m57s   Block

NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                      NOMINATED NODE   READINESS GATES
pod/mydep-49678-6c64845f74-kvbp4   1/1     Running   0          8m53s   10.129.2.22   ropatil1611az-gs6zj-worker-westus-j4ngg   <none>           <none>
Upgrade runner: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-runner/24324/console
After upgrade:
rohitpatil@ropatil-mac Downloads % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-11-15-184013   True        False         51m     Cluster version is 4.11.0-0.nightly-2022-11-15-184013

rohitpatil@ropatil-mac Downloads % oc get nodes
NAME                                      STATUS                     ROLES    AGE     VERSION
ropatil1611az-gs6zj-master-0              Ready                      master   4h30m   v1.24.6+5658434
ropatil1611az-gs6zj-master-1              Ready                      master   4h30m   v1.24.6+5658434
ropatil1611az-gs6zj-master-2              Ready                      master   4h30m   v1.24.6+5658434
ropatil1611az-gs6zj-worker-westus-j4ngg   Ready,SchedulingDisabled   worker   4h20m   v1.23.12+7566c4d
ropatil1611az-gs6zj-worker-westus-mk98k   Ready                      worker   4h20m   v1.24.6+5658434
ropatil1611az-gs6zj-worker-westus-ssrst   Ready                      worker   4h20m   v1.23.12+7566c4d

rohitpatil@ropatil-mac Downloads % oc get pvc,pod -n migration-upgrade-49678 -o wide
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE    VOLUMEMODE
persistentvolumeclaim/mypvc-49678   Bound    pvc-cf8f4da1-5211-406e-986b-55c19439ee1a   2Gi        RWO            mysc-49678     123m   Block

NAME                               READY   STATUS        RESTARTS   AGE    IP            NODE                                      NOMINATED NODE   READINESS GATES
pod/mydep-49678-6c64845f74-kvbp4   0/1     Terminating   0          123m   10.129.2.22   ropatil1611az-gs6zj-worker-westus-j4ngg   <none>           <none>
pod/mydep-49678-6c64845f74-szxkd   1/1     Running       0          63m    10.131.0.25   ropatil1611az-gs6zj-worker-westus-mk98k   <none>           <none>
Actual results:
The cluster upgrade fails: one worker node remains in the SchedulingDisabled state because a pod on it is stuck in the Terminating state.
Expected results:
The cluster upgrade should complete successfully, with all nodes upgraded.
Additional info:
1. This was observed on this template/profile.
2. The issue is seen only on pods that use a Block volume mode PVC.
3. Additional log analysis and the reason for the node upgrade failure can be found here: https://issues.redhat.com/browse/OCPQE-12728
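To narrow the search to the Block volume mode case noted above, a short Python sketch like the following could cross-reference pods and PVCs. It is an illustration under assumptions, not the method used in this report: it expects the dictionaries produced by parsing `oc get pods -o json` and `oc get pvc -o json`, and relies on the standard Kubernetes fields `spec.volumeMode`, `metadata.deletionTimestamp`, and `spec.volumes[].persistentVolumeClaim.claimName`. The function names are hypothetical, and the sample data is hand-built from the names in this report; the timestamp, the volume name, and the pod-to-claim mapping are placeholders.

```python
from typing import List, Set, Tuple


def block_pvcs(pvc_list: dict) -> Set[Tuple[str, str]]:
    """Return (namespace, name) for every PVC whose spec.volumeMode is Block."""
    return {
        (item["metadata"]["namespace"], item["metadata"]["name"])
        for item in pvc_list.get("items", [])
        if item.get("spec", {}).get("volumeMode") == "Block"
    }


def terminating_pods_on_block_pvcs(pod_list: dict, pvc_list: dict) -> List[str]:
    """Return 'namespace/pod' for pods marked for deletion that use a Block-mode PVC."""
    block = block_pvcs(pvc_list)
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        # A pod displayed as Terminating has metadata.deletionTimestamp set.
        if "deletionTimestamp" not in meta:
            continue
        for vol in pod.get("spec", {}).get("volumes", []):
            claim = vol.get("persistentVolumeClaim")
            if claim and (meta["namespace"], claim["claimName"]) in block:
                hits.append(f'{meta["namespace"]}/{meta["name"]}')
                break
    return hits


# Hand-built sample mirroring the names in this report (placeholder values noted).
NS = "migration-upgrade-49678"
PVCS = {"items": [
    {"metadata": {"namespace": NS, "name": "mypvc-49678"},
     "spec": {"volumeMode": "Block"}},
]}
_VOLUMES = [{"name": "data",  # placeholder volume name
             "persistentVolumeClaim": {"claimName": "mypvc-49678"}}]
PODS = {"items": [
    {"metadata": {"namespace": NS, "name": "mydep-49678-6c64845f74-kvbp4",
                  "deletionTimestamp": "2022-11-16T00:00:00Z"},  # placeholder time
     "spec": {"volumes": _VOLUMES}},
    {"metadata": {"namespace": NS, "name": "mydep-49678-6c64845f74-szxkd"},
     "spec": {"volumes": _VOLUMES}},
]}

if __name__ == "__main__":
    print(terminating_pods_on_block_pvcs(PODS, PVCS))
```

In practice the two dictionaries would come from `json.loads` applied to the JSON output of the `oc` commands for the affected namespace. Matching on `deletionTimestamp` rather than the displayed STATUS column avoids depending on how the CLI renders pod phases.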
Links to: RHEA-2023:5006 rpm