[OCPBUGS-42932] 4.15-4.18 upgrade stuck on authentication operator during stage of 4.17-4.18 update

    • Bug
    • Resolution: Done-Errata
    • 4.18
    • apiserver-auth
    • Critical
    • Yes
    • Approved
    • False
    • Release Note Not Required
    • In Progress

      Description of problem:

          Failed ci jobs:
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-multi-nightly-4.18-cpou-upgrade-from-4.15-aws-ipi-mini-perm-arm-f14/1842004955238502400
      
      https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-arm64-nightly-4.18-cpou-upgrade-from-4.15-azure-ipi-fullyprivate-proxy-f14/1841942041722884096
      
      The 4.15-4.18 upgrade failed at the 4.17 to 4.18 update stage: the authentication operator became degraded and unavailable due to APIServerDeployment_PreconditionNotFulfilled.
      
      $ omc get clusterversion
      NAME      VERSION                                    AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.17.0-0.nightly-arm64-2024-10-03-172957   True        True          1h44m   Unable to apply 4.18.0-0.nightly-arm64-2024-10-03-125849: the cluster operator authentication is not available
      
      $ omc get co authentication
      NAME             VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE
      authentication   4.18.0-0.nightly-arm64-2024-10-03-125849   False       False         True       8h
      
      $ omc get co authentication -ojson | jq '.status.conditions[]'
      {
        "lastTransitionTime": "2024-10-04T04:22:39Z",
        "message": "APIServerDeploymentDegraded: waiting for .status.latestAvailableRevision to be available\nAPIServerDeploymentDegraded: ",
        "reason": "APIServerDeployment_PreconditionNotFulfilled",
        "status": "True",
        "type": "Degraded"
      }
      {
        "lastTransitionTime": "2024-10-04T03:54:13Z",
        "message": "AuthenticatorCertKeyProgressing: All is well",
        "reason": "AsExpected",
        "status": "False",
        "type": "Progressing"
      }
      {
        "lastTransitionTime": "2024-10-04T03:52:34Z",
        "reason": "APIServerDeployment_PreconditionNotFulfilled",
        "status": "False",
        "type": "Available"
      }
      {
        "lastTransitionTime": "2024-10-03T21:32:31Z",
        "message": "All is well",
        "reason": "AsExpected",
        "status": "True",
        "type": "Upgradeable"
      }
      {
        "lastTransitionTime": "2024-10-04T00:04:57Z",
        "reason": "NoData",
        "status": "Unknown",
        "type": "EvaluationConditionsDetected"
      }
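To make the condition dump above easier to read at a glance, here is a minimal Python sketch that picks out the conditions that are not in their healthy state. The condition values are copied from the output above (trimmed to type, status, and reason). The helper name `unhealthy_conditions` and the `HEALTHY` table are assumptions made for illustration; they are not part of `omc` or of any operator code.

```python
# Conditions copied from the `omc get co authentication -ojson` output above,
# trimmed to the fields used here.
CONDITIONS = [
    {"type": "Degraded", "status": "True",
     "reason": "APIServerDeployment_PreconditionNotFulfilled"},
    {"type": "Progressing", "status": "False", "reason": "AsExpected"},
    {"type": "Available", "status": "False",
     "reason": "APIServerDeployment_PreconditionNotFulfilled"},
    {"type": "Upgradeable", "status": "True", "reason": "AsExpected"},
    {"type": "EvaluationConditionsDetected", "status": "Unknown",
     "reason": "NoData"},
]

# Assumed "healthy" status per condition type (illustrative; Progressing and
# other informational types are deliberately left out).
HEALTHY = {"Available": "True", "Degraded": "False", "Upgradeable": "True"}


def unhealthy_conditions(conditions):
    """Return (type, reason) for each tracked condition that is not healthy."""
    return [
        (c["type"], c.get("reason", ""))
        for c in conditions
        if c["type"] in HEALTHY and c["status"] != HEALTHY[c["type"]]
    ]


if __name__ == "__main__":
    for cond_type, reason in unhealthy_conditions(CONDITIONS):
        print(f"{cond_type}: {reason}")
```

For the data in this report, the helper flags Degraded and Available, both with reason APIServerDeployment_PreconditionNotFulfilled.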

      Version-Release number of selected component (if applicable):

       4.18.0-0.nightly-arm64-2024-10-03-125849
       4.18.0-0.nightly-multi-2024-10-03-193054
      
      

      How reproducible:

          always

      Steps to Reproduce:

          1. Upgrade from 4.15 to 4.16, then to 4.17, and then to 4.18.

      Actual results:

          The upgrade is stuck on the authentication operator.

      Expected results:

          The upgrade succeeds.

      Additional info:

          The issue was found in control-plane-only update (CPOU) jobs (with a paused worker pool), but it is not CPOU-specific, because it can also be reproduced in a normal chained upgrade from 4.15 to 4.18.


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.18.1 bug fix and security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2024:6122


            Xingxing Xia added a comment -

            As for .readyReplicas being 0, I filed a separate bug, OCPBUGS-43909.

            Xingxing Xia added a comment -

            Given that the previously failing CI jobs now pass without the failures, moving it to Verified. Corresponding notes:

            In 4.18, the status now shows the correct .status.latestAvailableRevision. In 4.18.0-0.nightly-2024-10-28-141654:
            $ oc get authentication.operator cluster -o yaml | grep -A 1000 "^status:"
            ...
              generations:
              - group: apps
                lastGeneration: 5
                name: apiserver
                namespace: openshift-oauth-apiserver
                resource: deployments
              - group: apps
                lastGeneration: 4
                name: oauth-openshift
                namespace: openshift-authentication
                resource: deployments
              latestAvailableRevision: 1
              readyReplicas: 0

            Comparing with 4.17 (4.17.0-0.nightly-2024-10-28-130315):
            $ oc get authentication.operator cluster -o yaml | grep -A 1000 "^status:"
            ...
              generations:
              - group: apps
                hash: ""
                lastGeneration: 3
                name: apiserver
                namespace: openshift-oauth-apiserver
                resource: deployments
              - group: apps
                hash: ""
                lastGeneration: 3
                name: oauth-openshift
                namespace: openshift-authentication
                resource: deployments
              oauthAPIServer:
                latestAvailableRevision: 1
              readyReplicas: 0
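The difference between the two status dumps in this comment can be expressed as a set of field paths. The following is a minimal Python sketch, assuming the status blocks have already been loaded into dicts. The dict contents mirror the 4.17 and 4.18 outputs quoted in the comment, with the `generations` entries left empty for brevity. The helper names `key_paths` and `shape_diff` are hypothetical.

```python
def key_paths(obj, prefix):
    """Collect dotted paths of all scalar fields in a nested dict.

    Lists (such as `generations`) are skipped to keep the sketch short.
    """
    paths = set()
    for key, value in obj.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            paths |= key_paths(value, path)
        elif not isinstance(value, list):
            paths.add(path)
    return paths


def shape_diff(old_status, new_status):
    """Return (paths only in old, paths only in new), each sorted."""
    old_paths = key_paths(old_status, ".status")
    new_paths = key_paths(new_status, ".status")
    return sorted(old_paths - new_paths), sorted(new_paths - old_paths)


# Shapes taken from the comment above; `generations` contents omitted.
STATUS_4_17 = {
    "generations": [],
    "oauthAPIServer": {"latestAvailableRevision": 1},
    "readyReplicas": 0,
}
STATUS_4_18 = {
    "generations": [],
    "latestAvailableRevision": 1,
    "readyReplicas": 0,
}

if __name__ == "__main__":
    removed, added = shape_diff(STATUS_4_17, STATUS_4_18)
    print("only in 4.17:", removed)
    print("only in 4.18:", added)
```

With these inputs, the only path unique to 4.17 is .status.oauthAPIServer.latestAvailableRevision and the only path unique to 4.18 is .status.latestAvailableRevision.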

            Jia Liu added a comment -

            Reran both of the 4.15-4.18 CI jobs; the upgrades passed.
            https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-multi-nightly-4.18-cpou-upgrade-from-4.15-aws-ipi-mini-perm-arm-f28/1850724029330100224
            https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-arm64-nightly-4.18-cpou-upgrade-from-4.15-azure-ipi-fullyprivate-proxy-f28/1850724698871042048

            OpenShift Jira Bot added a comment -

            Hi deads@redhat.com,

            Bugs should not be moved to Verified without first providing a Release Note Type ("Bug Fix" or "No Doc Update") and for type "Bug Fix" the Release Note Text must also be provided. Please populate the necessary fields before moving the Bug to Verified.

            Xingxing Xia added a comment -

            > Downloaded the must-gather and analyzed ...

            deads@redhat.com, yes, the must-gather (download links: must-gather.tar 1, must-gather.tar 2) can be found from the Prow CI job URLs pasted under "Failed ci jobs" in the "Description" section above.

            (General way to get the download: click the job URL (it will prompt for RH internal SSO login) -> click "Artifacts" in the upper right (RH internal login) -> click "artifacts/" -> click "<test_job_name>/" -> click "gather-must-gather/" -> click "artifacts/" -> click "must-gather.tar" to download)

            Jia Liu added a comment -

            Must-gather logs can be found via the failed job links above, for example:
            https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.18-multi-nightly-4.18-cpou-upgrade-from-4.15-aws-ipi-mini-perm-arm-f14/1842004955238502400/artifacts/aws-ipi-mini-perm-arm-f14/gather-must-gather/artifacts/
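The relationship between a Prow job URL from the description and the must-gather artifacts directory in this comment can be sketched in Python. This is only an illustration, inferred from the two URLs quoted in this issue: the prefix mapping and the directory layout are assumptions that may not hold for other jobs, and the function name `must_gather_dir` and its `test_name` parameter are hypothetical.

```python
# Prefixes taken from the URLs quoted in this issue. Treating one as a
# drop-in replacement for the other is an assumption.
PROW_VIEW_PREFIX = (
    "https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/"
)
GCSWEB_PREFIX = (
    "https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/"
)


def must_gather_dir(job_url, test_name):
    """Build the must-gather artifacts directory URL for a Prow job run.

    `test_name` is the per-job directory shown as "<test_job_name>/" in the
    click path (for example "aws-ipi-mini-perm-arm-f14"); it is passed in
    explicitly rather than guessed from the job name.
    """
    if not job_url.startswith(PROW_VIEW_PREFIX):
        raise ValueError("not a recognized Prow job view URL")
    run_path = job_url[len(PROW_VIEW_PREFIX):].rstrip("/")
    return (
        f"{GCSWEB_PREFIX}{run_path}/artifacts/{test_name}"
        "/gather-must-gather/artifacts/"
    )


if __name__ == "__main__":
    job = (
        PROW_VIEW_PREFIX
        + "qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-"
        "release-4.18-multi-nightly-4.18-cpou-upgrade-from-4.15-aws-ipi-mini-"
        "perm-arm-f14/1842004955238502400"
    )
    print(must_gather_dir(job, "aws-ipi-mini-perm-arm-f14"))
```

Both sites require RH internal login, so the sketch only builds the URL and does not attempt a download.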

            David Eads added a comment -

            Is there a must-gather collected for this condition?  We'll need that to debug.


            Xingxing Xia added a comment -

            Compare with other cluster operators (CO) that have only a single operand, e.g. CO openshift-apiserver: they have .latestAvailableRevision directly under .status:

            $ omc get openshiftapiserver.operator cluster -o yaml
            ...
            status:
              ...
              generations:
              - group: apps
                hash: ""
                lastGeneration: 13
                name: apiserver
                namespace: openshift-apiserver
                resource: deployments
              latestAvailableRevision: 2
              observedGeneration: 9
              readyReplicas: 0

            I'm marking this bug as a release Blocker.

            BTW deads@redhat.com, .readyReplicas is 0; this is another bug, right?

            Xingxing Xia added a comment -

            Downloaded the must-gather and analyzed it. CAO (cluster-authentication-operator) differs from other COs in that it has two operands (oauthAPIServer and oauth-openshift), so its "status:" includes .oauthAPIServer.latestAvailableRevision instead of a direct .latestAvailableRevision. That is why the upgrade got stuck at "waiting for .status.latestAvailableRevision to be available". This is caused by the 4.18 MOM epic API-1835; I have commented in one of its many PRs: https://github.com/openshift/cluster-authentication-operator/pull/704/files#r1797602127 .

            $ omc get authentication.operator cluster -o yaml
            ...
            status:
              conditions:
              ...
              - lastTransitionTime: "2024-10-08T17:58:27Z"
                reason: PreconditionNotFulfilled
                status: "False"
                type: APIServerDeploymentAvailable
              - lastTransitionTime: "2024-10-08T17:58:27Z"
                message: |
                  waiting for .status.latestAvailableRevision to be available
                reason: PreconditionNotFulfilled
                status: "True"
                type: APIServerDeploymentDegraded
            ...
              generations:
              - group: apps
                hash: ""
                lastGeneration: 8
                name: apiserver
                namespace: openshift-oauth-apiserver
                resource: deployments
              - group: apps
                hash: ""
                lastGeneration: 7
                name: oauth-openshift
                namespace: openshift-authentication
                resource: deployments
              - group: apps
                hash: ""
                lastGeneration: 0
                name: ""
                namespace: ""
                resource: deployments
              oauthAPIServer:
                latestAvailableRevision: 1
              readyReplicas: 0
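The mismatch described in this comment can be modeled with a toy Python sketch. This is not the operator's code (the operator is written in Go, and its real logic is not shown in this issue). It only illustrates, under the assumption that the precondition reads the top-level field alone, why a status carrying just .oauthAPIServer.latestAvailableRevision would keep reporting the wait message. The function name `check_precondition` and the sample dicts are hypothetical; the message string is the one from the condition above.

```python
WAIT_MESSAGE = "waiting for .status.latestAvailableRevision to be available"


def check_precondition(status):
    """Toy model: consult only the top-level .latestAvailableRevision.

    Returns the wait message when the field is absent, otherwise None.
    """
    if status.get("latestAvailableRevision") is None:
        return WAIT_MESSAGE
    return None


# Shape from the must-gather above: the revision is nested under the
# oauthAPIServer operand, so the top-level lookup finds nothing.
STUCK_STATUS = {
    "oauthAPIServer": {"latestAvailableRevision": 1},
    "readyReplicas": 0,
}

# Shape with the revision directly under .status, as single-operand
# operators have it.
TOP_LEVEL_STATUS = {"latestAvailableRevision": 1, "readyReplicas": 0}

if __name__ == "__main__":
    print(check_precondition(STUCK_STATUS))
    print(check_precondition(TOP_LEVEL_STATUS))
```

In this model the nested revision is present but never consulted, so the check can never clear on its own, which matches the stuck upgrade reported here.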

              deads@redhat.com David Eads
              rhn-support-jiajliu Jia Liu
              Xingxing Xia Xingxing Xia