Uploaded image for project: 'OpenShift Over the Air'
  1. OpenShift Over the Air
  2. OTA-1321

ClusterVersion status should include version-Pod error details

XMLWordPrintable

    • Icon: Epic Epic
    • Resolution: Unresolved
    • Icon: Major Major
    • None
    • None
    • None
    • ClusterVersion status should include version-Pod error details
    • BU Product Work
    • False
    • None
    • False
    • Not Selected
    • To Do
    • OCPSTRAT-1585 - Cluster-version operator version-pod failure accessability
    • OCPSTRAT-1585Cluster-version operator version-pod failure accessability
    • 0% To Do, 25% In Progress, 75% Done

      Epic Goal

      Currently the CVO launches a Job and waits for it to complete to get manifests for an incoming release payload. But the Job controller doesn't bubble up details about why the pod has trouble (e.g. Init:SignatureValidationFailed), so to get those details, we need direct access to the Pod. The Job controller doesn't seem like it's adding much value here, so the goal of this Epic is to drop it and create and monitor the Pod ourselves, so we can deliver better reporting of version-Pod state.

      Why is this important?

      When the version Pod fails to run, the cluster admin will likely need to take some action (clearing the update request, fixing a mirror registry, etc.). The more clearly we share the issues that the Pod is having with the cluster admin, the easier it will be for them to figure out their next steps.

      Scenarios

      oc adm upgrade and other ClusterVersion status UIs will be able to display Init:SignatureValidationFailed and other version-Pod failure modes directly. We don't expect to be able to give ClusterVersion consumers more detailed next-step advice, but hopefully the easier access to failure-mode context makes it easier for them to figure out next-steps on their own.

      Dependencies

      This change is purely and updates-team/OTA CVO pull request. No other dependencies.

      Contributing Teams

      • Development - OTA
      • Documentation - OTA
      • QE - OTA

      Acceptance Criteria

      Definition of done: failure modes like unretrievable image digests (e.g. quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000) or images with missing or unacceptable Sigstore signatures with OTA-1304's ClusterImagePolicy) have failure-mode details in ClusterVersion's RetrievePayload message, instead of the current Job was active longer than specified deadline.

      Drawbacks or Risk

      Limited audience, and failures like Init:SignatureValidationFailed are generic, while CVO version-Pod handling is pretty narrow. This may be redundant work if we end up getting nice generic init-Pod-issue handling like RFE-5627. But even if the work ends up being redundant, thinning the CVO stack by removing the Job controller is kind of nice.

      Done - Checklist

      The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.

      • CI Testing - Tests are merged and completing successfully
      • Documentation - Content development is complete.
      • QE - Test scenarios are written and executed successfully.
      • Technical Enablement - Slides are complete (if requested by PLM)
      • Other 

              trking W. Trevor King
              trking W. Trevor King
              Dinesh Kumar S Dinesh Kumar S
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated: