-
Epic
-
Resolution: Unresolved
-
Major
-
ClusterVersion status should include version-Pod error details
-
BU Product Work
-
To Do
-
OCPSTRAT-1585 - Cluster-version operator version-pod failure accessability
-
0% To Do, 25% In Progress, 75% Done
Epic Goal
Currently the CVO launches a Job and waits for it to complete in order to retrieve the manifests for an incoming release payload. But the Job controller doesn't bubble up details about why the Pod is having trouble (e.g. Init:SignatureValidationFailed), so getting those details requires direct access to the Pod. The Job controller doesn't seem to add much value here, so the goal of this Epic is to drop it and have the CVO create and monitor the Pod itself, so we can deliver better reporting of version-Pod state.
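As a rough illustration of the kind of detail that direct Pod access exposes, the sketch below derives a kubectl-style Init:<Reason> string from a Pod's status.initContainerStatuses. It is a simplified approximation of how kubectl renders init-container state, not the CVO's implementation. The function name summarize_init_failure and the sample Pod (including the container names) are hypothetical.

```python
from typing import Any, Dict, Optional


def summarize_init_failure(pod: Dict[str, Any]) -> Optional[str]:
    """Return 'Init:<Reason>' for the first failed or stuck init container, else None.

    Hypothetical helper: 'pod' is a Pod object decoded from JSON into a dict.
    """
    statuses = pod.get("status", {}).get("initContainerStatuses", [])
    for status in statuses:
        state = status.get("state", {})
        terminated = state.get("terminated")
        if terminated is not None:
            exit_code = terminated.get("exitCode", 0)
            if exit_code == 0:
                continue  # this init container finished cleanly
            # Prefer the reported reason; fall back to the exit code.
            reason = terminated.get("reason") or f"ExitCode:{exit_code}"
            return f"Init:{reason}"
        waiting = state.get("waiting")
        if waiting is not None:
            reason = waiting.get("reason")
            # "PodInitializing" is normal progress, not a failure.
            if reason and reason != "PodInitializing":
                return f"Init:{reason}"
    return None


# Made-up Pod status, for illustration only.
example_pod = {
    "status": {
        "initContainerStatuses": [
            {"name": "first", "state": {"terminated": {"exitCode": 0}}},
            {"name": "second", "state": {"waiting": {"reason": "SignatureValidationFailed"}}},
        ]
    }
}
print(summarize_init_failure(example_pod))
```

A Job's status only reports conditions for the Job as a whole, which is why this per-container state is not visible without reading the Pod.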
Why is this important?
When the version Pod fails to run, the cluster admin will likely need to take some action (clearing the update request, fixing a mirror registry, etc.). The more clearly we share the issues that the Pod is having with the cluster admin, the easier it will be for them to figure out their next steps.
Scenarios
oc adm upgrade and other ClusterVersion status UIs will be able to display Init:SignatureValidationFailed and other version-Pod failure modes directly. We don't expect to be able to give ClusterVersion consumers more detailed next-step advice, but easier access to failure-mode context should make it easier for them to figure out next steps on their own.
Dependencies
This change is purely an updates-team/OTA CVO pull request. No other dependencies.
Contributing Teams
- Development - OTA
- Documentation - OTA
- QE - OTA
Acceptance Criteria
Definition of done: failure modes like unretrievable image digests (e.g. quay.io/openshift-release-dev/ocp-release@sha256:0000000000000000000000000000000000000000000000000000000000000000) or images with missing or unacceptable Sigstore signatures (with OTA-1304's ClusterImagePolicy) have failure-mode details in ClusterVersion's RetrievePayload message, instead of the current "Job was active longer than specified deadline".
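One way to phrase this criterion as a check is sketched below. The helper name has_failure_detail and the sample messages are hypothetical; the only string taken from this Epic is the current deadline message. The actual wording of the improved RetrievePayload message is not specified here, so the check only requires that the expected detail appears and that the message is more than the bare deadline text.

```python
GENERIC_DEADLINE_MESSAGE = "Job was active longer than specified deadline"


def has_failure_detail(message: str, expected_detail: str) -> bool:
    """Return True if the message carries the expected failure-mode detail.

    Hypothetical check: the message must contain 'expected_detail' and must
    not be just the generic Job deadline text.
    """
    if message.strip() == GENERIC_DEADLINE_MESSAGE:
        return False
    return expected_detail in message


# Illustrative messages only; the second one is invented wording.
old_message = GENERIC_DEADLINE_MESSAGE
new_message = "version Pod failed: Init:SignatureValidationFailed"

print(has_failure_detail(old_message, "Init:SignatureValidationFailed"))
print(has_failure_detail(new_message, "Init:SignatureValidationFailed"))
```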
Drawbacks or Risk
Limited audience. Failures like Init:SignatureValidationFailed are generic, while CVO version-Pod handling is pretty narrow, so this may be redundant work if we end up getting nice generic init-Pod-issue handling like RFE-5627. But even if the work ends up being redundant, thinning the CVO stack by removing the Job controller is kind of nice.
Done - Checklist
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
- CI Testing - Tests are merged and completing successfully
- Documentation - Content development is complete.
- QE - Test scenarios are written and executed successfully.
- Technical Enablement - Slides are complete (if requested by PLM)
- Other
- relates to: OTA-1170 [TechPreview] Support verifying release images with Sigstore signatures (Closed)