Epic
Resolution: Won't Do
Simplify troubleshooting failing Argo CD instances
Epic Goal
- Simplify debugging and troubleshooting of Argo CD instances for which users/admins have received failure alerts. Do this by providing a centralized dashboard page in the console, specific to OpenShift GitOps, that consolidates information about which instances are healthy or failing, along with other related diagnostic information
NOTE: This epic would build out the 2nd part of this proposal.
Why is this important?
- At present there is no infrastructure or process in place to effectively debug why an instance is not in a healthy state. The only information available is the status of each workload tracked in the Argo CD CR status, which is one of 'unknown', 'failed', 'pending' or 'running'. An instance can be reported as 'failed' or 'pending' with no additional information whatsoever. This is not helpful, especially in production environments with multiple instances. Troubleshooting such issues typically involves inspecting events, logs, etc., so users/admins need to know how to go about finding the underlying cause.
- A dedicated page/hub showing the color-coded health status of all instances, along with supplementary information such as error logs and events, would accelerate troubleshooting. It would be a significant value-add for admins who manage multiple instances or who must alert instance owners about issues.
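To make the color-coded health view concrete, the following Python sketch collapses the per-workload statuses listed above ('unknown', 'failed', 'pending', 'running') into a single health value per instance and sorts failing instances to the top. It is a hypothetical illustration, not the operator's or the console's implementation. The `Instance` type, the `instance_health` and `summarize` helpers, the health names and the color mapping are all assumptions. A real page would read the statuses from the Argo CD CRs on the cluster rather than from hard-coded data, and the exact status field names and value casing should be checked against the CRD.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from aggregated health to a dashboard color.
COLORS = {"healthy": "green", "progressing": "yellow", "degraded": "red", "unknown": "grey"}

# Sort order for the dashboard: most urgent first.
ORDER = {"degraded": 0, "progressing": 1, "unknown": 2, "healthy": 3}


@dataclass
class Instance:
    """Hypothetical stand-in for one Argo CD CR: its namespace, its name and
    the per-workload entries of its status (workload names are assumptions)."""
    namespace: str
    name: str
    workload_status: dict[str, str] = field(default_factory=dict)


def instance_health(inst: Instance) -> str:
    """Collapse per-workload statuses into one health value (worst wins).

    Values are compared case-insensitively, since the operator may report
    them capitalized (e.g. "Running")."""
    values = {v.lower() for v in inst.workload_status.values()}
    if "failed" in values:
        return "degraded"
    if "pending" in values:
        return "progressing"
    # No statuses at all, or anything other than 'running' ('unknown' or an
    # unrecognized value), is treated as unknown.
    if not values or values - {"running"}:
        return "unknown"
    return "healthy"


def summarize(instances: list[Instance]) -> list[tuple[str, str, str]]:
    """Return (namespace/name, health, color) rows, failing instances first."""
    rows = [(f"{i.namespace}/{i.name}", instance_health(i)) for i in instances]
    rows.sort(key=lambda row: (ORDER[row[1]], row[0]))
    return [(ref, health, COLORS[health]) for ref, health in rows]


if __name__ == "__main__":
    fleet = [
        Instance("team-a", "argocd", {"server": "Running", "repo": "Running", "redis": "Running"}),
        Instance("team-b", "argocd", {"server": "Running", "repo": "Failed"}),
        Instance("team-c", "argocd", {"server": "Pending", "redis": "Unknown"}),
    ]
    for ref, health, color in summarize(fleet):
        print(f"{color:6} {health:11} {ref}")
```

Letting the worst workload status decide the instance color means a single failed component is enough to surface an instance at the top of the page. The admin can then drill into that instance's events and logs.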
Scenarios
- I am a cluster admin who manages a cluster running the OpenShift GitOps operator with hundreds of Argo CD instances. I have
Acceptance Criteria (Mandatory)
- There is an established, central monitoring hub in the console for all Argo CD instances managed by the OpenShift GitOps operator that can be used to troubleshoot failing instances
Dependencies (internal and external)
Done Checklist
- Acceptance criteria are met
- Non-functional properties of the Feature have been validated (such as performance, resource, UX, security or privacy aspects)
- User Journey automation is delivered
- Support and SRE teams are provided with enough skills to support the feature in production environment
Issue Links
- depends on: GITOPS-2039 OpenShift GitOps is failing to report ArgoCD instance problems (Closed)
- is related to: GITOPS-1767 GitOps monitoring dashboards in admin console (Closed)