Type: Epic
Resolution: Won't Do
Priority: Major
Fix Version/s: Future
Affects Version/s: None
Component/s: Observability
Labels:
None

Epic Name:
Serviceability Epic
Blocked:
False
Blocked Reason:

None
Ready:
False
Epic Status:
To Do
Feature Link:
ACM-1169 - Deliver an ACM dashboard overview that conveys the overall status and health of all the ACM components
Parent Link:
ACM-1169 - Deliver an ACM dashboard overview that conveys the overall status and health of all the ACM components

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Epic Goal

In the short term, build a set of tools, and in the longer term add alerting and dashboards, to improve the serviceability of ACM. We also need tooling that makes the must-gather output easier to read.

Why is this important?

ACM has grown to include a very large number of toolsets (operators) that make fleet management easier, but it is not easy to tell quickly whether all parts of ACM are healthy. We have seen this at a number of customer sites. For example, suppose a customer calls us to fix some issues with GRC, and we fix them to the customer's satisfaction. We still do not know, and have no easy way to check, whether other parts of ACM are broken. Imagine the customer's annoyance on finding out later that observability was still broken and the RH engineer simply did not check it. This gives the impression that RHACM is unstable or hard to manage.
If a customer has several managed clusters with all addons enabled, scanning the entire must-gather output is very time-consuming and error-prone. We usually look at the output related to the reported problem and may miss other components that are broken but not yet reported (this will happen more and more as ACM grows wider). Again, this has happened.

Scenarios (An SRE, ACM Hub admin, ACM Hub of Hubs admin, a service provider (ACM customer))

As an SRE, we need a tool or script to check the health of ACM's core functions.
- For example, after importing a large number of clusters to the ACM Hub, we want to quickly verify that all core agent and addon components are working (klusterlet/Obs/GRC/Search/CLC/Policy/etc).
- If we have a tool to check these CR statuses, it should be easy to see which components are not working.
The customer or RH engineer needs to run a quick health check to see whether the RHACM hub is healthy.
- This should comprehensively check all of the RHACM pillars, Operator health, CR status, metrics, and disk space, as well as node health, API server health, etc.
- If anything is not working and the fix can be easily determined, apply the fix if possible and rerun the health check tool.
- If this does not solve the problem, run must-gather and send the data to the RH Engineering team.
- As the tool matures, the need to run must-gather should ideally diminish.
The RH SRE team is using ACM to manage a large set of clusters that run customer workloads. It is critical that they know if any part of ACM is failing, through alerts backed up by SLOs and dashboards.
- It is also important to identify whether any of the services that ACM provides are failing their SLOs, so that corrective action can be taken.
If the cluster admin cannot access the managed cluster, we can only access it through the ACM Search service. In this case, we need more capabilities to carry out corrective actions on the managed cluster.
- Or, if possible, we use ACM to force workloads off the managed cluster and quickly decommission it with the least impact on the applications.
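
As a rough illustration of the CR status check described in the first scenario, the sketch below walks a list of resources and reports those that do not show a healthy condition. It is only a sketch under stated assumptions: the resources are plain dictionaries shaped like a Kubernetes `status.conditions` list, the condition type `Available` is assumed, and the function name, component names, and messages are hypothetical. A real tool would fetch the CRs from the hub rather than use inline sample data.

```python
from typing import Dict, List


def unhealthy_components(resources: List[dict],
                         condition_type: str = "Available") -> Dict[str, str]:
    """Return {name: reason} for each resource whose condition is not "True".

    Assumption: each resource carries status.conditions entries with
    "type", "status", and optionally "reason" / "message" fields.
    """
    failing: Dict[str, str] = {}
    for res in resources:
        name = res.get("metadata", {}).get("name", "<unnamed>")
        conditions = res.get("status", {}).get("conditions", [])
        match = next(
            (c for c in conditions if c.get("type") == condition_type), None
        )
        if match is None:
            # A missing condition is reported as unhealthy, not silently skipped.
            failing[name] = f"no {condition_type} condition reported"
        elif match.get("status") != "True":
            failing[name] = (
                match.get("message") or match.get("reason") or "unknown"
            )
    return failing


# Hypothetical sample data; names and messages are illustrative only.
sample = [
    {"metadata": {"name": "search"},
     "status": {"conditions": [{"type": "Available", "status": "True"}]}},
    {"metadata": {"name": "observability"},
     "status": {"conditions": [{"type": "Available", "status": "False",
                                "message": "example failure message"}]}},
    {"metadata": {"name": "grc"}, "status": {}},
]

report = unhealthy_components(sample)
for name, reason in sorted(report.items()):
    print(f"{name}: {reason}")
```

Treating a missing condition as a failure is a deliberate choice here: the point of the scenarios is to surface components that nobody was looking at, so an absent status should be flagged instead of assumed healthy.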

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
We need more ACM component metrics (and Operator metrics) exposed and collected.
For the managed cluster metric acm_managed_cluster_info, we need more metadata carried from the cluster labels into the metric. Alternatively, we can provide a config for Observability to set the external labels as metadata for all alerts on the ACM Hub.
Operator status must be collected as well (not all critical diagnostic data can be put into metrics).
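
To make the acm_managed_cluster_info criterion more concrete, the sketch below shows one possible way cluster metadata could be attached to alerts: copy selected labels from an info-style sample onto every alert for the same cluster. This is an assumption-laden illustration, not how Observability works today. The join label `cluster`, the copied labels `region` and `owner`, the alert name, and the function name are all hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def enrich_alerts(alerts: Iterable[Dict[str, str]],
                  info_samples: Iterable[Dict[str, str]],
                  join_label: str = "cluster",
                  copy_labels: Sequence[str] = ("region",)) -> List[Dict[str, str]]:
    """Copy metadata labels from info-metric samples onto matching alerts.

    Assumption: each info sample is the label set of one
    acm_managed_cluster_info series, and alerts carry the same join label.
    """
    by_cluster = {s[join_label]: s for s in info_samples if join_label in s}
    enriched = []
    for alert in alerts:
        labels = dict(alert)  # do not mutate the caller's alert
        info = by_cluster.get(labels.get(join_label), {})
        for key in copy_labels:
            # Labels already set on the alert take precedence.
            if key in info and key not in labels:
                labels[key] = info[key]
        enriched.append(labels)
    return enriched


# Hypothetical label sets, for illustration only.
info_samples = [{"cluster": "prod-east-1", "region": "us-east", "owner": "team-a"}]
alerts = [
    {"alertname": "ExampleAddonDown", "cluster": "prod-east-1"},
    {"alertname": "ExampleAddonDown", "cluster": "not-in-info"},
]

enriched = enrich_alerts(alerts, info_samples, copy_labels=("region", "owner"))
for labels in enriched:
    print(labels)
```

Alerts for clusters with no matching info sample pass through unchanged, so missing metadata never drops an alert.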

Dependencies (internal and external)

Input/participation from all ACM squads to expose their Operator metrics and explain how to compute the operator health and SLO
We could use search to get deeper insights into the workings of ACM. However, how wise is it to use ACM (search being a part of ACM) to monitor ACM itself?
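
Since the squads are asked to explain how to compute operator health and SLO, a small worked example may help frame that discussion. The sketch below uses the common error-budget formulation, which is an assumption here and not something the squads have agreed on: given an SLO target and counts of good and total events, it returns the fraction of the error budget still unused. The function name and the sample numbers are hypothetical.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Return the fraction of the error budget still unused.

    1.0 means no budget consumed, 0.0 means the budget is exactly spent,
    and a negative value means the SLO has been violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")

    allowed_bad = (1 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good               # failures observed
    return 1 - actual_bad / allowed_bad


# Hypothetical numbers: a 99% SLO over 10,000 health probes, 50 of which failed.
remaining = error_budget_remaining(slo_target=0.99, good=9950, total=10000)
print(f"error budget remaining: {remaining:.0%}")
```

With these sample numbers the SLO tolerates 100 failed probes, so 50 failures consume about half of the budget. What counts as a "good" event for each operator is exactly the input needed from the squads.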

Previous Work (Optional):

Open questions:

Who will own this work in the future? SRE

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

is related to

ACM-1246 Display the health of the ACM hub

Closed

Assignee: Christian Stark

Reporter: Christine Rizzo

Manager: Christine Rizzo

Technical Lead: Joydeep Banerjee

Designer: Joy Jean

Architect: Randy George

Product Manager: Scott Berens

Votes: 0

Watchers: 5

Created:: 2022/06/02 5:28 PM

Updated:: 2025/12/15 3:37 PM

Resolved:: 2025/12/15 3:37 PM
