
    • Epic
    • Resolution: Unresolved
    • Major
    • Future
    • None
    • Observability
    • None
    • Serviceability Epic
    • False
    • False
    • To Do
    • ACM-1169 - Deliver an ACM dashboard overview that conveys the overall status and health of all the ACM components

      Epic Goal

      • In the short term, build a set of tools, and in the longer term add alerting and dashboards, to improve the serviceability of ACM. We also need tooling that makes the must-gather output easier to read.

      Why is this important?

      • ACM has grown to contain a very large number of toolsets (operators) that make fleet management easier, but it is not easy to tell quickly whether all parts of ACM are healthy. We have seen this at a number of customer sites. For example, suppose a customer calls us to fix an issue with GRC, and we fix it to the customer's satisfaction. We still have no easy way to check whether other parts of ACM are broken. Imagine the customer's annoyance on finding out later that observability was also broken and the RH engineer simply did not check it. This gives the impression that RHACM is unstable or hard to manage.
      • If a customer has several managed clusters with all addons enabled, scanning the entire output is very time-consuming and error-prone. We usually look only at the output related to the reported problem, so we may miss other components that are broken but not yet reported. This will happen more and more as ACM grows in breadth, and it has already happened.

      Scenarios (An SRE, ACM Hub admin, ACM Hub of Hubs admin, a service provider (ACM customer))

      1. As an SRE, I need a tool or script to check the health of the ACM core functions.
        • For example, we imported a lot of clusters to the ACM Hub and now want to quickly verify all core agent & addon components are working (klusterlet/Obs/GRC/Search/CLC/Policy/etc).
        • If we have a tool that checks the statuses of these CRs, it should be easy to know which components are not working.
      2. The customer or RH engineer needs to run a quick health check to see whether the RHACM hub is healthy.
        • This should comprehensively check all of the RHACM pillars, Operator health, CR status, metrics, and disk space, as well as node health, API server health, etc.
        • If something is not working and the fix can be easily determined, apply the fix where possible and rerun the health check tool.
        • If this does not solve the problem, run the must-gather and send data back to the RH Engineering team.
        • As the tool matures, ideally the need to run must-gather should diminish.
      3. The RH SRE team is using ACM to manage a large set of clusters that run customer workloads. It is critical that they know if any part of ACM is failing, through alerts backed by SLOs and through dashboards.
        • It is also important to identify whether any of the services that ACM provides is failing its SLO, so that corrective action can be taken.
      4. If the cluster admin cannot access the managed cluster, we can only access it through the ACM Search service. In this case, we need more capabilities to carry out corrective actions on the managed cluster.
        • OR, if possible, we use ACM to force workload off the managed cluster and quickly decommission it with the least impact to the applications.
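Scenarios 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: it assumes the CRs have already been fetched as JSON (for example the `items` array from `oc get managedclusteraddons -A -o json`) and that each CR reports health through `status.conditions` entries with `type`, `status`, and `reason` fields. The function names, the sample addon names, and the three outcome strings are hypothetical.

```python
from typing import Dict, List


def unhealthy_resources(resources: List[dict], required: Dict[str, str]) -> List[str]:
    """List every resource whose required status conditions are not met.

    `required` maps a condition type to the status it must have,
    for example {"Available": "True"} (an assumed convention).
    """
    problems = []
    for res in resources:
        meta = res.get("metadata", {})
        ident = f'{res.get("kind", "Unknown")}/{meta.get("namespace", "-")}/{meta.get("name", "?")}'
        # Index the conditions by type so each required one is a single lookup.
        conditions = {c.get("type"): c for c in res.get("status", {}).get("conditions", [])}
        for ctype, wanted in required.items():
            cond = conditions.get(ctype)
            if cond is None:
                problems.append(f"{ident}: condition {ctype} missing")
            elif cond.get("status") != wanted:
                reason = cond.get("reason", "no reason given")
                problems.append(f"{ident}: {ctype}={cond.get('status')} ({reason})")
    return problems


def next_step(problems: List[str], already_retried: bool) -> str:
    """Mirror the flow in scenario 2: fix and rerun first, must-gather as a fallback."""
    if not problems:
        return "healthy"
    if not already_retried:
        return "fix-and-rerun"
    return "run-must-gather"


# Illustrative input only; the addon names and reasons are made up for the example.
sample = [
    {"kind": "ManagedClusterAddOn",
     "metadata": {"namespace": "cluster1", "name": "search-collector"},
     "status": {"conditions": [{"type": "Available", "status": "True", "reason": "LeaseUpdated"}]}},
    {"kind": "ManagedClusterAddOn",
     "metadata": {"namespace": "cluster1", "name": "observability-controller"},
     "status": {"conditions": [{"type": "Available", "status": "False", "reason": "LeaseExpired"}]}},
]
problems = unhealthy_resources(sample, {"Available": "True"})
for line in problems:
    print(line)
print(next_step(problems, already_retried=False))
```

Keeping the check as a pure function over already-fetched JSON means the same logic could run against a live hub or against a must-gather archive, which is one way the tool might also help with must-gather readability.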

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • We need more ACM component metrics (and Operator metrics) exposed and collected.
      • For the managed cluster metric acm_managed_cluster_info, we need more metadata carried from the labels into the metric. Alternatively, we can provide a config for Observability to set the external labels as metadata for all alerts on the ACM Hub.
      • Operator status collected as well (not all critical data for diagnosis can be put into metrics).
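The label-metadata criterion above can be illustrated with a small sketch of what "enriching alerts with cluster metadata" might mean in practice. Everything specific here is an assumption made for the example: the join label `cluster`, the metadata labels `vendor` and `region`, the alert name, and the function name `enrich_alerts`. The actual label set of acm_managed_cluster_info should be checked before relying on any of them.

```python
from typing import Dict, Iterable, List


def enrich_alerts(
    alerts: List[Dict[str, str]],
    cluster_info: List[Dict[str, str]],
    join_label: str = "cluster",                  # assumed label shared by alerts and the info metric
    wanted: Iterable[str] = ("vendor", "region"),  # assumed metadata labels to copy
) -> List[Dict[str, str]]:
    """Copy selected labels from per-cluster info series onto matching alerts.

    Existing alert labels are never overwritten, and the inputs are not mutated.
    """
    by_cluster = {s[join_label]: s for s in cluster_info if join_label in s}
    enriched = []
    for alert in alerts:
        labels = dict(alert)  # work on a copy
        info = by_cluster.get(labels.get(join_label), {})
        for key in wanted:
            if key in info and key not in labels:
                labels[key] = info[key]
        enriched.append(labels)
    return enriched


# Illustrative label sets only.
info_series = [{"cluster": "cluster1", "vendor": "OpenShift", "region": "us-east-1"}]
alerts = [
    {"alertname": "ExampleAddonDown", "cluster": "cluster1"},
    {"alertname": "ExampleAddonDown", "cluster": "cluster9"},
]
enriched = enrich_alerts(alerts, info_series)
for a in enriched:
    print(a)
```

An alert for a cluster with no matching info series passes through unchanged, so missing metadata never hides an alert.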

      Dependencies (internal and external)

      1. Input/participation from all ACM squads to expose their Operator metrics and explain how to compute the operator health and SLO
      2. We could use Search to get deeper insights into the workings of ACM. However, how wise is it to use ACM (Search being a part of ACM) to monitor ACM itself?
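For dependency 1, each squad would need to say how its operator's health maps to an SLO. One common way to express this, shown here only as a starting point for that discussion, is a ratio-based SLI compared against a target, with a burn rate that says how fast the error budget is being consumed. The function name `slo_report`, its parameters, and the notion of counting "good" versus "total" events are assumptions for the sketch; the epic does not define them.

```python
def slo_report(good: int, total: int, target: float) -> dict:
    """Summarise a ratio-based SLO over one measurement window.

    good   : number of successful events (for example, successful reconciles)
    total  : number of all events in the window
    target : SLO target as a fraction, for example 0.99
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    sli = good / total                 # observed success ratio
    error_budget = 1 - target          # fraction of events allowed to fail
    burn_rate = (1 - sli) / error_budget  # 1.0 means failing exactly at the allowed rate
    return {"sli": sli, "burn_rate": burn_rate, "meets_slo": sli >= target}


# Illustrative numbers only.
report = slo_report(good=995, total=1000, target=0.99)
print(report["meets_slo"], round(report["burn_rate"], 2))
```

A burn rate above 1 over a sustained window is the usual trigger for an SLO-based alert, which would connect this calculation back to the alerting in scenario 3.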

      Previous Work (Optional):

      Open questions:

      1. Who will own this work in the future? SRE 

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              rhn-support-cstark Christian Stark
              crizzo71 Christine Rizzo
              Christine Rizzo Christine Rizzo
              Joydeep Banerjee Joydeep Banerjee
              Joy Jean Joy Jean
              Randy George Randy George
              Scott Berens Scott Berens