Red Hat Advanced Cluster Management / ACM-1246

Display the health of the ACM hub

    • Type: Epic
    • Resolution: Obsolete
    • Priority: Critical
    • Future
    • None
    • Search
    • Display the health of an ACM hub
    • False
    • False
    • To Do
    • ACM-1169 - Deliver an ACM dashboard overview that conveys the overall status and health of all the ACM components

      Epic Goal

      • We need to prioritize views into the health of the ACM hub.
      • In a future state, we can use the health of the hub to determine when to deploy a new leaf hub for HoH, or when to shift workloads to a leaf hub with more capacity.

      Why is this important?

      • As we use ACM / MCE in support of SD, AOC, HoH, and large-scale deployments with customers, we must know the health of the hub.
      • SRE needs these views today, as they use ACM / MCE for Ansible on Cloud.

      Scenarios

      1. As an ACM SRE I want to be alerted if any ACM components are NOT healthy.
      2. As an ACM SRE I want visibility of the component health. We need instrumentation in our components so that we can surface to the user whether a feature is working properly:
        1. I want to see that the Search V2 service and its backing PostgreSQL datastore are performing well and within the expected service level.
        2. I want to see that the GRC service is performing well and within the expected service level.
        3. I want to see that AppSub / OpenShift GitOps is performing well and within the expected service level.
        4. I want to see that the Thanos monitoring/alerting service is performing well. This includes the Observatorium API gateway, the Thanos components, Grafana, Alertmanager, etc. This does NOT include the object store, which is customer-provided. However, issues writing to the object store should be surfaced.
        5. I want to see that the Hive service is performing well and within the expected service level.
        6. I want to see that the Infrastructure Operator service (CIM/AI) is performing well and within the expected service level.
        7. I want to see that the ACM Console is performing well and within the expected service level.
      3. As an ACM User I want better feedback about the operation I just performed: whether it progressed or completed, and how quickly it was done.
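
      To make the scenarios above concrete, here is a minimal Python sketch of how per-component health could be classified against a service level and turned into a list of components to alert on. Everything in it is an assumption for illustration only: the `ComponentSample` fields, the latency-based service level, the threshold values, and the component names are hypothetical and do not reflect any existing ACM API or metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentSample:
    """One health observation for a hub component (all fields hypothetical)."""
    name: str
    available: bool         # whether the component responded at all
    p99_latency_ms: float   # observed 99th-percentile latency
    slo_latency_ms: float   # assumed service-level target for that latency


def classify(sample: ComponentSample) -> str:
    """Return 'unhealthy', 'degraded', or 'healthy' for one component."""
    if not sample.available:
        return "unhealthy"
    if sample.p99_latency_ms > sample.slo_latency_ms:
        return "degraded"
    return "healthy"


def components_to_alert(samples: list[ComponentSample]) -> list[str]:
    """Names of components that are not healthy, in input order."""
    return [s.name for s in samples if classify(s) != "healthy"]


# Illustrative values only; real thresholds would come from agreed service levels.
samples = [
    ComponentSample("search-v2", True, 180.0, 500.0),
    ComponentSample("grc", True, 950.0, 500.0),
    ComponentSample("hive", False, 0.0, 500.0),
    ComponentSample("console", True, 120.0, 500.0),
]

for s in samples:
    print(f"{s.name}: {classify(s)}")
print("alert on:", components_to_alert(samples))
```

      In this sketch an unavailable component is treated as more severe than one that is merely outside its service level, which maps to scenario 1 (alerting) and scenario 2 (visibility) respectively.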

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • An SRE can observe the health of ACM components.
      • An SRE can be notified of critical alerts when the health of ACM components is sub-optimal.

      Dependencies (internal and external)

      1. Code instrumentation across all ACM pillars

      Previous Work (Optional):

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              Zackery Layne (zlayne@redhat.com)
              Scott Berens (sberens@redhat.com)
              Joydeep Banerjee
              Xiang Yin
              Joy Jean
              Scott Berens
              Votes: 1
              Watchers: 9
