-
Epic
-
Resolution: Done
-
Major
-
1.3.0, 1.6.0
-
Instrument metrics for plugins that need monitoring
-
S
-
False
-
-
False
-
In Progress
-
RHIDP-3596 - Expose metrics for critical functionality
-
QE Needed, Docs Needed, TE Needed, Customer Facing, PX Needed
-
0% To Do, 0% In Progress, 100% Done
-
-
Feature
-
Done
-
-
EPIC Goal
What are we trying to solve here?
There isn't a good way to detect failures when integrating with external services. We should consider exposing metrics so that customers can set up their own monitoring and alerting.
Background/Feature Origin
While scoping out auth provider scenarios, it became apparent that User/Group entity syncs between RHDH and IdPs could fail. We need to investigate other types of service integration failures that can cause information to be out of sync or become unavailable due to intermittent service outages.
Why is this important?
This is important in cases where a user is moved from a higher-privileged group to a lower one. If there is a sync failure, the old permissions would remain intact, allowing unauthorized access. Without alerting, customers may not know there was a failure, and it will not be remediated promptly.
User Scenarios
- Instability in external systems. Alerting can prompt the RHDH admin to open a ticket to investigate failures/flakiness in the external system.
- Sync failures. Without monitoring, this will go undetected.
- If there is a complete outage in the external system, then the failure is obvious.
- If the external system is up but the sync fails due to an expired token/API key, then it could fly under the radar.
- Identification of product issues. Customers could see excessive calls to a service or API that degrade performance. This could be a product design flaw for which they can open a ticket.
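As an illustration of the scenarios above, here is a minimal sketch of a sync-attempt counter labelled by outcome and failure reason, rendered in the Prometheus text exposition format. It is not the RHDH implementation: the metric name `catalog_sync_attempts_total`, the `SyncMetrics` and `AuthError` classes, the `run_sync` helper, and the `keycloak` provider label are all hypothetical names chosen for the example, and label values are assumed to need no escaping.

```python
from collections import Counter
from typing import Callable

# Hypothetical metric name, chosen for illustration only.
METRIC = "catalog_sync_attempts_total"


class AuthError(Exception):
    """Hypothetical error raised when a token or API key is rejected."""


class SyncMetrics:
    """In-memory counter keyed by (provider, outcome, reason)."""

    def __init__(self) -> None:
        self._counts = Counter()

    def record(self, provider: str, outcome: str, reason: str = "") -> None:
        self._counts[(provider, outcome, reason)] += 1

    def render(self) -> str:
        """Render all counters in the Prometheus text exposition format."""
        lines = [f"# TYPE {METRIC} counter"]
        for (provider, outcome, reason), n in sorted(self._counts.items()):
            labels = f'provider="{provider}",outcome="{outcome}",reason="{reason}"'
            lines.append(f"{METRIC}{{{labels}}} {n}")
        return "\n".join(lines) + "\n"


def run_sync(metrics: SyncMetrics, provider: str, sync: Callable[[], None]) -> None:
    """Run one sync and record its outcome instead of failing silently."""
    try:
        sync()
    except AuthError:
        # Expired token/API key: the external system is up, so nothing
        # else would surface this failure.
        metrics.record(provider, "failure", "auth")
    except ConnectionError:
        # Outage or flakiness in the external system.
        metrics.record(provider, "failure", "unavailable")
    else:
        metrics.record(provider, "success")


def expired_token_sync() -> None:
    raise AuthError("token expired")


metrics = SyncMetrics()
run_sync(metrics, "keycloak", expired_token_sync)  # silent failure case
run_sync(metrics, "keycloak", lambda: None)        # successful sync
print(metrics.render())
```

With counters like these, an alert on a rising `outcome="failure"` series would cover the instability and silent sync failure scenarios, and the total attempt count per provider would give a rough signal for excessive calls.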
Dependencies (internal and external)
Acceptance Criteria
Release Enablement/Demo - Provide necessary release enablement details and documents
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Playwright: <link or reference to playwright>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>