Task
Resolution: Done
Undefined
1.4
5
False
False
RHIDP-5073 - OpenTelemetry Support
Removed Functionality
Done
RHDH supports monitoring of Prometheus metrics, but the only metrics exposed today come from upstream (catalog and scaffolder). We need to identify the areas of failure that could impact the availability and integrity of RHDH and the services it integrates with.
Goals (aka. expected user outcomes)
- The RHDH platform engineer can use the OpenTelemetry client to forward Prometheus metrics to their monitoring stack and configure alerting for critical areas of concern
- Expose metrics that customers can act on, e.g.:
  - Number of IdP sync failures
  - Number of auth failures
  - Any system errors that the admin can either recover from (e.g. pod restart, catalog cleanup) or follow up on by monitoring external service outages (e.g. GitHub, Azure, Quay)
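To make the goals above concrete, here is a minimal sketch of what actionable failure counters could look like. It is a dependency-free stand-in, not the OpenTelemetry API and not the RHDH implementation: the `LabelledCounter` class, the metric names (`rhdh_idp_sync_failures_total`, `rhdh_auth_failures_total`) and the label names (`provider`, `reason`) are all assumptions made for illustration. In a real plugin the counters would more likely be created through an OpenTelemetry meter.

```typescript
// Hypothetical, dependency-free stand-in for a labelled counter.
// Metric and label names below are illustrative assumptions only.
type Labels = Record<string, string>;

class LabelledCounter {
  private readonly values = new Map<string, number>();

  constructor(readonly name: string, readonly help: string) {}

  // Serialise labels in sorted key order so the same label set
  // always maps to the same series, whatever order it was written in.
  private static key(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map(k => `${k}="${labels[k]}"`)
      .join(',');
  }

  add(amount: number, labels: Labels = {}): void {
    if (amount < 0) throw new Error('counters can only increase');
    const key = LabelledCounter.key(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + amount);
  }

  get(labels: Labels = {}): number {
    return this.values.get(LabelledCounter.key(labels)) ?? 0;
  }

  // Render the series in a Prometheus-style text format.
  // Label values are not escaped here; this is a sketch.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    this.values.forEach((value, key) => {
      lines.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    });
    return lines.join('\n');
  }
}

const idpSyncFailures = new LabelledCounter(
  'rhdh_idp_sync_failures_total',
  'Number of failed IdP sync runs',
);
const authFailures = new LabelledCounter(
  'rhdh_auth_failures_total',
  'Number of failed authentication attempts',
);

// Simulate two failed GitHub org syncs and one failed login.
idpSyncFailures.add(1, { provider: 'github' });
idpSyncFailures.add(1, { provider: 'github' });
authFailures.add(1, { provider: 'oidc', reason: 'token_expired' });

console.log(idpSyncFailures.render());
console.log(authFailures.render());
```

Labelling each failure by provider is one possible design choice: it would let an alert distinguish an outage at a single external service from a problem inside RHDH itself.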
Requirements:
- Upstream instrumentation has deprecated prom-client in favour of OpenTelemetry. We need to determine what that means for migration and support. For example, the catalog metrics are considered experimental, and we can't declare support for them while they carry that designation.
- Review the metrics that are exposed upstream to see whether they are enough to cover our customers' situations. Scaffolder and Catalog metrics seem to be based on the deprecated prom-client, so we need to determine whether a replacement exists.
- Instrument the key plugins that will require metrics. Integrations with external service providers are a primary candidate.
- Build up scenarios for Docs and QE testing
1. [DOC] SME Review | Closed | Jessica He
2. [DOC] QE Review | Closed | Zbynek Drapela
3. [DOC] Peer Review | Closed | Fabrice Flore-Thébault