Type: Epic
Resolution: Done
Priority: Major
Fix Version/s: 1.6.0
Affects Version/s: 1.3.0, 1.6.0
Component/s: Core platform, Security
Labels:
- rhdh-1.6-candidate

Epic Name:
Instrument metrics for plugins that need monitoring
Size:
S
Blocked:
False
Blocked Reason:

None

Ready:
False
Epic Status:
In Progress
Feature Link:
RHIDP-3596 - Expose metrics for critical functionality
Planning:

QE Needed, Docs Needed, TE Needed, Customer Facing, PX Needed
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Release Note Text:

= OpenTelemetry metrics support added to the Keycloak backend plugin

With this update, the Keycloak backend plugin supports OpenTelemetry metrics, which you can use to monitor fetch operations and diagnose potential issues.

The available counters include the following:

* `backend_keycloak_fetch_task_failure_count_total`: Counts fetch task failures where no data was returned due to an error.

* `backend_keycloak_fetch_data_batch_failure_count_total`: Counts partial data batch failures. Even if some batches fail, the plugin continues fetching others.

These counters include the `taskInstanceId` label, which uniquely identifies each scheduled fetch task, and allows you to trace failures back to individual task executions.

Example metric output:

```text
backend_keycloak_fetch_data_batch_failure_count_total{taskInstanceId="df040f82-2e80-44bd-83b0-06a984ca05ba"} 1
```

You can export metrics using any OpenTelemetry-compatible backend, such as **Prometheus**.
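
To trace failures back to a task execution, the exported samples can be grouped by `taskInstanceId`. The following is a minimal sketch, not part of the plugin: it assumes the metrics are available as Prometheus text-format lines like the example above, and the helper name `parseFailureSamples` is hypothetical. Only the two counter names and the `taskInstanceId` label come from this release note.

```typescript
// Hypothetical helper (not a plugin API): extracts the Keycloak fetch failure
// counters from Prometheus text-format lines. Assumes label values contain
// no "}" characters, which holds for UUID-style taskInstanceId values.
const FAILURE_METRICS: readonly string[] = [
  "backend_keycloak_fetch_task_failure_count_total",
  "backend_keycloak_fetch_data_batch_failure_count_total",
];

interface FailureSample {
  metric: string;
  taskInstanceId: string;
  value: number;
}

function parseFailureSamples(exposition: string): FailureSample[] {
  const samples: FailureSample[] = [];
  // Matches lines of the form: name{labels} value
  const linePattern = /^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)/;
  for (const rawLine of exposition.split("\n")) {
    const line = rawLine.trim();
    // Skip blank lines and "# HELP" / "# TYPE" comment lines.
    if (line === "" || line.charAt(0) === "#") continue;
    const match = linePattern.exec(line);
    if (!match) continue;
    const name = match[1];
    const labels = match[2];
    const value = Number(match[3]);
    // Keep only the two failure counters named in the release note.
    if (FAILURE_METRICS.indexOf(name) === -1) continue;
    const idMatch = /(?:^|,)\s*taskInstanceId="([^"]*)"/.exec(labels);
    if (!idMatch || isNaN(value)) continue;
    samples.push({ metric: name, taskInstanceId: idMatch[1], value });
  }
  return samples;
}

// Usage with the sample line from the release note:
const exampleLine =
  'backend_keycloak_fetch_data_batch_failure_count_total{taskInstanceId="df040f82-2e80-44bd-83b0-06a984ca05ba"} 1';
console.log(parseFailureSamples(exampleLine));
```

In practice a Prometheus query or alert rule would usually do this grouping; the sketch only illustrates how the label identifies the failing task.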

Release Note Type:
Feature
Release Note Status:
Done
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

EPIC Goal

What are we trying to solve here?

There isn't a good way to detect failures when integrating with external services. We should consider exposing metrics so that customers can set up their own monitoring and alerting.

Background/Feature Origin

While scoping out auth provider scenarios, it became apparent that User/Group entity syncs between RHDH and IdPs could fail. We need to investigate other types of service integration failures that can cause information to become out of sync or unavailable due to intermittent service outages.

Why is this important?

This is important in cases where a user is moved from a higher-privileged group to a lower one. If the sync fails, the old permissions remain in place, allowing unauthorized access. Without alerting, customers may not know that a failure occurred, and it will not be remediated promptly.

User Scenarios

Instability in external systems. Alerting can prompt the RHDH admin to open a ticket to investigate failures/flakiness in the external system.
Sync failures. Without monitoring, these can go undetected.
- If there is a complete outage in the external system, the failure is obvious.
- If the external system is up but the sync fails due to an expired token or API key, the failure could fly under the radar.
Identification of product issues. Customers could see excessive calls to a service or API that degrade performance. This could indicate a product design flaw, for which they can open a ticket.
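
The sync-failure scenario above amounts to noticing that a failure counter has grown since the last check. A rough sketch of that check follows, assuming counter values have already been collected per task instance; the type `CounterSnapshot` and the function `newFailures` are hypothetical names for illustration, not part of RHDH or any plugin.

```typescript
// Hypothetical alert check: reports task instances whose failure counter grew
// between two scrapes. A task missing from the earlier scrape is treated as
// starting at 0. Limitation: a counter reset (for example after a restart)
// yields a negative delta and is ignored by this sketch.
type CounterSnapshot = Record<string, number>; // taskInstanceId -> counter value

function newFailures(
  previous: CounterSnapshot,
  current: CounterSnapshot,
): CounterSnapshot {
  const deltas: CounterSnapshot = {};
  for (const taskInstanceId of Object.keys(current)) {
    const before = taskInstanceId in previous ? previous[taskInstanceId] : 0;
    const delta = current[taskInstanceId] - before;
    // Any positive delta means new failures since the previous scrape.
    if (delta > 0) deltas[taskInstanceId] = delta;
  }
  return deltas;
}

// Usage: an admin-side job could alert when the result is non-empty.
const failing = newFailures({ "task-a": 1 }, { "task-a": 3, "task-b": 0 });
if (Object.keys(failing).length > 0) {
  console.log("New sync failures detected:", failing);
}
```

A monitoring backend would normally express this as a rate or increase query over the counter; the sketch only shows the underlying comparison.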

Dependencies (internal and external)

Acceptance Criteria

Release Enablement/Demo - Provide necessary release enablement details and documents

DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>

DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>

DEV - Downstream build attached to advisory: <link to errata>

QE - Test plans in Playwright: <link or reference to playwright>

QE - Automated tests merged: <link or reference to automated tests>

DOC - Downstream documentation merged: <link to meaningful PR>

Assignee: Aleksander Andriienko

Reporter: Kim Tsao

Team: RHIDP - Plugins

Votes: 0

Watchers: 3

Created: 2024/08/09 8:56 PM

Updated: 2025/05/07 6:57 PM

Resolved: 2025/04/21 2:37 PM
