• Instrument metrics for plugins that need monitoring
    • S
    • False
    • False
    • In Progress
    • RHIDP-3596 - Expose metrics for critical functionality
    • QE Needed, Docs Needed, TE Needed, Customer Facing, PX Needed
    • 0% To Do, 0% In Progress, 100% Done
    • Hide
      = OpenTelemetry metrics support added to the Keycloak backend plugin

      With this update, the Keycloak backend plugin supports OpenTelemetry metrics, which you can use to monitor fetch operations and diagnose potential issues.

      The available counters include the following:

      * `backend_keycloak_fetch_task_failure_count_total`: Counts fetch task failures where no data was returned due to an error.

      * `backend_keycloak_fetch_data_batch_failure_count_total`: Counts partial data batch failures. Even if some batches fail, the plugin continues fetching others.

      These counters include the `taskInstanceId` label, which uniquely identifies each scheduled fetch task, and allows you to trace failures back to individual task executions.

      Example metric output:

      ```text
      backend_keycloak_fetch_data_batch_failure_count_total{taskInstanceId="df040f82-2e80-44bd-83b0-06a984ca05ba"} 1
      ```

      You can export metrics using any OpenTelemetry-compatible backend, such as **Prometheus**.
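As an illustration of how the `taskInstanceId` label can be used to trace failures, the following minimal sketch parses metrics text in the Prometheus exposition format and sums one counter per task instance. The function name `failures_by_task` and the regular expressions are assumptions made for this example; they are not part of the plugin, and a production setup would more likely rely on the monitoring backend's own query language.

```python
import re
from collections import defaultdict

# One sample line: metric name, optional {labels}, then the numeric value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: name="value" (escaped quotes allowed).
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def failures_by_task(exposition_text, metric_name):
    """Sum the samples of one counter, grouped by the taskInstanceId label.

    Hypothetical helper for illustration only.
    """
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group(1) != metric_name:
            continue
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        # Samples without the label are grouped under an empty key.
        totals[labels.get("taskInstanceId", "")] += float(match.group(3))
    return dict(totals)


sample = '''
# TYPE backend_keycloak_fetch_data_batch_failure_count_total counter
backend_keycloak_fetch_data_batch_failure_count_total{taskInstanceId="df040f82-2e80-44bd-83b0-06a984ca05ba"} 1
'''
result = failures_by_task(
    sample, "backend_keycloak_fetch_data_batch_failure_count_total"
)
print(result)
```

Grouping by `taskInstanceId` keeps failures from different scheduled fetch tasks separate, so a non-zero total points at a specific task execution.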
    • Feature
    • Done

      EPIC Goal

      What are we trying to solve here?

      There isn't a good way to detect failures when integrating with external services. We should consider exposing metrics so that customers can set up their own monitoring and alerting.

      Background/Feature Origin

      While scoping out auth provider scenarios, it became apparent that User/Group entity syncs between RHDH and IdPs could fail. We need to investigate other types of service integration failures that can cause information to be out of sync or become unavailable due to intermittent service outages.

      Why is this important?

      This is important in cases where a user could be moved from a higher privileged group to a lower one. If there is a sync failure, the old permissions would remain intact, allowing unauthorized access. Without alerting, customers may not know there was a failure, and it will not be remediated immediately.

      User Scenarios

      • Instability in external systems.  Alerting can prompt the RHDH admin to open a ticket to investigate failures/flakiness in the external system.
      • Sync failures.  Without monitoring, these can go undetected.
        • If there is a complete outage in the external system then failure is obvious
        • If the external system is up but the sync fails due to an expired token/API key, then it could fly under the radar
      • Identification of product issues.  Customers could see excessive calls to a service or API that degrade performance.  This could potentially be a product design flaw for which they can open a ticket.
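The alerting scenarios above can be sketched as a comparison between two successive readings of a failure counter. This is only an illustration: the function names, the per-task dictionaries, and the `threshold` parameter are hypothetical, and a monitoring backend such as Prometheus would normally evaluate this kind of rule itself (for example, with its `increase()` function) rather than in a custom script.

```python
def counter_increase(previous, current):
    """Return how much a counter grew between two readings.

    Counters only go up, so a lower current value is treated as a reset
    (for example, after a process restart) and counted from zero.
    """
    if current < previous:
        return current
    return current - previous


def tasks_to_alert(previous_by_task, current_by_task, threshold=1):
    """List task IDs whose failure counter grew by at least `threshold`.

    Both arguments map a task ID to the counter value seen in one reading.
    Hypothetical helper for illustration only.
    """
    alerts = []
    for task_id, current in current_by_task.items():
        previous = previous_by_task.get(task_id, 0.0)
        if counter_increase(previous, current) >= threshold:
            alerts.append(task_id)
    return sorted(alerts)


earlier = {"task-a": 1.0, "task-b": 5.0}
later = {"task-a": 1.0, "task-b": 2.0, "task-c": 3.0}
print(tasks_to_alert(earlier, later))
```

Alerting on the increase rather than on the absolute value avoids repeatedly flagging a failure that has already been investigated.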

      Dependencies (internal and external)

      Acceptance Criteria

      Release Enablement/Demo - Provide necessary release enablement details
      and documents

      DEV - Upstream code and tests merged: <link to meaningful PR or GitHub
      Issue>

      DEV - Upstream documentation merged: <link to meaningful PR or GitHub
      Issue>

      DEV - Downstream build attached to advisory: <link to errata>

      QE - Test plans in Playwright: <link or reference to playwright>

      QE - Automated tests merged: <link or reference to automated tests>

      DOC - Downstream documentation merged: <link to meaningful PR>

              oandriie Aleksander Andriienko
              ktsao@redhat.com Kim Tsao
              RHIDP - Plugins