  1. AMQ Streams
  2. ENTMQST-7053

High CPU loop in Strimzi OAuth Client for unassigned Quarkus Native Kafka consumers on OpenShift after Keycloak outages


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • None
    • 3.1.0.GA
    • kafka, kafka-clients
    • None
    • False
    • False
    • Moderate

      There is already an issue open here: https://github.com/strimzi/strimzi-kafka-oauth/issues/291

      Environment:

      • Platform: OpenShift Container Platform
      • Runtime: Quarkus Native (built as native executable, running in containers)
      • Integration: Strimzi Kafka Cluster with Red Hat Build of Keycloak
      • Authentication: OAuth (via strimzi-kafka-oauth)

      Description: We identified an issue in the Strimzi OAuth Client Library when running within a Quarkus Native application on OpenShift.

      The workload involves a Kafka consumer deployment scaled to a high number of replicas (20–30 pods), while the target Kafka topic is configured with a low number of partitions (e.g., 3 partitions). This setup does not follow best practice: the majority of the pods are not assigned to a partition and remain idle in a "waiting" state.
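
      The replica/partition mismatch can be put in numbers with a minimal sketch. It assumes a single topic, one consumer per pod, and all pods in the same consumer group, so that each partition is owned by at most one group member. The function name `idle_consumers` is hypothetical and only used for illustration.

```python
def idle_consumers(replicas: int, partitions: int) -> int:
    """Return how many group members are left without a partition.

    Assumption: one consumer per pod, one topic, one consumer group.
    """
    if replicas < 0 or partitions < 0:
        raise ValueError("replicas and partitions must be non-negative")
    # Each partition is owned by at most one member of the group,
    # so every member beyond the partition count stays unassigned.
    return max(0, replicas - partitions)


# Replica counts mentioned in this report, against a 3-partition topic.
for replicas in (3, 10, 20, 30):
    print(f"{replicas} pods -> {idle_consumers(replicas, 3)} unassigned")
```

      With 20 to 30 replicas and 3 partitions, 17 to 27 pods fall into the unassigned group that this report is about.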

      When the authentication provider (in our case Keycloak) experiences an outage, is being updated, or returns server errors (simulated via 500 errors at the token endpoint), these unassigned containers enter a busy loop with excessive CPU consumption that persists after the Keycloak service is available again.

      Symptoms:

      • CPU Spike: After a Keycloak outage, CPU usage on the affected pods spikes from a baseline of ~0.002 cores to approximately 400 times that amount (roughly 0.8 cores) and does not recover automatically.
      • Scope: This behavior is restricted to pods that are not assigned to a Kafka partition. Active members of the Consumer Group function normally; only the unassigned, waiting containers are trapped in the loop.
      • Persistence: The unassigned containers remain stuck in the loop with the corrupt authentication state even after connectivity is restored.
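
      As a quick check of the numbers above, the following sketch computes the spike factor from the two CPU readings in this report. The helper names and the threshold of 100x are hypothetical choices for illustration, not values taken from any monitoring tool.

```python
def spike_factor(baseline_cores: float, current_cores: float) -> float:
    """Ratio of current CPU usage to the idle baseline."""
    if baseline_cores <= 0:
        raise ValueError("baseline must be positive")
    return current_cores / baseline_cores


def looks_stuck(baseline_cores: float, current_cores: float,
                threshold: float = 100.0) -> bool:
    """Flag a pod whose usage exceeds the baseline by `threshold` times.

    The default threshold is an arbitrary illustrative value.
    """
    return spike_factor(baseline_cores, current_cores) >= threshold


# Values reported above: ~0.002 cores idle, ~0.8 cores after the outage.
print(f"{spike_factor(0.002, 0.8):.0f}x")  # prints 400x
print(looks_stuck(0.002, 0.8))             # prints True
```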

      Root Cause Analysis: We assume that this is a bug in the Strimzi OAuth Client Library. The library fails to properly clean up a corrupt authentication state following a runtime failure during token validation. This specifically affects consumers that are waiting for a partition assignment; they enter a tight retry loop that consumes significant CPU resources.
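
      To illustrate why a tight retry loop is so expensive, here is a small simulation. It is not code from the Strimzi OAuth library and says nothing about its internals. It only compares retrying without any delay against retrying with a capped exponential backoff, under the assumption that each attempt costs 1 ms of CPU time. All names and values are hypothetical.

```python
import itertools
from typing import Iterable, Iterator, Tuple


def backoff_delays(base: float = 0.1, cap: float = 30.0) -> Iterator[float]:
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)


def simulate(delays: Iterable[float], window: float = 60.0,
             attempt_cost: float = 0.001) -> Tuple[int, float]:
    """Count attempts in `window` seconds of simulated time.

    Returns (attempts, fraction of the window spent on CPU work).
    No real sleeping happens; time is only added up.
    """
    elapsed = 0.0
    attempts = 0
    for delay in delays:
        if elapsed + attempt_cost > window:
            break
        attempts += 1
        elapsed += attempt_cost + delay
    return attempts, attempts * attempt_cost / window


tight = simulate(itertools.repeat(0.0))  # retry immediately, no pause
patient = simulate(backoff_delays())     # pause longer after each failure
print("tight loop:", tight)
print("with backoff:", patient)
```

      In this model the loop without delay keeps one core almost fully busy for the whole window, while the backoff variant makes only a handful of attempts.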

      Steps to Reproduce: Reproduction repository provided: https://github.com/marcoklaassen/keycloak-kafka-quarkus

      1. Environment Setup: Deploy a Strimzi Kafka Cluster (CR) and Keycloak on OpenShift.
      2. Application Deployment: Deploy a Quarkus native application configured to consume messages using the Strimzi OAuth client.
      3. Scale Mismatch: Scale the deployment to significantly exceed the number of topic partitions (e.g., 10 pods for a 3-partition topic) so that multiple pods are idle/unassigned.
      4. Simulate Outage: Use a proxy (e.g., a Python script) to inject 500 server errors at the Keycloak token endpoint for a few minutes, long enough for all pods to encounter the outage.
      5. End Outage: Restore normal Keycloak operation.
      6. Observation: Monitor the CPU usage of the unassigned pods.
      7. Actual Result: Unassigned pods spike to ~400x normal CPU load and fail to recover.
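
      The fault injection in step 4 could look roughly like the sketch below. This is not the script from the reproduction repository. It is a minimal stand-in built on the Python standard library, assuming the clients can be pointed at the proxy instead of Keycloak and that the token endpoint path ends in "/token". The names `should_fail`, `make_handler`, and `FAULT_ACTIVE` are hypothetical.

```python
import sys
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# set() = inject 500 errors, clear() = forward everything to Keycloak.
FAULT_ACTIVE = threading.Event()


def should_fail(path: str, fault_active: bool) -> bool:
    """Return True if this request should get an injected 500 response."""
    # Assumption: the token endpoint path ends in "/token"; query is ignored.
    return fault_active and path.split("?", 1)[0].rstrip("/").endswith("/token")


def make_handler(upstream: str):
    """Build a handler that forwards to `upstream` unless a fault is injected."""

    class FaultProxy(BaseHTTPRequestHandler):
        def _reply(self, status: int, body: bytes, content_type: str) -> None:
            self.send_response(status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def _handle(self) -> None:
            length = int(self.headers.get("Content-Length") or 0)
            body = self.rfile.read(length) if length else None
            if should_fail(self.path, FAULT_ACTIVE.is_set()):
                self._reply(500, b'{"error":"server_error"}', "application/json")
                return
            request = urllib.request.Request(
                upstream.rstrip("/") + self.path, data=body, method=self.command
            )
            for name in ("Content-Type", "Authorization"):
                if name in self.headers:
                    request.add_header(name, self.headers[name])
            try:
                with urllib.request.urlopen(request, timeout=10) as response:
                    self._reply(response.status, response.read(),
                                response.headers.get("Content-Type", "text/plain"))
            except urllib.error.HTTPError as error:
                # Pass real upstream error responses through unchanged.
                self._reply(error.code, error.read(),
                            error.headers.get("Content-Type", "text/plain"))
            except urllib.error.URLError:
                self._reply(502, b"upstream unreachable", "text/plain")

        do_GET = do_POST = _handle

    return FaultProxy


def main(upstream: str, port: int, outage_seconds: float) -> None:
    """Serve with the fault active, then end the outage after a delay."""
    server = ThreadingHTTPServer(("0.0.0.0", port), make_handler(upstream))
    FAULT_ACTIVE.set()
    threading.Timer(outage_seconds, FAULT_ACTIVE.clear).start()
    server.serve_forever()


# Only starts when all three arguments are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], int(sys.argv[2]), float(sys.argv[3]))
```

      Saved under a file name of your choice, the script takes the upstream Keycloak base URL, a listen port, and the outage duration in seconds as arguments. After the timer fires, token requests reach Keycloak again, which corresponds to step 5.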

      Workaround: Match the replica count of the pods to the number of Kafka partitions to prevent containers from entering the unassigned "waiting" state.

      Strimzi Version 

      Streams for Apache Kafka (3.1.0-6 provided by Red Hat)

      kafkaVersion: 4.0.0

       

      Strimzi Kafka OAuth Version

      0.15.0 (0.17.x is not working with Quarkus Native Build out of the box 🚨🔥)

       

      K8s Version

      OpenShift 4.19.17  (k8s 1.32.6)

       

      Installation Method

      Strimzi Kafka OAuth: Maven dependency in a Quarkus native build

       

      Infrastructure

      SNO OpenShift

              Assignee: Unassigned
              Reporter: Marco Klaassen (mklaasse@redhat.com)