Managed Service - Streams / MGDSTRM-9353

Break glass mechanism to temporarily override an instance's configuration

    • Type: Epic
    • Resolution: Done
    • Priority: Major
    • instance break glass
    • To Do
    • Progress: 0% To Do, 0% In Progress, 100% Done
    • Sprints: MK - Sprint 223, MK - Sprint 224, MK - Sprint 225, MK - Sprint 226

      We were discussing the potential for a Kafka user to cause their Kafka instance to run out of memory by accumulating an excessive number of producer IDs. The resulting OOM would recur on every restart, and SRE would likely have no time to intervene before the service ran out of memory again. It would be hard to recover an instance in this state with the tools we have today.

      Work is under way to address the root cause of the issue (preventing excessive allocation of producer IDs), but it is likely to have a long lead time.

      In the meantime, we need a mechanism to temporarily override a Kafka broker's configuration so that SRE are able (with Engineering's help) to examine the broker, diagnose the issue and possibly make interventions. To illustrate, a use case for the producer ID problem might be:

      1. Increase memory to allow the brokers to come up.
      2. Run tooling to confirm that excessive producer IDs are the root cause.
      3. Temporarily lower transactional.id.expiration.ms so that Kafka expires the accumulated producer IDs that are causing the OOM.
      4. Return the system to its normal state.
      5. Work with the customer to help them fix the issues in their application that caused the producer ID leak.
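      Step 3 works because the broker discards producer state that has been idle for longer than the expiration interval, so a shorter interval frees the leaked entries sooner. The sketch below is a deliberately simplified, hypothetical model of that expiry pass (ProducerEntry and expire_producers are illustrative names, not Kafka's internals); the 7-day figure mirrors Kafka's default for transactional.id.expiration.ms (604800000 ms).

```python
from dataclasses import dataclass
from typing import List

HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS


@dataclass
class ProducerEntry:
    producer_id: int
    last_timestamp_ms: int            # time of the last write seen from this producer
    in_open_transaction: bool = False


def expire_producers(entries: List[ProducerEntry], now_ms: int,
                     expiration_ms: int) -> List[ProducerEntry]:
    """One expiry pass in a simplified model of broker producer state.

    An entry is dropped once it has been idle for at least expiration_ms,
    unless it still has an open transaction.
    """
    return [
        e for e in entries
        if e.in_open_transaction or now_ms - e.last_timestamp_ms < expiration_ms
    ]


now = 30 * DAY_MS
# 1,000 leaked producer IDs, each written once at some point in the last 6 days...
leaked = [ProducerEntry(pid, now - (pid % 6) * DAY_MS - HOUR_MS) for pid in range(1000)]
# ...plus one healthy producer that wrote a second ago.
active = ProducerEntry(5000, now - 1000)
entries = leaked + [active]

print(len(expire_producers(entries, now, 7 * DAY_MS)))  # 7-day interval: 1001 survive
print(len(expire_producers(entries, now, HOUR_MS)))     # 1-hour interval: 1 survives
```

      With the 7-day interval none of the leaked entries is old enough to expire; lowering the interval to one hour clears all of them while the recently active producer survives. This is also why step 4 matters: leaving the interval low permanently would expire legitimate idle producers too.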

      The ability to override the following is likely to be valuable:

      • broker/zookeeper memory
      • environment variables
      • broker configuration
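      One possible shape for such an override is sketched below, assuming (this is not stated in the issue) that the override carries an expiry time so the instance falls back to its normal configuration on its own. InstanceConfig, BreakGlassOverride and effective_config are hypothetical names, and the memory sizes, heap options and timestamps are illustrative values only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass
class InstanceConfig:
    """Normal (reconciled) configuration of one Kafka instance."""
    broker_memory: str
    zookeeper_memory: str
    env: Dict[str, str] = field(default_factory=dict)
    broker_config: Dict[str, str] = field(default_factory=dict)


@dataclass
class BreakGlassOverride:
    """Hypothetical temporary override; every field is optional except the expiry."""
    expires_at: datetime
    broker_memory: Optional[str] = None
    zookeeper_memory: Optional[str] = None
    env: Dict[str, str] = field(default_factory=dict)
    broker_config: Dict[str, str] = field(default_factory=dict)


def effective_config(base: InstanceConfig,
                     override: Optional[BreakGlassOverride],
                     now: datetime) -> InstanceConfig:
    """Layer an unexpired override on top of the base configuration."""
    if override is None or now >= override.expires_at:
        # No override, or it has lapsed: the instance returns to its normal state.
        return base
    return InstanceConfig(
        broker_memory=override.broker_memory or base.broker_memory,
        zookeeper_memory=override.zookeeper_memory or base.zookeeper_memory,
        # Override keys win; keys the override does not mention keep their base values.
        env={**base.env, **override.env},
        broker_config={**base.broker_config, **override.broker_config},
    )


# Illustrative values only.
base = InstanceConfig(
    broker_memory="8Gi",
    zookeeper_memory="2Gi",
    broker_config={"transactional.id.expiration.ms": "604800000"},
)
start = datetime(2022, 1, 1, tzinfo=timezone.utc)
override = BreakGlassOverride(
    expires_at=start + timedelta(hours=4),
    broker_memory="16Gi",
    env={"KAFKA_HEAP_OPTS": "-Xms12g -Xmx12g"},
    broker_config={"transactional.id.expiration.ms": "3600000"},
)

during = effective_config(base, override, start + timedelta(hours=1))
after = effective_config(base, override, start + timedelta(hours=5))
print(during.broker_memory, during.broker_config["transactional.id.expiration.ms"])
print(after.broker_memory, after.broker_config["transactional.id.expiration.ms"])
```

      Keeping the override as a separate layer, rather than editing the base configuration in place, means removing or expiring it is enough to restore the normal state, which matches the "temporarily" requirement and step 4 of the use case.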

              medgar@redhat.com Michael Edgar
              keithbwall Keith Wall
              Kafka Fleet Services
              Votes: 0
              Watchers: 3
