Managed Service - Streams / MGDSTRM-8860

Update SOPs so that SREs know how to work around MGDSTRM-8857


    • MK - Sprint 221

      WHAT

      MGDSTRM-8857 is known to be one of the causes of the UnderReplicatedPartitions or UnderMinIsrPartitionCount conditions. We need to document the recovery procedure.

      Update SOPs so that SREs know to try:

      1. Using kafka-reassign-partitions.sh to move the leadership (in the case where two replicas are in-sync).
      2. Restarting the broker that is the leader for the partition concerned:
        1. First, using the Strimzi annotation.
        2. As a fallback, using {{oc delete pod}} (only if Strimzi won't action the roll).
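
      To make the two workarounds above concrete for SOP review, here is a minimal Python sketch. It is not the agreed SOP text. It assumes that "moving the leadership" means keeping the same replica set but listing a different in-sync replica first, and that the "Strimzi annotation" is strimzi.io/manual-rolling-update=true applied to the broker pod. The function names, topic, namespace and pod names are hypothetical.

```python
import json


def build_reassignment(topic, partition, replicas, current_leader, in_sync):
    """Build reassignment JSON for kafka-reassign-partitions.sh.

    Keeps the same replica set but lists a different in-sync replica
    first, since the first replica in the list is the preferred leader.
    """
    candidates = [b for b in replicas if b in in_sync and b != current_leader]
    if not candidates:
        # Workaround 1 only applies when another replica is in-sync.
        raise ValueError("no other in-sync replica to take over leadership")
    new_leader = candidates[0]
    reordered = [new_leader] + [b for b in replicas if b != new_leader]
    plan = {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": reordered}
        ],
    }
    return json.dumps(plan, indent=2)


def restart_commands(namespace, pod):
    """Return (preferred, fallback) command lines for rolling one broker pod."""
    # Preferred: ask Strimzi to roll the pod (annotation name is an assumption
    # about which Strimzi annotation the ticket refers to).
    preferred = (
        f"oc annotate pod {pod} strimzi.io/manual-rolling-update=true -n {namespace}"
    )
    # Fallback: only if Strimzi won't action the roll.
    fallback = f"oc delete pod {pod} -n {namespace}"
    return preferred, fallback


if __name__ == "__main__":
    # Hypothetical partition: replicas 0, 1, 2; broker 0 leads; 0 and 1 are in-sync.
    print(build_reassignment("my-topic", 0, [0, 1, 2], current_leader=0, in_sync=[0, 1]))
    for cmd in restart_commands("kafka-example", "example-kafka-0"):
        print(cmd)
```

      The generated JSON would be saved to a file and passed to kafka-reassign-partitions.sh with --bootstrap-server, --reassignment-json-file and --execute. Depending on cluster settings, leadership may not move until a preferred leader election runs (for example via kafka-leader-election.sh), so the SOP should say how to confirm the new leader before falling through to the broker restart.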

      We need to phrase this in such a way that SREs only take this path when this defect is suspected. We don't want SREs to be conditioned into 'turning it off and on again' as a cure-all.

      WHY

      Allow SREs to work around MGDSTRM-8857 and return managed Kafka to normal service.

       

      HOW

       

      https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/blob/ecb6151b177b4019deec3072fd9fe7250c663871/sops/alerts/partition_under_replicated.asciidoc

      https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/blob/935470cd541d0a885cb1b79f274223c6d8bb5d80/sops/alerts/under_min_isr_partition_count.asciidoc

      DONE

      Include the following where applicable:

      • SOPs updated, reviewed and accepted by RTS and SREs.

       

            lukchen@redhat.com Luke Chen
            keithbwall Keith Wall
            Kafka Integrations
