AMQ Streams
ENTMQST-3676

Ever growing __consumer_offsets partitions


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Critical
    • 1.6.4.GA, 1.7.0.GA
    • 1.0.0.OCP.GA, 1.0.0.RHEL.GA, 1.0.1.GA, 1.1.0.GA, 1.1.1.GA, 1.2.0.GA, 1.3.0.GA, 1.4.0.GA, 1.4.1.GA, 1.5.0.GA, 1.6.0.GA, 1.6.2.GA
    • Component: kafka-broker

      We are investigating an issue on AMQ Streams 1.7 (Kafka 2.7) on RHEL, where some __consumer_offsets partitions are never cleaned, causing the disk to fill up and long consumer group (CG) recovery times when a broker is restarted.

      Focusing on one of these partitions (__consumer_offsets-15), we identified the root cause as a stuck last stable offset (LSO), due to a hanging transaction. Note that the issue on this specific partition dates back to 2020-11-26T21:04:33.837Z, when the customer was running Kafka 2.2. The other affected partitions probably date from around the same time (to be confirmed).

      The following report is derived from LogCleanerManager DEBUG logs. You can see that the gap between the FUO and the ASBO grows with the partition size, while it is zero in all non-affected partitions.

      PARTITION              SIZE  FDO         FUO         ASBO        ASBO-FUO
      __consumer_offsets-15  316G  13833236    13967415    2228192991  2214225576
      __consumer_offsets-15  316G  13833236    13967415    2228916349  2214948934
      __consumer_offsets-24  254G  905365695   905504812   2983202580  2077697768
      __consumer_offsets-36  292G  569850938   710582242   2289537500  1578955258
      __consumer_offsets-5   361M  107086630   108745431   108745431   0
      __consumer_offsets-30  180M  1316191746  1316191746  1316191746  0
      
      FDO: firstDirtyOffset
      FUO: firstUncleanableOffset
      ASBO: activeSegment.baseOffset
      

      The FUO never changes and is equal to the LSO. The LogCleaner doesn't clean beyond the LSO, so the dirty ratio (DR) remains constant and the partition never becomes the filthiest log. For that reason, it is never selected for cleaning.
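      To make the mechanics concrete, below is a simplified sketch of the "filthiest log" selection (roughly what kafka.log.LogToClean computes), using offset ranges as a stand-in for byte sizes and assuming a log start offset of 0. It is not the actual LogCleaner code, but it shows why a stuck FUO freezes the ratio.

      // Simplified sketch, not the actual LogCleaner code: only the [FDO, FUO)
      // range counts as cleanable, so while the FUO is stuck at the hanging
      // transaction's LSO the ratio never changes, no matter how far the log
      // grows past the FUO.
      public class FilthiestLogSketch {

          static double cleanableRatio(long logStartOffset, long firstDirtyOffset, long firstUncleanableOffset) {
              long cleanBytes = firstDirtyOffset - logStartOffset;             // portion already cleaned
              long cleanableBytes = firstUncleanableOffset - firstDirtyOffset; // portion the cleaner may touch
              return (double) cleanableBytes / (cleanBytes + cleanableBytes);
          }

          public static void main(String[] args) {
              // __consumer_offsets-15 values from the report above (log start offset assumed to be 0).
              System.out.println(cleanableRatio(0L, 13833236L, 13967415L)); // ~0.0096, far below the default
                                                                            // min.cleanable.dirty.ratio of 0.5
              // Bytes beyond the FUO (up to ASBO=2228192991) never enter the ratio,
              // so the partition never looks dirty enough, however big it gets.
          }
      }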

      Dumping the __consumer_offsets-15 segment that includes the FUO (13967415), we were able to identify an open/hanging transaction using klog txn-stat.

      Filtering by the open TX's PID (173605), you can see that the data message (offset 13967415) belonging to the last aborted transaction arrived out of order, after its abort control message (offset 13967412), leaving the transaction open.

      # committed transaction
      Batch(baseOffset=13967365, lastOffset=13967365, count=1, baseSequence=0, lastSequence=0, producerId=173605, producerEpoch=0, partitionLeaderEpoch=142, isTransactional=true, isControl=false, position=14332658, createTime=2020-11-26T21:03:11.731Z, size=158, magic=2, compressCodec='NONE', crc=-1505839980, isValid=true)
        DataMessage(offset=13967365, createTime=2020-11-26T21:03:11.731Z, keySize=64, valueSize=24, sequence=0, headerKeys='')
      Batch(baseOffset=13967366, lastOffset=13967366, count=1, baseSequence=-1, lastSequence=-1, producerId=173605, producerEpoch=0, partitionLeaderEpoch=142, isTransactional=true, isControl=true, position=14332816, createTime=2020-11-26T21:03:11.746Z, size=78, magic=2, compressCodec='NONE', crc=814189139, isValid=true)
        ControlMessage(offset=13967366, createTime=2020-11-26T21:03:11.746Z, keySize=4, valueSize=6, sequence=-1, headers='', commit=commit, coordinatorEpoch=146)
      
      # aborted transaction
      Batch(baseOffset=13967412, lastOffset=13967412, count=1, baseSequence=-1, lastSequence=-1, producerId=173605, producerEpoch=1, partitionLeaderEpoch=143, isTransactional=true, isControl=true, position=14337223, createTime=2020-11-26T21:04:25.783Z, size=78, magic=2, compressCodec='NONE', crc=716648242, isValid=true)
        ControlMessage(offset=13967412, createTime=2020-11-26T21:04:25.783Z, keySize=4, valueSize=6, sequence=-1, headers='', commit=abort, coordinatorEpoch=146)
      
      # open transaction
      Batch(baseOffset=13967415, lastOffset=13967415, count=1, baseSequence=0, lastSequence=0, producerId=173605, producerEpoch=0, partitionLeaderEpoch=143, isTransactional=true, isControl=false, position=14337457, createTime=2020-11-26T21:04:33.837Z, size=158, magic=2, compressCodec='NONE', crc=-1790524556, isValid=true)
        DataMessage(offset=13967415, createTime=2020-11-26T21:04:33.837Z, keySize=64, valueSize=24, sequence=0, headerKeys='')
      

      This looks similar to what is described in KAFKA-12671.
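      To confirm which other partitions carry the same pattern, one option is to scan segment dumps for transactional producers whose last batch is a data batch with no commit/abort marker after it. The sketch below assumes klog/kafka-dump-log.sh style output like the excerpt above and is only a rough filter; adjust the parsing for other dump flavours.

      import java.io.BufferedReader;
      import java.io.InputStreamReader;
      import java.util.LinkedHashMap;
      import java.util.Map;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      // Rough sketch: read a segment dump from stdin and report producer IDs whose
      // last transactional batch is a data batch, i.e. a potentially open/hanging
      // transaction with no control marker after it.
      public class HangingTxnScan {

          private static final Pattern BATCH = Pattern.compile(
                  "Batch\\(baseOffset=(\\d+).*?producerId=(\\d+).*?" +
                  "isTransactional=(true|false).*?isControl=(true|false)");

          public static void main(String[] args) throws Exception {
              // For each transactional producerId, remember the base offset of its last
              // data batch, or null if its last batch was a control (commit/abort) marker.
              Map<Long, Long> lastOpenDataOffset = new LinkedHashMap<>();

              try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
                  String line;
                  while ((line = in.readLine()) != null) {
                      Matcher m = BATCH.matcher(line);
                      if (!m.find() || !Boolean.parseBoolean(m.group(3))) {
                          continue; // not a transactional batch line
                      }
                      long baseOffset = Long.parseLong(m.group(1));
                      long producerId = Long.parseLong(m.group(2));
                      boolean isControl = Boolean.parseBoolean(m.group(4));
                      // A control batch closes the transaction, a data batch (re)opens it.
                      lastOpenDataOffset.put(producerId, isControl ? null : baseOffset);
                  }
              }

              lastOpenDataOffset.forEach((pid, offset) -> {
                  if (offset != null) {
                      System.out.printf("possible hanging txn: producerId=%d, last data batch at offset %d%n", pid, offset);
                  }
              });
          }
      }

      Run against the dump of __consumer_offsets-15 above, this reports producerId=173605 with its last data batch at offset 13967415, matching the stuck LSO.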

      The immediate workaround for this issue would be to delete the topic and recreate it. This cannot be done easily, especially when previous data needs to be copied, and when the affected topic is the internal __consumer_offsets. Deleting the old offsets up to the hanging TX would probably not reset the LSO. Aborting the hanging TX with kafka-transactions.sh (Kafka 3.0+) or rewriting the affected segments are the safest options.
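      As a sketch of the kafka-transactions.sh route, the same abort can be issued programmatically through the Admin API added in Kafka 3.0 (KIP-664). The bootstrap server is a placeholder, and the producer id, producer epoch and coordinator epoch below are taken from the segment dump above; they must be re-verified against a current dump before running this against a live cluster.

      import java.util.Properties;
      import org.apache.kafka.clients.admin.AbortTransactionSpec;
      import org.apache.kafka.clients.admin.Admin;
      import org.apache.kafka.clients.admin.AdminClientConfig;
      import org.apache.kafka.common.TopicPartition;

      // Sketch: abort the hanging transaction on __consumer_offsets-15 via the
      // Admin API (Kafka 3.0+), the programmatic equivalent of kafka-transactions.sh abort.
      public class AbortHangingTxn {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

              try (Admin admin = Admin.create(props)) {
                  AbortTransactionSpec spec = new AbortTransactionSpec(
                          new TopicPartition("__consumer_offsets", 15),
                          173605L,   // producerId of the open transaction
                          (short) 0, // producerEpoch from the hanging data batch
                          146);      // coordinatorEpoch from the last control marker
                  admin.abortTransaction(spec).all().get();
              }
          }
      }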

              Assignee: Unassigned
              Reporter: Federico Valeri (rhn-support-fvaleri)
