-
Bug
-
Resolution: Obsolete
-
Critical
-
1.0.0.OCP.GA, 1.0.0.RHEL.GA, 1.0.1.GA, 1.1.0.GA, 1.1.1.GA, 1.2.0.GA, 1.3.0.GA, 1.4.0.GA, 1.4.1.GA, 1.5.0.GA, 1.6.0.GA, 1.6.2.GA
-
False
-
False
-
-
We are investigating an issue on AMQ Streams 1.7 (Kafka 2.7) on RHEL, where some __consumer_offsets partitions are never cleaned, causing the disk to fill up and long consumer group (CG) recovery times when a broker is restarted.
Focusing on one of these partitions (__consumer_offsets-15), we identified the root cause as a stuck last stable offset (LSO), due to a hanging transaction. Note that the issue on this specific partition dates back to 2020-11-26T21:04:33.837Z, when the customer was running Kafka 2.2. The other affected partitions probably date back to around the same time (to be confirmed).
The following report is derived from LogCleanerManager DEBUG logs. You can see that the gap between FUO and ASBO grows proportionally with the partition size, while it is zero in all non-affected partitions.
PARTITION              SIZE  FDO         FUO         ASBO        ASBO-FUO
__consumer_offsets-15  316G  13833236    13967415    2228192991  2,214,225,576
__consumer_offsets-15  316G  13833236    13967415    2228916349  2,214,948,934
__consumer_offsets-24  254G  905365695   905504812   2983202580  2,077,697,768
__consumer_offsets-36  292G  569850938   710582242   2289537500  1,578,955,258
__consumer_offsets-5   361M  107086630   108745431   108745431   0
__consumer_offsets-30  180M  1316191746  1316191746  1316191746  0

FDO: firstDirtyOffset
FUO: firstUncleanableOffset
ASBO: activeSegment.baseOffset
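The stuck-partition check behind the table above can be reproduced with a small script. This is a minimal sketch: the values are hardcoded from the report, where in practice they would be parsed out of the LogCleanerManager DEBUG lines.

```python
# Flag partitions whose firstUncleanableOffset (FUO) lags far behind the
# active segment base offset (ASBO): a non-zero, growing gap suggests a
# stuck last stable offset. Values are taken from the report above.
rows = [
    # (partition, FDO, FUO, ASBO)
    ("__consumer_offsets-15", 13833236, 13967415, 2228916349),
    ("__consumer_offsets-24", 905365695, 905504812, 2983202580),
    ("__consumer_offsets-36", 569850938, 710582242, 2289537500),
    ("__consumer_offsets-5", 107086630, 108745431, 108745431),
    ("__consumer_offsets-30", 1316191746, 1316191746, 1316191746),
]

def stuck_partitions(rows):
    # A healthy partition has FUO == ASBO (gap of zero).
    return [(p, asbo - fuo) for p, fdo, fuo, asbo in rows if asbo - fuo > 0]

for partition, gap in stuck_partitions(rows):
    print(f"{partition}: uncleanable range of {gap:,} offsets")
```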
The FUO never changes and is equal to the LSO. The LogCleaner does not clean beyond the LSO, so the dirty ratio (DR) remains constant and the partition never becomes the filthiest. For that reason, it is never selected for cleaning.
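To illustrate why a pinned FUO starves the partition of cleaning: the LogCleaner scores each log by its cleanable ratio (the bytes between the first dirty offset and the first uncleanable offset, over the total bytes) and cleans the filthiest log first. A rough sketch of that selection logic, with made-up byte counts rather than Kafka's actual implementation:

```python
def cleanable_ratio(clean_bytes, cleanable_bytes):
    # The cleaner computes roughly cleanableBytes / (cleanBytes + cleanableBytes);
    # only the [firstDirtyOffset, firstUncleanableOffset) range counts as
    # cleanable, so a pinned FUO caps cleanable_bytes indefinitely.
    return cleanable_bytes / (clean_bytes + cleanable_bytes)

# Hypothetical logs: (name, clean bytes, cleanable bytes).
logs = [
    ("__consumer_offsets-15", 300 * 2**30, 10 * 2**20),  # FUO pinned: tiny cleanable range
    ("__consumer_offsets-5", 100 * 2**20, 250 * 2**20),  # healthy: large cleanable range
]

# The cleaner picks the log with the highest ratio ("the filthiest").
filthiest = max(logs, key=lambda l: cleanable_ratio(l[1], l[2]))
print(filthiest[0])  # the stuck partition always loses this contest
```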
Dumping the __consumer_offsets-15 segment that includes the FUO (13967415), we were able to identify an open/hanging transaction using klog txn-stat.
Filtering by the open transaction's PID (173605), you can see that the data message (offset 13967415) related to the last abort control message (offset 13967412) arrives out of order, after the abort marker, leaving the transaction open.
# committed transaction
Batch(baseOffset=13967365, lastOffset=13967365, count=1, baseSequence=0, lastSequence=0, producerId=173605, producerEpoch=0, partitionLeaderEpoch=142, isTransactional=true, isControl=false, position=14332658, createTime=2020-11-26T21:03:11.731Z, size=158, magic=2, compressCodec='NONE', crc=-1505839980, isValid=true)
  DataMessage(offset=13967365, createTime=2020-11-26T21:03:11.731Z, keySize=64, valueSize=24, sequence=0, headerKeys='')
Batch(baseOffset=13967366, lastOffset=13967366, count=1, baseSequence=-1, lastSequence=-1, producerId=173605, producerEpoch=0, partitionLeaderEpoch=142, isTransactional=true, isControl=true, position=14332816, createTime=2020-11-26T21:03:11.746Z, size=78, magic=2, compressCodec='NONE', crc=814189139, isValid=true)
  ControlMessage(offset=13967366, createTime=2020-11-26T21:03:11.746Z, keySize=4, valueSize=6, sequence=-1, headers='', commit=commit, coordinatorEpoch=146)

# aborted transaction
Batch(baseOffset=13967412, lastOffset=13967412, count=1, baseSequence=-1, lastSequence=-1, producerId=173605, producerEpoch=1, partitionLeaderEpoch=143, isTransactional=true, isControl=true, position=14337223, createTime=2020-11-26T21:04:25.783Z, size=78, magic=2, compressCodec='NONE', crc=716648242, isValid=true)
  ControlMessage(offset=13967412, createTime=2020-11-26T21:04:25.783Z, keySize=4, valueSize=6, sequence=-1, headers='', commit=abort, coordinatorEpoch=146)

# open transaction
Batch(baseOffset=13967415, lastOffset=13967415, count=1, baseSequence=0, lastSequence=0, producerId=173605, producerEpoch=0, partitionLeaderEpoch=143, isTransactional=true, isControl=false, position=14337457, createTime=2020-11-26T21:04:33.837Z, size=158, magic=2, compressCodec='NONE', crc=-1790524556, isValid=true)
  DataMessage(offset=13967415, createTime=2020-11-26T21:04:33.837Z, keySize=64, valueSize=24, sequence=0, headerKeys='')
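The effect of that out-of-order data batch on the LSO can be simulated with a simplified model (this is not broker code): the LSO cannot advance past the first offset of the earliest still-open transaction, so a data batch landing after its abort marker re-opens a transaction that is never closed.

```python
# Simplified model of last-stable-offset (LSO) tracking: the LSO is pinned
# at the first offset of the earliest open transaction, or tracks the log
# end offset when no transaction is open. Not actual broker code.
class LsoTracker:
    def __init__(self):
        self.open_txn_start = {}  # producerId -> first offset of its open txn
        self.log_end = 0

    def append_data(self, offset, pid):
        self.open_txn_start.setdefault(pid, offset)
        self.log_end = offset + 1

    def append_control(self, offset, pid):  # commit or abort marker
        self.open_txn_start.pop(pid, None)
        self.log_end = offset + 1

    @property
    def lso(self):
        return min(self.open_txn_start.values(), default=self.log_end)

t = LsoTracker()
t.append_data(13967365, 173605)     # transactional data batch
t.append_control(13967366, 173605)  # commit marker: txn closed
t.append_control(13967412, 173605)  # abort marker (after epoch bump)
t.append_data(13967415, 173605)     # out-of-order data: re-opens the txn
print(t.lso)  # 13967415 -- stuck, matching the observed FUO
```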
This looks similar to what is described in KAFKA-12671.
The immediate workaround for this issue would be to delete the topic and recreate it. This cannot be done easily, especially when previous data needs to be copied, and when the affected topic is the internal __consumer_offsets. Deleting the old offsets up to the hanging TX would probably not reset the LSO. Aborting the hanging TX with kafka-transactions.sh (Kafka 3.0+) or rewriting the affected segments are the safest options.
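For reference, the abort path with the Kafka 3.0+ tooling (KIP-664) would look roughly like the following; the bootstrap address, broker id, and start offset are placeholders taken from this report, and the exact flags should be verified against the installed Kafka version:

```
# Locate hanging transactions on a broker.
bin/kafka-transactions.sh --bootstrap-server localhost:9092 find-hanging --broker-id 0

# Abort the hanging transaction by pointing at the partition and the
# first offset of the open transaction (13967415 from the dump above).
bin/kafka-transactions.sh --bootstrap-server localhost:9092 abort \
  --topic __consumer_offsets --partition 15 --start-offset 13967415
```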
- links to