AMQ Streams / ENTMQST-3676

Ever growing __consumer_offsets partitions


Details

    • Type: Bug
    • Resolution: Obsolete
    • Priority: Critical
    • 1.6.4.GA, 1.7.0.GA
    • 1.0.0.OCP.GA, 1.0.0.RHEL.GA, 1.0.1.GA, 1.1.0.GA, 1.1.1.GA, 1.2.0.GA, 1.3.0.GA, 1.4.0.GA, 1.4.1.GA, 1.5.0.GA, 1.6.0.GA, 1.6.2.GA
    • Component: kafka-broker

    Description

      We are investigating an issue on AMQ Streams 1.7 (Kafka 2.7) on RHEL: some __consumer_offsets partitions are never cleaned, causing the disk to fill up and long consumer group (CG) recovery times when a broker is restarted.

      Focusing on one of these partitions (__consumer_offsets-15), we identified the root cause as a stuck last stable offset (LSO) due to a hanging transaction. Note that the issue on this specific partition dates back to 2020-11-26T21:04:33.837Z, when the customer was running Kafka 2.2. The other affected partitions probably date back to around the same time (to be confirmed).

      The following report is derived from LogCleanerManager DEBUG logs. You can see that the gap between FUO and ASBO grows in proportion to the partition size, while it is zero in all unaffected partitions.

      PARTITION              SIZE  FDO         FUO         ASBO        ASBO-FUO
      __consumer_offsets-15  316G  13833236    13967415    2228192991  2214225576
      __consumer_offsets-15  316G  13833236    13967415    2228916349  2214948934
      __consumer_offsets-24  254G  905365695   905504812   2983202580  2077697768
      __consumer_offsets-36  292G  569850938   710582242   2289537500  1578955258
      __consumer_offsets-5   361M  107086630   108745431   108745431   0
      __consumer_offsets-30  180M  1316191746  1316191746  1316191746  0
      
      FDO: firstDirtyOffset
      FUO: firstUncleanableOffset
      ASBO: activeSegment.baseOffset
      

      The FUO never changes and is equal to the LSO. The LogCleaner never cleans beyond the LSO, so the dirty ratio (DR) remains constant and the partition never becomes the filthiest log; for that reason, it is never selected for cleaning.
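
      As a reminder, the cleaner picks the log with the highest cleanable ("dirty") ratio and, if we read LogToClean correctly, that ratio only counts bytes up to the first uncleanable offset. Below is a minimal sketch (not Kafka's actual code, sizes are made up) of why the ratio freezes when the FUO is pinned at a stuck LSO: its inputs never change, no matter how much data piles up past the LSO.

      // Minimal sketch of the cleanable ratio used to pick the "filthiest" log.
      // cleanBytes covers [logStartOffset, FDO) and cleanableBytes covers [FDO, FUO);
      // bytes past the FUO (here, the stuck LSO) are not counted at all.
      public final class DirtyRatioSketch {
          static double cleanableRatio(long cleanBytes, long cleanableBytes) {
              return (double) cleanableBytes / (cleanBytes + cleanableBytes);
          }

          public static void main(String[] args) {
              // Hypothetical sizes: with FDO and FUO frozen, these inputs never change,
              // so the ratio stays constant while the partition keeps growing past the LSO.
              long cleanBytes = 50L * 1024 * 1024;     // [logStartOffset, FDO)
              long cleanableBytes = 10L * 1024 * 1024; // [FDO, FUO), capped by the stuck LSO
              System.out.printf("dirty ratio: %.3f%n", cleanableRatio(cleanBytes, cleanableBytes));
          }
      }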

      Dumping the __consumer_offsets-15 segment that includes the FUO (13967415), we were able to identify an open/hanging transaction using klog txn-stat.

      Filtering by the open transaction's producer ID (PID 173605), you can see that the data message (offset 13967415) related to the last abort control message (offset 13967412) arrived after that marker, i.e. out of order, which leaves the transaction open.

      # committed transaction
      Batch(baseOffset=13967365, lastOffset=13967365, count=1, baseSequence=0, lastSequence=0, producerId=173605, producerEpoch=0, partitionLeaderEpoch=142, isTransactional=true, isControl=false, position=14332658, createTime=2020-11-26T21:03:11.731Z, size=158, magic=2, compressCodec='NONE', crc=-1505839980, isValid=true)
        DataMessage(offset=13967365, createTime=2020-11-26T21:03:11.731Z, keySize=64, valueSize=24, sequence=0, headerKeys='')
      Batch(baseOffset=13967366, lastOffset=13967366, count=1, baseSequence=-1, lastSequence=-1, producerId=173605, producerEpoch=0, partitionLeaderEpoch=142, isTransactional=true, isControl=true, position=14332816, createTime=2020-11-26T21:03:11.746Z, size=78, magic=2, compressCodec='NONE', crc=814189139, isValid=true)
        ControlMessage(offset=13967366, createTime=2020-11-26T21:03:11.746Z, keySize=4, valueSize=6, sequence=-1, headers='', commit=commit, coordinatorEpoch=146)
      
      # aborted transaction
      Batch(baseOffset=13967412, lastOffset=13967412, count=1, baseSequence=-1, lastSequence=-1, producerId=173605, producerEpoch=1, partitionLeaderEpoch=143, isTransactional=true, isControl=true, position=14337223, createTime=2020-11-26T21:04:25.783Z, size=78, magic=2, compressCodec='NONE', crc=716648242, isValid=true)
        ControlMessage(offset=13967412, createTime=2020-11-26T21:04:25.783Z, keySize=4, valueSize=6, sequence=-1, headers='', commit=abort, coordinatorEpoch=146)
      
      # open transaction
      Batch(baseOffset=13967415, lastOffset=13967415, count=1, baseSequence=0, lastSequence=0, producerId=173605, producerEpoch=0, partitionLeaderEpoch=143, isTransactional=true, isControl=false, position=14337457, createTime=2020-11-26T21:04:33.837Z, size=158, magic=2, compressCodec='NONE', crc=-1790524556, isValid=true)
        DataMessage(offset=13967415, createTime=2020-11-26T21:04:33.837Z, keySize=64, valueSize=24, sequence=0, headerKeys='')
      

      This looks similar to what is described in KAFKA-12671.

      The immediate workaround for this issue would be to delete the topic and recreate it. This cannot be done easily, especially when previous data needs to be copied, and when the affected topic is the internal __consumer_offsets. Deleting the old records up to the hanging TX would probably not reset the LSO. Aborting the hanging TX with kafka-transactions.sh (Kafka 3.0+) or rewriting the affected segments are the safest options.
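
      For reference, here is a hedged sketch of the Kafka 3.0+ Admin API calls behind kafka-transactions.sh: describeProducers() to spot the producer with a long-open transaction on the partition, and abortTransaction() to write the missing abort marker. The bootstrap address is a placeholder, and the producerId/epoch and coordinatorEpoch are taken from the dump above; they must be verified against the actual cluster before aborting anything.

      import java.util.List;
      import java.util.Properties;
      import org.apache.kafka.clients.admin.AbortTransactionSpec;
      import org.apache.kafka.clients.admin.Admin;
      import org.apache.kafka.clients.admin.DescribeProducersResult;
      import org.apache.kafka.common.TopicPartition;

      public class AbortHangingTxn {
          public static void main(String[] args) throws Exception {
              Properties props = new Properties();
              props.put("bootstrap.servers", "broker:9092"); // placeholder address

              TopicPartition tp = new TopicPartition("__consumer_offsets", 15);

              try (Admin admin = Admin.create(props)) {
                  // List producer states on the partition; a hanging transaction shows up
                  // as an old currentTransactionStartOffset that never moves.
                  DescribeProducersResult producers = admin.describeProducers(List.of(tp));
                  producers.partitionResult(tp).get().activeProducers()
                           .forEach(System.out::println);

                  // Values from the segment dump: producerId=173605, producerEpoch=0,
                  // coordinatorEpoch=146 (double-check them before running this).
                  admin.abortTransaction(new AbortTransactionSpec(tp, 173605L, (short) 0, 146))
                       .all()
                       .get();
              }
          }
      }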

            People

              Assignee: Unassigned
              Reporter: Federico Valeri (rhn-support-fvaleri)
              Votes: 1
              Watchers: 6
