Type: Epic
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None

Epic Name: instance break glass
Blocked: False
Blocked Reason: None
Ready: False
Discussed with Team: No
Epic Status: To Do
Hierarchy Progress Bar: 0% To Do, 0% In Progress, 100% Done
[QE] How to address?: ---
[QE] Why QE missed?: ---

Sprint: MK - Sprint 223, MK - Sprint 224, MK - Sprint 225, MK - Sprint 226


We discussed the potential for a Kafka user to drive their Kafka instance out of memory by accumulating excessive producer IDs. The resulting OOM would recur on every restart, so SRE would likely have no time to intervene before the service OOMs again. An instance in this state would be hard to recover with the tools we have today.

Work is under way to address the root cause of the issue (preventing excessive allocation of producer IDs), but it is likely to have a long lead time.

In the meantime, we need a mechanism that lets us temporarily override a Kafka broker's configuration so that SRE can (with Engineering's help) examine the broker, diagnose the issue, and possibly intervene. To illustrate, a use case for the producer ID problem might be:

1. Increase memory to allow the brokers to come up.
2. Run tooling to confirm that excessive producer IDs are the root cause.
3. Temporarily lower transactional.id.expiration.ms so that Kafka flushes out the accumulated producer IDs that cause the OOM.
4. Return the system to its normal state.
5. Work with the customer to help them fix the application issues that cause the producer ID leak.
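
The override-and-revert part of the steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the mechanism this epic delivered: the broker configuration is modeled as a plain dictionary, `temporary_override` is a hypothetical helper, and the lowered value of 60000 ms is purely illustrative. The baseline of 604800000 ms (7 days) is Kafka's documented default for transactional.id.expiration.ms.

```python
from contextlib import contextmanager

# Assumed baseline: Kafka's default transactional.id.expiration.ms (7 days).
BASELINE = {"transactional.id.expiration.ms": "604800000"}


@contextmanager
def temporary_override(config, overrides):
    """Apply overrides to a config dict and restore the original on exit.

    Hypothetical helper: a real mechanism would also have to roll the
    change out to the brokers, which this sketch does not attempt.
    """
    saved = dict(config)      # snapshot of the normal state
    config.update(overrides)  # break-glass values win while the block runs
    try:
        yield config
    finally:
        config.clear()
        config.update(saved)  # return the system to its normal state


broker_config = dict(BASELINE)
with temporary_override(
    broker_config, {"transactional.id.expiration.ms": "60000"}
) as effective:
    # While the override is active, the lowered expiration is in effect.
    during = effective["transactional.id.expiration.ms"]

# After the block, the original value is back even if an error occurred.
after = broker_config["transactional.id.expiration.ms"]
```

Restoring in a `finally` clause reflects the "temporary" requirement: the override cannot outlive the intervention, even when a diagnostic step fails partway through.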

The ability to override the following is likely to be valuable:

broker/zookeeper memory
environment variables
broker configuration
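
One way to picture these three override categories is as optional layers over a component's normal settings. The Python sketch below is an assumption-laden illustration: `ComponentSpec`, `BreakGlassOverride`, and `effective_spec` are hypothetical names, and the memory sizes and heap options are made-up example values, not settings taken from this issue.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ComponentSpec:
    """Hypothetical normal settings for a broker or ZooKeeper node."""
    memory: str
    env: Dict[str, str] = field(default_factory=dict)
    config: Dict[str, str] = field(default_factory=dict)


@dataclass
class BreakGlassOverride:
    """Hypothetical override; anything left unset keeps its normal value."""
    memory: Optional[str] = None
    env: Dict[str, str] = field(default_factory=dict)
    config: Dict[str, str] = field(default_factory=dict)


def effective_spec(base, override):
    """Layer the override over the base and return a new spec.

    The base is never mutated, so dropping the override restores normal state.
    """
    return ComponentSpec(
        memory=override.memory or base.memory,
        env={**base.env, **override.env},
        config={**base.config, **override.config},
    )


# Example values are illustrative only.
normal = ComponentSpec(
    memory="8Gi",
    env={"KAFKA_HEAP_OPTS": "-Xmx4g"},
    config={"transactional.id.expiration.ms": "604800000"},
)
override = BreakGlassOverride(
    memory="16Gi",
    config={"transactional.id.expiration.ms": "60000"},
)
effective = effective_spec(normal, override)
```

Keeping the override as a separate object, rather than editing the normal settings in place, makes it easy to see exactly what was changed during an incident and to remove it afterwards.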

Assignee: Michael Edgar

Reporter: Keith Wall

Team: Kafka Fleet Services

Votes: 0

Watchers: 3

Created: 2022/08/03 2:57 PM

Updated: 2022/11/03 12:42 PM

Resolved: 2022/11/03 12:42 PM
