Type: Epic
Summary: instance break glass
Priority: Major
Resolution: Done
Status: To Do
Progress: 0% To Do, 0% In Progress, 100% Done
Sprints: MK - Sprint 223, MK - Sprint 224, MK - Sprint 225, MK - Sprint 226
We were discussing the potential for a Kafka user to cause their Kafka instance to run out of memory by exhausting producer IDs. This would lead to an OOM condition that recurs on restart, and SRE would likely have no time to intervene before the service OOMs again. With the tools we have today, it would be hard to recover an instance in this state.
There is ongoing effort to address the root cause of the issue (preventing excessive allocation of producer IDs), but this is likely to have a long lead time.
In the meantime, we need a mechanism that allows us to temporarily override a Kafka broker's configuration so that SRE (with Engineering's help) can examine the broker, diagnose the issue, and possibly intervene. To illustrate, a use case for the producer ID problem might be:
- Increase memory to allow the brokers to come up.
- Run tooling to confirm that excessive producer IDs are the root cause (see the sketch after this list).
- Temporarily lower transactional.id.expiration.ms so that Kafka expires the accumulated producer IDs that caused the OOM.
- Return the system to its normal state.
- Work with the customer to help them fix the application issue that causes the producer ID leak.
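As a sketch of what the diagnostic tooling in the second step could look like, the snippet below uses the Kafka Admin client's describeProducers API (KIP-664, available in recent Kafka clients) to count the active producer IDs per partition. The bootstrap address and the topic/partitions are placeholders; a real tool would enumerate every partition in the cluster.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeProducersResult;
import org.apache.kafka.common.TopicPartition;

public class ProducerIdCount {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Placeholder partitions; a real tool would list them via describeTopics().
            List<TopicPartition> partitions = List.of(
                    new TopicPartition("orders", 0),
                    new TopicPartition("orders", 1));

            // Ask the brokers for the active producer state of each partition.
            Map<TopicPartition, DescribeProducersResult.PartitionProducerState> states =
                    admin.describeProducers(partitions).all().get();

            // A very large total here would support the producer ID exhaustion theory.
            states.forEach((tp, state) -> System.out.printf(
                    "%s: %d active producer IDs%n", tp, state.activeProducers().size()));
        }
    }
}
```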
The ability to override the following is likely to be valuable:
- broker/ZooKeeper memory
- environment variables
- broker configuration
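For the broker configuration case, Kafka's Admin client can already apply temporary overrides to the subset of settings that are dynamically updatable. A minimal sketch (the broker ID, bootstrap address, and chosen config are illustrative, not part of this epic):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class BrokerConfigOverride {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Target broker 0; an empty resource name ("") sets a cluster-wide default.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");

            // SET applies the override; a later DELETE op reverts to the static value.
            AlterConfigOp override = new AlterConfigOp(
                    new ConfigEntry("log.cleaner.threads", "2"), AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(broker, List.of(override))).all().get();
        }
    }
}
```

This only covers dynamically updatable settings, though. Configs the broker treats as read-only (which, per the Kafka broker config docs, include transactional.id.expiration.ms), along with memory limits and environment variables, are exactly the cases that need the deployment-level break-glass mechanism this epic proposes.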