-
Task
-
Resolution: Won't Do
-
Critical
-
None
-
None
-
None
-
None
-
False
-
None
-
False
-
No
-
MGDSRVS-336 - Keep Openshift Streams components up-to-date
WHAT
As discussed on MGDSTRM-9976, RHOSAK currently configures the strimzi cluster operator deployment to have 1 replica with 3 cpus assigned to the container.
When running under StrimziPodSets, the recommendation to achieve high availability is to have at least two replicas.
If we were to simply increase the number of replicas, with the current CPU assignment that would result in the following CPU consumption, which is excessive.
2 (strimzi deployments per bundle (old/new)) * 2 (strimzi cluster operator replicas) * 3 cpus = 12 cpus per cluster.
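As a quick check of the arithmetic above, a minimal sketch follows. The function and parameter names are illustrative only and do not come from any RHOSAK or Strimzi configuration; the inputs are the figures quoted in this ticket.

```python
def operator_cpus_per_cluster(deployments_per_bundle: int,
                              replicas: int,
                              cpus_per_replica: float) -> float:
    """Total CPU assigned to Strimzi cluster operator containers in one cluster.

    Hypothetical helper: it only multiplies the three factors named in the ticket.
    """
    return deployments_per_bundle * replicas * cpus_per_replica


# Today: 2 Strimzi deployments per bundle (old/new), 1 replica each, 3 CPUs each.
current_cpus = operator_cpus_per_cluster(2, 1, 3)

# With 2 replicas for high availability and the CPU assignment left unchanged.
proposed_cpus = operator_cpus_per_cluster(2, 2, 3)

print(f"current: {current_cpus} cpus, with 2 replicas: {proposed_cpus} cpus")
```

Doubling the replicas doubles the per-cluster figure from 6 to 12 cpus, which is why the per-container assignment needs to come down before the replica count goes up.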
The current CPU limit comes from Red Hat Summit, when we were testing with very large numbers of Kafka instances. Strimzi has a thundering herd problem: reconciliations occur in waves, which gives a spiky CPU usage pattern. cpus=3 was intended to give sufficient CPU to accommodate the reconciliation spike.
To allow us to move forward with podsets, we can probably tune down the CPU limits to be commensurate with current demands.
Longer term: let's work to resolve https://github.com/strimzi/strimzi-kafka-operator/issues/7373, which should allow CPU demands to be reduced further (not part of this JIRA).
WHY
Enablement of Strimzi PodSets in a manner that achieves high availability.
HOW
- Work out what CPU demands will look like under podsets, given the worst case production workload.
- We should look at both development and standard.
-
- https://gitlab.cee.redhat.com/service/app-interface/-/blob/master/data/services/managed-services/cicd/saas/saas-kas-fleet-manager.yaml#L713
- (KW) I'd suggest we restrict the testing to steady state load. By this I mean, let's not consider the case where 100 developer kafka instances get provisioned at once, as this is really unlikely.
- Include a margin of safety
- Use this to inform the CPU limit for the Strimzi container.
- Consider adding an alert for CPU throttling of the Strimzi container.
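One possible way to turn the steps above into a number is sketched below, assuming CPU usage of the Strimzi container has been sampled in millicores under steady-state load. The function name, the 99th percentile choice, the 50% safety margin, the 100m rounding step, and the sample values are all assumptions for illustration, not measurements or agreed figures.

```python
import math


def recommend_cpu_limit(samples_millicores, percentile=0.99,
                        safety_margin=0.5, step=100):
    """Suggest a CPU limit (millicores) from observed usage samples.

    Takes a high percentile of the samples (nearest-rank method) so that
    reconciliation spikes are covered, adds a margin of safety, and rounds
    up to the next multiple of `step`.
    """
    if not samples_millicores:
        raise ValueError("at least one usage sample is required")
    ordered = sorted(samples_millicores)
    # Nearest-rank percentile: the smallest sample with at least
    # `percentile` of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered)))
    peak = ordered[rank - 1]
    with_margin = peak * (1 + safety_margin)
    return math.ceil(with_margin / step) * step


# Hypothetical steady-state samples that include two reconciliation spikes.
example_samples = [220, 250, 240, 900, 260, 230, 870, 245]
example_limit = recommend_cpu_limit(example_samples)
print(f"suggested limit: {example_limit}m")
```

Running the same calculation separately for development and standard instance types would show whether a single limit can serve both.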
DONE
- blocks
-
MGDSTRM-10079 Increase strimzi operator replicas to 2 & enable strimzi leadership election
- Closed
- relates to
-
MGDSTRM-9976 Rationalize strimzi resources
- Closed