
Type: Enhancement
Resolution: Won't Do
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- from-cssre
- unplanned

Story Points: 3
Blocked: False
Blocked Reason: None
Ready: False
Discussed with Team: No
[QE] How to address?: ---
[QE] Why QE missed?: ---


During incident https://issues.redhat.com/browse/OHSS-11275, several SOPs were run and the following issues were observed.

NOTE: This ticket is to:
1. Automate parts of the SOPs wherever possible.
2. Investigate whether the SOPs contain errors that need to be addressed, and provide enhancements.

As you will notice, the following observations lack some detail because of the nature of the SRE incident management process. If an issue comes up while running a SOP, the SRE usually has no time to debug it, because the main focus is resolving the current incident; debugging the SOP would increase MTTR and therefore consume error budget. Please don't hesitate to contact the SRE team for more information.

1. In the SOP https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/blob/main/sops/alerts/partition_under_replicated.asciidoc

The following command hung without returning any error. After approximately 8 minutes I (jcueto) stopped it, because it was taking up time in the incident management process. The hang could be due to the brokers being in an unstable state. Could you please investigate whether the script invoked by this command is resilient in such situations (unstable brokers) and whether it is still reliable enough to use during incident management? Could you also please enhance it if necessary?

KAFKA_CLUSTER=$(oc -n kafka-${KAFKA_ID} get kafka --no-headers | awk '{print $1}')
oc -n kafka-${KAFKA_ID} exec -it statefulset/${KAFKA_CLUSTER}-kafka -c kafka -- env - bin/kafka-topics.sh --bootstrap-server localhost:9096 --describe --under-min-isr-partitions
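
One possible mitigation for the hang, as a minimal sketch rather than a tested fix: wrap the call in a hard deadline so that a stalled command fails fast instead of blocking the responder. `run_with_deadline` is a hypothetical helper name, the 60-second limit in the usage comment is an arbitrary assumption, and the sketch assumes GNU coreutils `timeout` is available where `oc` runs. It only stops the local `oc` process; whatever was started inside the broker pod may keep running.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the SOP): run a command with a hard deadline.
run_with_deadline() {
  local seconds="$1"
  shift
  local status=0
  # GNU coreutils `timeout` exits with status 124 when the deadline is reached.
  timeout "${seconds}" "$@" || status=$?
  if [ "${status}" -eq 124 ]; then
    echo "command timed out after ${seconds}s: $*" >&2
  fi
  return "${status}"
}

# Usage (assumed 60s limit; -it is omitted because the output is only read):
# run_with_deadline 60 oc -n "kafka-${KAFKA_ID}" exec "statefulset/${KAFKA_CLUSTER}-kafka" -c kafka -- \
#   env - bin/kafka-topics.sh --bootstrap-server localhost:9096 --describe --under-min-isr-partitions
```

A status of 124 then tells the SRE immediately that the brokers did not answer in time, which is itself a useful signal during the incident.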

There is also a step where the output of this command must be inspected to determine the min ISR. What happens if the output has more than 50 lines? How can the min ISR be determined properly and accurately? See the actual instance of this below; I think this particular step needs to be automated.

2. https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/blob/main/sops/alerts/partition_under_replicated.asciidoc

The following command returned 1k+ lines, which were almost impossible to analyze while the incident was being handled. This increased the MTTR unnecessarily, which in turn burned error budget. Can we please automate the identification of under-replicated partitions and min ISR?

KAFKA_CLUSTER=$(oc -n kafka-${KAFKA_ID} get kafka --no-headers | awk '{print $1}')
oc -n kafka-${KAFKA_ID} exec -it statefulset/${KAFKA_CLUSTER}-kafka -c kafka -- env - bin/kafka-topics.sh --bootstrap-server localhost:9096 --describe --under-replicated-partitions
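
To illustrate what such automation could look like, here is a rough sketch that condenses the partition lines into one line per topic. `summarize_under_replicated` is a hypothetical name. It assumes the tab-separated `Topic: ... Partition: ... Replicas: ... Isr: ...` layout shown in the SOP example, and it reports the smallest ISR size seen per topic; it does not read `min.insync.replicas`, so comparing against the configured value would still need the topic configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kafka-topics.sh --describe` partition lines on stdin
# and print, per topic, how many partitions are under-replicated and the
# smallest ISR size observed among them.
summarize_under_replicated() {
  awk '
    /Partition:/ && /Isr:/ {
      topic = ""; replicas = ""; isr = ""
      # Look fields up by label instead of by position.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")    topic = $(i + 1)
        if ($i == "Replicas:") replicas = $(i + 1)
        if ($i == "Isr:")      isr = $(i + 1)
      }
      # An empty ISR followed by another label would otherwise be misread.
      if (isr ~ /:$/) isr = ""
      n_rep = split(replicas, r, ",")
      n_isr = split(isr, s, ",")
      if (n_isr < n_rep) {
        count[topic]++
        if (!(topic in min_isr) || n_isr < min_isr[topic]) min_isr[topic] = n_isr
      }
    }
    END {
      for (t in count)
        printf "%s under_replicated=%d smallest_isr=%d\n", t, count[t], min_isr[t]
    }
  ' | sort
}

# Usage: pipe the output of the command above into the helper, for example
# oc ... --describe --under-replicated-partitions | summarize_under_replicated
```

With 1k+ lines of input this reduces the output to one line per affected topic, which is easier to scan during an incident.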

3. https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/blob/main/sops/kafka/restart_brokers_safely.asciidoc#restarting-when-kafka-in-a-degraded-state

When the two previous SOPs failed, I decided to restart the brokers and consulted this SOP. The following step is slow because it relies on manual analysis: an SRE has to read through output that can be a very long list, which increases time-to-recovery during an incident. It also raises the chance of a mistake when identifying the correct leader and in-sync replicas. Can we please automate this step and offload the analysis to the command/script?

Determine if it is safe to restart. A broker can be restarted safely in the following scenarios:

There are no partitions under minimum ISR and no partitions offline (safe to restart any broker). See Strimzi Kafka Grafana dashboard

The broker is not a leader for any partitions

# In this example broker kafka-0 can be restarted as it is not the leader of any partitions (no Leader: 0 present in output)
oc exec -it ${KAFKA_BROKER_POD} -c kafka -- env - bin/kafka-topics.sh --bootstrap-server localhost:9096 --describe
Topic: __redhat_strimzi_canary	TopicId: -ei2yaHZRVaFiqLzCfwh_w	PartitionCount: 3	ReplicationFactor: 3	Configs: min.insync.replicas=2,segment.bytes=16384,retention.ms=600000,message.format.version=2.7-IV2
	Topic: __redhat_strimzi_canary	Partition: 0	Leader: 1	Replicas: 0,1,2	Isr: 0,2,1
	Topic: __redhat_strimzi_canary	Partition: 1	Leader: 1	Replicas: 1,2,0	Isr: 0,1,2
	Topic: __redhat_strimzi_canary	Partition: 2	Leader: 2	Replicas: 2,0,1	Isr: 0,2,1
Every partition for which the broker is a leader has at least one other broker with an in-sync replica

# In this example broker kafka-0 is the leader for Partition: 0. The partition has in-sync replicas on kafka-1 and kafka-2 so kafka-0 can be restarted
oc exec -it ${KAFKA_BROKER_POD} -c kafka -- env - bin/kafka-topics.sh --bootstrap-server localhost:9096 --describe
Topic: __redhat_strimzi_canary	TopicId: -ei2yaHZRVaFiqLzCfwh_w	PartitionCount: 3	ReplicationFactor: 3	Configs: min.insync.replicas=2,segment.bytes=16384,retention.ms=600000,message.format.version=2.7-IV2
	Topic: __redhat_strimzi_canary	Partition: 0	Leader: 0	Replicas: 0,1,2	Isr: 0,2,1
	Topic: __redhat_strimzi_canary	Partition: 1	Leader: 1	Replicas: 1,2,0	Isr: 0,1,2
	Topic: __redhat_strimzi_canary	Partition: 2	Leader: 2	Replicas: 2,0,1	Isr: 0,2,1
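
The leader and ISR rules quoted above could be offloaded to a script along these lines. This is a sketch under assumptions, not the SOP's method: `check_broker_restart` is a hypothetical name, the broker is identified by its numeric ID (0 for kafka-0), and only the two leader-based rules are checked. It does not check the first rule (no partitions under minimum ISR and none offline).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kafka-topics.sh --describe` output on stdin and list
# every partition led by the given broker that has no other in-sync replica.
# Exit status 0 means no such partition was found; 1 means at least one was.
check_broker_restart() {
  local broker_id="$1"
  awk -v broker="${broker_id}" '
    /Partition:/ && /Leader:/ {
      topic = ""; partition = ""; leader = ""; isr = ""
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") partition = $(i + 1)
        if ($i == "Leader:")    leader = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      if (isr ~ /:$/) isr = ""
      # Partitions led by another broker do not block the restart.
      if (leader != broker) next
      others = 0
      n = split(isr, members, ",")
      for (j = 1; j <= n; j++) if (members[j] != broker) others++
      if (others == 0) {
        printf "UNSAFE %s-%s isr=%s\n", topic, partition, isr
        unsafe++
      }
    }
    END { exit (unsafe > 0 ? 1 : 0) }
  '
}

# Usage: oc exec ... -- env - bin/kafka-topics.sh ... --describe | check_broker_restart 0
```

Run against either example above with broker ID 0, the helper prints nothing and exits 0, matching the SOP's conclusion that kafka-0 can be restarted.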

How
1. implement the 3 steps below suggested by Rob
