Task
Resolution: Done
Sprint: EnVision Sprint 41, EnVision Sprint 42, EnVision Sprint 43
From an email from Brandon Squizzato:
A few weeks ago, on Oct 16, there was a Kafka outage in crc-stage related to AWS MSK maintenance. Another maintenance window is coming up, and we may see a reboot of the Kafka cluster again. Here is the plan:
We have submitted a request to push the next scheduled maintenance out to Nov 28, 6pm-11pm EST; this should be a pretty low-impact time window.
However, this is a situation apps will need to handle gracefully, especially once we start using MSK in production. We are working to enable apps to operate properly during an MSK maintenance window; here's how:
1. When AWS reboots the brokers in a cluster, it does so in a rolling fashion (see https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html). We have 3 brokers in our MSK cluster, but Clowder currently gives apps only one broker hostname to connect to. So, we have just merged a PR into Clowder that will cause it to begin passing multiple broker hostnames to apps instead of just one: https://github.com/RedHatInsights/clowder/pull/881
NOTE: the above change should not cause any regression issues because the cdappconfig kafka brokers field is already a list/array type. There is no change to the cdappconfig schema; the only thing that changes is that Clowder may begin listing multiple BrokerConfig items in that array instead of just one (see the first sketch after this list for what reading that array can look like).
2. We are working on rolling out the above change. Once it is released in stage, we will open PRs on a couple of apps this week to modify their Kafka producer/consumer configuration and demonstrate how to connect to multiple brokers instead of one. In theory this is a relatively simple change: every Kafka client library we are aware of already supports connecting to multiple brokers, so apps just need to pass all 3 broker hostnames that Clowder gives them to the client instead of just 1 (see the second sketch after this list). We'll make sure this is as simple as expected and then advertise these PRs as an example that other apps can follow.
3. If all apps can get these changes in before Nov 28, they should be able to handle the MSK maintenance window without issue. If some apps cannot get the changes in by Nov 28, there may be some instability in stage Kafka during the AWS maintenance window, but again, we have set the time window for minimal impact.
4. As teams get their changes merged, we can use this simple spreadsheet to track which apps support multiple brokers: https://docs.google.com/spreadsheets/d/19f_BbiBrXVV9CiLr6OIbzwB4TMz02C-XP_HEo_EMHQU/edit#gid=0
5. In the future we can test that all apps stay operational during maintenance, either by manually rebooting the MSK cluster for testing or by waiting until the next maintenance window.
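
First sketch (for point 1): a minimal Python illustration of reading the broker list out of cdappconfig once Clowder starts listing multiple BrokerConfig items. It assumes the usual Clowder conventions of an ACG_CONFIG environment variable pointing at cdappconfig.json and a kafka.brokers array whose entries carry "hostname" and "port"; the printed hostnames are made up for illustration.

    # Sketch: collect every Kafka broker Clowder provides in cdappconfig.
    # Assumes ACG_CONFIG points at the cdappconfig.json file and that
    # kafka.brokers is an array of BrokerConfig objects with "hostname"
    # and "port" fields.
    import json
    import os

    with open(os.environ["ACG_CONFIG"]) as f:
        cdappconfig = json.load(f)

    # After the Clowder change, this array may hold 3 entries (one per
    # MSK broker) instead of a single entry.
    brokers = cdappconfig["kafka"]["brokers"]
    bootstrap_servers = [f"{b['hostname']}:{b['port']}" for b in brokers]
    print(bootstrap_servers)  # e.g. ['broker-1:9092', 'broker-2:9092', 'broker-3:9092']

Apps that load cdappconfig through Clowder's app-common client libraries get the same broker list from their loaded config object instead of parsing the file directly.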
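
Second sketch (for point 2): the client-side change, i.e. passing every broker hostname to the Kafka client rather than just the first one. The kafka-python library, the topic name, and the consumer group below are illustrative assumptions only; the same idea applies to any client that accepts a list of bootstrap servers.

    # Sketch: configure a producer and consumer against all brokers.
    # kafka-python is used purely as an example client library; the topic
    # and group_id below are hypothetical.
    from kafka import KafkaConsumer, KafkaProducer

    # In a real app this list comes from cdappconfig (see the previous
    # sketch); hard-coded here only to keep the example self-contained.
    bootstrap_servers = [
        "broker-1.example.com:9092",
        "broker-2.example.com:9092",
        "broker-3.example.com:9092",
    ]

    producer = KafkaProducer(bootstrap_servers=bootstrap_servers)
    consumer = KafkaConsumer(
        "platform.example-topic",           # hypothetical topic
        bootstrap_servers=bootstrap_servers,
        group_id="example-app",             # hypothetical consumer group
    )

    # With all brokers listed, the client can keep working against the
    # remaining brokers while AWS reboots them one at a time.

The only thing that changes for apps is the bootstrap servers list: previously they passed the single hostname Clowder gave them; now they pass all of them.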