When the Cluster CA is replaced, the replacement follows a multi-stage process:
- First, trust for the new CA is rolled out to all components of the Kafka cluster.
- Then, the new CA is used to generate new server certificates for all components of the Kafka cluster, and these certificates are rolled out to the components.
- Finally, the old CA is removed and all components of the Kafka cluster are rolled again to remove the trust in it.
In parallel to this, you might have other operands such as Connect, MirrorMaker, etc. TLS in these components is typically configured by pointing it to a secret and a key in that secret. The secret will typically be the <cluster-name>-cluster-ca-cert secret and the key will be ca.crt, which holds the public certificate of the Cluster CA. If the operand runs in a different namespace, it might be the same secret, just copied to that namespace by some tool.
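For illustration, a minimal sketch of how such an operand is typically pointed at the Cluster CA today. The resource name my-connect, the cluster name my-cluster, and the bootstrap address are placeholders assumed for this example, not values taken from a real deployment:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnect
metadata:
  name: my-connect              # placeholder name
spec:
  replicas: 1
  # Placeholder bootstrap address of the TLS listener of "my-cluster"
  bootstrapServers: my-cluster-kafka-bootstrap:9093
  tls:
    trustedCertificates:
      # Secret holding the Cluster CA certificate
      - secretName: my-cluster-cluster-ca-cert
        # Only this single key from the secret is trusted
        certificate: ca.crt
```

Because only the ca.crt key is referenced, the operand trusts exactly one CA certificate at any given time.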
The problem is that ca.crt is already updated in step 1 (see above) to contain the new Cluster CA public certificate. The Strimzi CO (Cluster Operator) will see in the next reconciliation that ca.crt has changed and roll the other operands to use the updated secret. However, at this point the CA replacement is still in its first phase, rolling out trust in the new CA, and the Kafka cluster is still using server certificates signed by the old CA. The other operands, now rolled to trust only the new CA, expect the Kafka cluster to have already completed step 2 in order to connect successfully. So they end up in a crash-looping state.
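The state of the secret during step 1 can be sketched roughly as follows. This is an illustrative assumption rather than an exact dump: the key name ca-old.crt is hypothetical (the key actually used for the old certificate may differ), and the values are placeholders for base64-encoded PEM data:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-cluster-cluster-ca-cert
type: Opaque
data:
  # Already replaced in step 1: this is now the NEW Cluster CA certificate
  ca.crt: "<base64-encoded new CA certificate>"
  # Old Cluster CA certificate, still needed until step 3 completes.
  # Hypothetical key name, used here for illustration only.
  ca-old.crt: "<base64-encoded old CA certificate>"
```

An operand that references only ca.crt picks up the new CA alone, while the Kafka cluster keeps using certificates signed by the old CA until step 2 completes.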
This can probably be addressed in two ways:
- Update ca.crt only once the new CA is rolled out (i.e. after step 2). But this might not really solve the problem, since the other operands such as the Connect servers would then not trust the new server certificates. So they should really follow the same lifecycle that is used to roll out the CA in the first place: first start trusting both the old and the new CA, and after the rollout stop trusting the old CA.
- So maybe a better option is to allow the operands to load all certificates from a single secret, for example by omitting the certificate key:

      trustedCertificates:
        - secretName: my-cluster-cluster-ca-cert

  The operator will configure the operand to load all certificates from the Secret. So by default, the Connect operand will trust only the single ca.crt from the secret. In phase one, when the new CA is added, the CO will roll the other operands such as Connect as well, this time to trust both CAs (new and old). That will allow Connect to keep working through phase 2, when both the new and the old CA are in use. And finally in step 3, when the old CA is removed, the CO will roll Connect again to trust only ca.crt, which by then will be the new CA.
Created by Strimzi#7726.