  AMQ Streams / ENTMQST-4818

[FIPS] Certificate renewal is not working properly on OCP FIPS clusters

    • Type: Bug
    • Priority: Blocker
    • Resolution: Done
    • Affects Version: 2.4.0.GA
    • Fix Version: 2.4.0.GA

      It seems that the forced certificate renewal is not working on OCP clusters with FIPS enabled.
      The renewal is triggered by the strimzi.io/force-replace annotation.
      The resources (Kafka, ZooKeeper, Entity Operator) should perform three rolling updates to renew their certificates. The first two rolls complete without a problem, but the third roll does not finish completely and the CO log contains errors (the attachment contains the full operator log).
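      For reference, a minimal sketch of how this renewal can be triggered programmatically (e.g. from a system test) using the fabric8 Kubernetes client. The namespace, cluster name, and the cluster CA key secret name `my-cluster-cluster-ca` follow the usual Strimzi conventions but are assumptions here, not taken from this issue.

```java
import io.fabric8.kubernetes.api.model.SecretBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ForceCaKeyReplacement {
    public static void main(String[] args) {
        String namespace = "kafka";          // assumption: namespace of the Kafka cluster
        String clusterName = "my-cluster";   // assumption: name of the Kafka custom resource

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Annotating the cluster CA key secret with strimzi.io/force-replace asks the
            // Cluster Operator to replace the CA key and roll Kafka, ZooKeeper and the EO.
            client.secrets()
                .inNamespace(namespace)
                .withName(clusterName + "-cluster-ca")
                .edit(s -> new SecretBuilder(s)
                    .editMetadata()
                        .addToAnnotations("strimzi.io/force-replace", "true")
                    .endMetadata()
                    .build());
        }
    }
}
```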

      The issue was discovered by the `SecurityST#testAutoReplaceAllCaKeysTriggeredByAnno` test.


            Paul Mellor added a comment -

            Docs updated as follows:

            • Note added to the FIPS support section, upstream and downstream (https://github.com/strimzi/strimzi-kafka-operator/pull/8412)
            • Note in the 2.4 Release Notes for the Improved FIPS support enhancement

            Jakub Scholz added a comment -

            The current plan for this issue seems to be:

            • Backport https://issues.redhat.com/browse/ENTMQST-4821 to 2.4.0 (simple fix => easy to backport, reasonable risk)
            • Add warnings to the docs / release notes to increase the memory to at least 512Mi on FIPS-enabled clusters (see the sketch after this list). This should minimize the risk of any of these issues happening.
            • Postpone https://issues.redhat.com/browse/ENTMQST-4822 to 2.5.0 (complicated fix, would need time to develop, hard to backport, and the risk of breaking something else under pressure would be considerable). Increasing the resources when running on FIPS should serve as a mitigation which prevents this from happening (given the bug itself has been present for several releases without being known).
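            As an illustration of that recommendation, a minimal sketch (again using the fabric8 client) of raising the Cluster Operator memory request and limit to 512Mi. The Deployment name `strimzi-cluster-operator` and the namespace are assumptions; on OLM-based installations the memory may need to be changed through the Subscription/CSV instead.

```java
import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.api.model.ResourceRequirementsBuilder;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class BumpOperatorMemory {
    public static void main(String[] args) {
        String namespace = "operators";                      // assumption
        String deploymentName = "strimzi-cluster-operator";  // assumption: CO Deployment name

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Raise the memory request and limit of the first (only) container in the
            // Cluster Operator Deployment to 512Mi, as recommended for FIPS clusters.
            client.apps().deployments()
                .inNamespace(namespace)
                .withName(deploymentName)
                .edit(d -> new DeploymentBuilder(d)
                    .editSpec().editTemplate().editSpec()
                        .editFirstContainer()
                            .withResources(new ResourceRequirementsBuilder()
                                .addToRequests("memory", new Quantity("512Mi"))
                                .addToLimits("memory", new Quantity("512Mi"))
                                .build())
                        .endContainer()
                    .endSpec().endTemplate().endSpec()
                    .build());
        }
    }
}
```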


            Jakub Scholz added a comment -

            After some more analysis:

            • This is not a regression in AMQ Streams 2.4.0; it is present in previous versions as well. However, FIPS mode seems to have higher memory consumption (given there are no FIPS-specific code paths, this seems to be a characteristic of the FIPS-enabled OpenJDK and the modules it uses, such as Sun PKCS11?). This seems to make it fail more often on FIPS (in my tests almost every time), at different points, by running out of memory.
            • There seem to be at least two separate issues in the code, depending on when the CO dies:
              • https://issues.redhat.com/browse/ENTMQST-4821 seems like it might be easier to fix
              • https://issues.redhat.com/browse/ENTMQST-4822 will be more complicated to fix

            Jakub Scholz added a comment -

            There still seem to be gaps in the renewal logic where it does not survive a crash of the operator. For example (a reproduction sketch follows these steps):

            1. Set the force-replace annotations
            2. Wait for the operator to start the initial rolling update
            3. Kill the operator while CaReconciler.rollingUpdateForNewCaKey is rolling the pods => at that point, the CA certificates have the bumped generation, but the leaf secrets (ZooKeepers, brokers, ...) are still the old ones
            4. The new operator sees the new generations on the CA secrets, but does not see the CA renewals in progress. When it gets to roll the ZooKeepers or Kafkas, it does not renew the certificates; it just bumps the generation in the secret and rolls the pods. The pods work reasonably fine at this point, because the old CA public key is still trusted by them. So the roll completes successfully, but the pods are still using server certs signed by the old CA.
            5. In the next reconciliation, CaReconciler.maybeRemoveOldClusterCaCertificates sees that all pods were rolled and deletes the old CA. But because the pods are using server certs signed by the old CA, this breaks them and makes the reconciliation fail.
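            A rough sketch of steps 1-3, again using the fabric8 client. The namespace, cluster name, and the operator pod label `name=strimzi-cluster-operator` are assumptions, and the point at which the operator is killed is approximated with a fixed sleep rather than by watching CaReconciler progress.

```java
import io.fabric8.kubernetes.api.model.SecretBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class KillOperatorDuringCaKeyReplacement {
    public static void main(String[] args) throws InterruptedException {
        String namespace = "kafka";          // assumption
        String clusterName = "my-cluster";   // assumption

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Step 1: request the CA key replacement on the cluster CA key secret.
            client.secrets()
                .inNamespace(namespace)
                .withName(clusterName + "-cluster-ca")
                .edit(s -> new SecretBuilder(s)
                    .editMetadata()
                        .addToAnnotations("strimzi.io/force-replace", "true")
                    .endMetadata()
                    .build());

            // Step 2: crude stand-in for "wait for the initial rolling update to start".
            Thread.sleep(120_000);

            // Step 3: kill the Cluster Operator pod while it is still rolling the pods.
            client.pods()
                .inNamespace(namespace)
                .withLabel("name", "strimzi-cluster-operator")  // assumption: CO pod label
                .delete();
        }
    }
}
```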

            This does not seem to be unique to FIPS in any way. Aside from fixing the gaps, which will be non-trivial, we should probably investigate:

            • Whether this is a regression or whether it was always there
            • Why it shows up more on FIPS => does FIPS use more memory? Do we need to recommend increasing memory to users on FIPS?


              Assignee: Unassigned
              Lukas Kral
              Jakub Stejskal