  AMQ Streams / ENTMQST-4818

[FIPS] Certificate renewal is not working properly on OCP FIPS clusters

    • Type: Bug
    • Priority: Blocker
    • Resolution: Done
    • Affects Version: 2.4.0.GA
    • Fix Version: 2.4.0.GA

      It seems that the forced certificate renewal is not working on OCP clusters with FIPS enabled.
      The renewal is triggered by the strimzi.io/force-replace annotation.
      The resources (Kafka, ZooKeeper, Entity Operator) should perform three rolling updates to renew their certificates. The first two rolls complete without a problem, but the third roll does not finish completely and the CO log contains errors (the attachment contains the full operator log).
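      For reference, a minimal sketch of how this renewal can be triggered programmatically (e.g. from a system test) using the fabric8 Kubernetes client. The namespace, cluster name, and the cluster CA key secret name `my-cluster-cluster-ca` follow the usual Strimzi conventions but are assumptions here, not taken from this issue.

```java
import io.fabric8.kubernetes.api.model.SecretBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class ForceCaKeyReplacement {
    public static void main(String[] args) {
        String namespace = "kafka";          // assumption: namespace of the Kafka cluster
        String clusterName = "my-cluster";   // assumption: name of the Kafka custom resource

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Annotating the cluster CA key secret with strimzi.io/force-replace asks the
            // Cluster Operator to replace the CA key and roll Kafka, ZooKeeper and the EO.
            client.secrets()
                .inNamespace(namespace)
                .withName(clusterName + "-cluster-ca")
                .edit(s -> new SecretBuilder(s)
                    .editMetadata()
                        .addToAnnotations("strimzi.io/force-replace", "true")
                    .endMetadata()
                    .build());
        }
    }
}
```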

      The issue was discovered by the `SecurityST#testAutoReplaceAllCaKeysTriggeredByAnno` test.


            Paul Mellor added a comment -

            Docs updated as follows:

            • Note added to the FIPS support section, upstream and downstream (https://github.com/strimzi/strimzi-kafka-operator/pull/8412)
            • Note in the 2.4 Release Notes for the Improved FIPS support enhancement

            Jakub Scholz added a comment -

            The current plan for this issue seems to be:

            • Backport https://issues.redhat.com/browse/ENTMQST-4821 to 2.4.0 (simple fix => easy to backport, reasonable risk)
            • Add warnings to the docs / release notes to increase the memory to at least 512Mi on FIPS-enabled clusters (see the sketch after this list). This should minimize the risk of any of these issues happening.
            • Postpone https://issues.redhat.com/browse/ENTMQST-4822 to 2.5.0 (complicated fix, would need time to develop, hard to backport, and the risk of breaking something else under pressure would be considerable). Increasing the resources when running on FIPS should serve as a mitigation which prevents this from happening (given the bug itself has been present for several releases without being known).
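            As an illustration of that recommendation, a minimal sketch (again using the fabric8 client) of raising the Cluster Operator memory request and limit to 512Mi. The Deployment name `strimzi-cluster-operator` and the namespace are assumptions; on OLM-based installations the memory may need to be changed through the Subscription/CSV instead.

```java
import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.api.model.ResourceRequirementsBuilder;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class BumpOperatorMemory {
    public static void main(String[] args) {
        String namespace = "operators";                      // assumption
        String deploymentName = "strimzi-cluster-operator";  // assumption: CO Deployment name

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Raise the memory request and limit of the first (only) container in the
            // Cluster Operator Deployment to 512Mi, as recommended for FIPS clusters.
            client.apps().deployments()
                .inNamespace(namespace)
                .withName(deploymentName)
                .edit(d -> new DeploymentBuilder(d)
                    .editSpec().editTemplate().editSpec()
                        .editFirstContainer()
                            .withResources(new ResourceRequirementsBuilder()
                                .addToRequests("memory", new Quantity("512Mi"))
                                .addToLimits("memory", new Quantity("512Mi"))
                                .build())
                        .endContainer()
                    .endSpec().endTemplate().endSpec()
                    .build());
        }
    }
}
```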


            Jakub Scholz added a comment -

            After some more analysis:

            • This is not a regression in AMQ Streams 2.4.0; it is present in previous versions as well. However, FIPS mode seems to have higher memory consumption (given there are no FIPS-specific code paths, this seems to be a characteristic of the FIPS-enabled OpenJDK and the modules it uses, such as Sun PKCS11?). This seems to make it fail more often on FIPS (in my tests almost every time), at different points, by running out of memory.
            • There seem to be at least two separate issues in the code, depending on when the CO dies:
              • https://issues.redhat.com/browse/ENTMQST-4821 seems like it might be easier to fix
              • https://issues.redhat.com/browse/ENTMQST-4822 will be more complicated to fix

            Jakub Scholz added a comment -

            There still seem to be gaps in the renewal logic where it does not survive a crash of the operator. For example (a reproduction sketch follows these steps):

            1. Set the force-replace annotations
            2. Wait for the operator to start the initial rolling update
            3. Kill the operator while CaReconciler.rollingUpdateForNewCaKey is rolling the pods => at that point, the CA certificates have the bumped generation, but the leaf secrets (ZooKeepers, brokers, ...) are still the old ones
            4. The new operator sees the new generations on the CA secrets, but does not see the CA renewals in progress. When it gets to roll the ZooKeepers or Kafkas, it does not renew the certificates; it just bumps the generation in the secret and rolls the pods. The pods work reasonably fine at this point, because the old CA public key is still trusted by them. So the roll completes successfully, but the pods are still using server certs signed by the old CA.
            5. In the next reconciliation, CaReconciler.maybeRemoveOldClusterCaCertificates sees that all pods were rolled and deletes the old CA. But because the pods are using server certs signed by the old CA, this breaks them and makes the reconciliation fail.
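            A rough sketch of steps 1-3, again using the fabric8 client. The namespace, cluster name, and the operator pod label `name=strimzi-cluster-operator` are assumptions, and the point at which the operator is killed is approximated with a fixed sleep rather than by watching CaReconciler progress.

```java
import io.fabric8.kubernetes.api.model.SecretBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class KillOperatorDuringCaKeyReplacement {
    public static void main(String[] args) throws InterruptedException {
        String namespace = "kafka";          // assumption
        String clusterName = "my-cluster";   // assumption

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Step 1: request the CA key replacement on the cluster CA key secret.
            client.secrets()
                .inNamespace(namespace)
                .withName(clusterName + "-cluster-ca")
                .edit(s -> new SecretBuilder(s)
                    .editMetadata()
                        .addToAnnotations("strimzi.io/force-replace", "true")
                    .endMetadata()
                    .build());

            // Step 2: crude stand-in for "wait for the initial rolling update to start".
            Thread.sleep(120_000);

            // Step 3: kill the Cluster Operator pod while it is still rolling the pods.
            client.pods()
                .inNamespace(namespace)
                .withLabel("name", "strimzi-cluster-operator")  // assumption: CO pod label
                .delete();
        }
    }
}
```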

            This does not seem to be unique to FIPS in any way. Aside from fixing the gaps, which will be non-trivial, we should probably investigate:

            • Whether this is a regression or whether it was always there
            • Why it shows up more on FIPS => does FIPS use more memory? Do we need to recommend increasing memory to users on FIPS?


              Assignee: Unassigned
              Lukas Kral
              Jakub Stejskal