  1. OpenShift Bugs
  2. OCPBUGS-19472

Cluster Version Operator does not correctly reconcile SCC resources

    • Critical
    • No
    • 3
    • OTA 242, OTA 243, OTA 244
    • 3
    • Rejected
    • False
    • Previously, the Cluster Version Operator (CVO) did not reconcile SecurityContextConstraints (SCC) resources as expected. The CVO now properly reconciles the Volumes field of SecurityContextConstraints resources towards the state defined in the release image. User modifications to system SCC resources are tolerated.

      Future OCP versions will stop tolerating user modifications of system SCC resources (see OCPBUGS-19465), so the CVO now imposes a minor version update gate (Upgradeable=False) when it detects a user-modified SCC. Users need to make workloads compliant with unmodified system SCC resources before they update to future minor OCP versions. See https://access.redhat.com/solutions/7033949 for more information.
    • Bug Fix
    • Proposed

      This is a clone of issue OCPBUGS-19465 / OCPBUGS-18386. The following is the description of the original issue:

      How reproducible:
      Always
      Steps to Reproduce:
      1. The Kubernetes API introduces a new Pod Template parameter (`ephemeral`).
      2. This parameter is not in the allowed list of the default SCC.
      3. Customers are not allowed to edit the default SCCs, and AFAIK we have no mechanism in place to update the built-in SCCs.
      4. Users of existing clusters cannot use the new parameter without manually creating SCCs and assigning them to service accounts themselves, which is clunky. This is documented in https://access.redhat.com/articles/6967808
      Actual results:
      Users of existing clusters cannot use ephemeral volumes after an upgrade
      Expected results:
      Users of existing clusters can use ephemeral volumes after an upgrade

      Solution for OCP 4.13

      The root issue of the bug is that the CVO was (incorrectly) not reconciling SCC resources to the form specified by the manifest from the release image. This has two consequences:

      1. Changes in SCC resources mandated by new OCP versions were not applied to the cluster during upgrade (this is the culprit of the bug as it was reported)
      2. User modifications of SCC resources were not reverted, allowing users to operate workloads on the clusters depending on the modified SCCs

      The CVO was fixed in development and in 4.14 to reconcile the SCC resources as intended, but because of point (2) above, we do not want to deliver the same strict fix to 4.13 and earlier, to avoid breaking existing workloads that depend on user-modified SCCs. We decided to ship the smallest possible fix for the bug as reported: only reconcile the Volumes field to the union of the items mandated by the manifest and the existing state of the cluster. Do not reconcile any other fields (we haven't shipped updates to these fields' values in all SCCs in 4.12 and 4.13, so skipping them should be safe).
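
The union rule described above can be sketched as follows. This is an illustrative model only, not the CVO's actual Go implementation: the function names are hypothetical, and SCCs are represented as plain dictionaries keyed by their YAML field names (`volumes`, `allowPrivilegeEscalation`).

```python
def reconcile_volumes(existing, required):
    """Return the existing volumes plus any required ones that are missing.

    The existing order is preserved; missing manifest items are appended.
    """
    merged = list(existing)
    for volume in required:
        if volume not in merged:
            merged.append(volume)
    return merged


def reconcile_scc_4_13(existing_scc, manifest_scc):
    """Hypothetical sketch of the narrow 4.13 fix: only `volumes` changes."""
    updated = dict(existing_scc)  # every other field is left untouched
    updated["volumes"] = reconcile_volumes(
        existing_scc.get("volumes", []),
        manifest_scc.get("volumes", []),
    )
    return updated


if __name__ == "__main__":
    # A user removed "secret" and added "hostPath"; the manifest adds "ephemeral".
    in_cluster = {"allowPrivilegeEscalation": True,
                  "volumes": ["configMap", "hostPath"]}
    manifest = {"allowPrivilegeEscalation": False,
                "volumes": ["configMap", "ephemeral", "secret"]}
    print(reconcile_scc_4_13(in_cluster, manifest))
```

Under this model a user-added volume survives, a user-deleted volume comes back, and fields other than `volumes` keep whatever value the cluster currently has.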

      To ensure a safe update to 4.14, where the CVO starts reconciling everything, the 4.13 backport makes the CVO detect any user modifications to SCC resources. If any modification is detected, the CVO flags Upgradeable=False to prevent clusters from upgrading to 4.14 until the modifications are removed (which can include modifying the workloads so that they are able to execute with unmodified system SCCs). The Upgradeable=False message will link to a KB article that describes the resolution: https://access.redhat.com/solutions/7033949.
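
A rough sketch of the detection logic, under stated assumptions: the helper names and the message wording are hypothetical, SCCs are plain dictionaries, the create-only annotation is read from the manifest, and only fields that the manifest sets are compared. The real CVO logic may differ in all of these respects. Volumes missing from the cluster do not count as a modification here, because they are reconciled back.

```python
CREATE_ONLY = "release.openshift.io/create-only"


def is_user_modified(cluster_scc, manifest_scc):
    """Return True if the in-cluster SCC should raise the upgrade guard."""
    annotations = manifest_scc.get("metadata", {}).get("annotations", {})
    if annotations.get(CREATE_ONLY) == "true":
        return False  # create-only SCCs never trigger Upgradeable=False
    for field, wanted in manifest_scc.items():
        if field == "metadata":
            continue
        have = cluster_scc.get(field)
        if field == "volumes":
            # Only extra volumes count; missing ones get reconciled back.
            if set(have or []) - set(wanted or []):
                return True
        elif have != wanted:
            return True
    return False


def upgradeable_message(cluster_sccs, manifest_sccs):
    """Return a guard message naming every modified SCC, or None if clean."""
    modified = sorted(
        name
        for name, manifest in manifest_sccs.items()
        if name in cluster_sccs and is_user_modified(cluster_sccs[name], manifest)
    )
    if not modified:
        return None
    # Hypothetical wording; the real message links to the KB article.
    return "Detected modified SecurityContextConstraints: " + ", ".join(modified)
```

Returning a single message that lists every modified SCC matches the expectation in the testing notes that the guard mentions all affected SCCs by name.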

      Testing notes

      These are the main items I recommend to test:

      1. Modify an SCC which is not annotated with release.openshift.io/create-only: "true", such as restricted-v2
        1. Modify any field other than Volumes. Both before and after the fix, the SCC should not be reconciled back to the form in the manifest. After the fix, the Upgradeable=False guard should show up, mentioning the modified SCC's name
        2. Remove the allowPrivilegeEscalation field. Both before and after the fix, the field will be set to true immediately (=defaulting), no matter what is in the manifest. After the fix, if this field has a false value in the manifest, the Upgradeable=False guard should show up, mentioning the modified SCC's name
        3. Add an item to the Volumes field. Both before and after the fix, the SCC should not be reconciled back to the form in the manifest (the added item should stay there). After the fix, the Upgradeable=False guard should show up, mentioning the modified SCC's name
        4. Delete an item from the Volumes field. Before the fix, the SCC is not reconciled back to the form in the manifest. After the fix, the SCC is reconciled back to the form in the manifest. The Upgradeable=False guard does not show up.
        5. Add an item to the Volumes field and delete another one. Before the fix, the SCC is not reconciled at all. After the fix, the SCC should have the added item, but also the previously deleted item too. The Upgradeable=False guard should show up.
        6. Modify two such SCCs in a way that should flag Upgradeable=False. Make sure the message mentions both afterwards.
      2. While the Upgradeable=False guard is up...
        1. Delete the modified SCC. Before the fix, the SCC is recreated to the form in the manifest. After the fix, the SCC is recreated the same way and the Upgradeable=False guard disappears.
        2. Manually edit the modified SCC and restore it to the form in the manifest. The Upgradeable=False guard disappears.
        3. While two SCCs are modified and mentioned in the Upgradeable=False message, delete one of them. After the fix, it gets recreated without the modification; the Upgradeable=False guard stays up but mentions only one SCC (the one that was not deleted and recreated), not two
        4. Force an upgrade to any recent 4.14 with --force. The upgrade should proceed and finish. After the upgrade, there is no Upgradeable=False guard and the modified SCC is reconciled to the form in the 4.14 payload image used (the modification is removed)
      3. Modify an SCC which does have the release.openshift.io/create-only: "true" annotation, such as restricted. The behavior for these should be identical before and after the fix.
        1. No matter what modifications you do, CVO never modifies the SCCs back to their manifest forms
        2. If you delete the allowPrivilegeEscalation field, it should get defaulted to true.
        3. If you delete such an SCC entirely, it gets recreated to the manifest form
        4. No matter what you do with such SCC, it never causes Upgradeable=False to appear
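
When working through the checks above, it can help to script the "guard is up and mentions these SCCs" assertion. The sketch below assumes the ClusterVersion object has been fetched as JSON (for example with `oc get clusterversion version -o json`) and parsed into a dictionary; the helper names are hypothetical, and the name check is a simple substring match, so `restricted` would also match a message that only mentions `restricted-v2`.

```python
def find_upgradeable_condition(cluster_version):
    """Return the Upgradeable condition from a ClusterVersion dict, or None."""
    for condition in cluster_version.get("status", {}).get("conditions", []):
        if condition.get("type") == "Upgradeable":
            return condition
    return None


def guard_mentions(cluster_version, scc_names):
    """True if Upgradeable=False is set and its message names every SCC."""
    condition = find_upgradeable_condition(cluster_version)
    if condition is None or condition.get("status") != "False":
        return False
    message = condition.get("message", "")
    # Substring match only; see the caveat about overlapping names above.
    return all(name in message for name in scc_names)
```

For a live cluster, the dictionary could be produced by piping the `oc` output into `json.load(sys.stdin)` and passing the result to `guard_mentions` together with the names of the SCCs that were modified in the test step.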

            afri@afri.cz Petr Muller
            openshift-crt-jira-prow OpenShift Prow Bot
            Jian Li Jian Li