OpenShift Container Platform (OCP) Strategy / OCPSTRAT-6

Tokenized Auth Enablement for OLM-managed Operators on Cloud Providers

    Description

      Goal: Managed OpenShift consumers (e.g. ROSA, ARO, OSD) and self-managed OpenShift customers on cloud providers (AWS, Azure or GCP) can rely on layered product operators in the cluster to use short-lived authentication token services (e.g. AWS STS, Azure managed identity, GCP workload identity) to authenticate with the cloud provider's API. The configuration experience for getting an operator to do this is streamlined and standardized across all operators.

      Background: When components on OpenShift communicate with cloud provider APIs, they need to authenticate first. This can be done in two ways: either the workload uses a static, long-lived set of credentials generated specifically for it ahead of time, or the workload is granted permission to assume a role or policy created ahead of time and requests temporary authentication tokens at runtime, which are refreshed regularly. In both cases the credentials are associated with the permissions the workload needs on the cloud provider API. The OpenShift core platform has adopted support for this on AWS with OCPPLAN-5656 and is adding support for it on Azure with OCPBU-8. OLM-managed operators, however, have so far lacked a structured approach to enablement and configuration.

      Benefit for customers: Short-lived token authentication is considered more secure because, if a token leaks accidentally, the window in which it can be exploited is small, and no reconfiguration of the workload is required to obtain a new token. With static, long-lived credentials, by contrast, exploitation can continue over a potentially long period if the leak goes undetected, and rotating the credentials once the leak or exploitation is detected causes administrative overhead. An OpenShift deployment in which both the core platform and optional OLM-managed operators exclusively use short-lived tokens for API authentication is considered part of a strong security posture.

      Why is this important now: Customers increasingly prefer short-lived token authentication for API access in their cloud accounts and are starting to enforce it via policies. A workload or product that doesn't support it typically requires an exception from the customer's InfoSec team. Currently, support for this authentication method is fragmented across the layered product portfolio and the catalog of optional, OLM-managed operators. Only a subset of operators support some form of short-lived token usage, and only for a specific cloud provider. The configuration experience to enable it varies between operators and is typically manual. If left unchanged, this could become an adoption barrier. Additionally, Hypershift-based deployments are going to use short-lived token authentication exclusively and will have no first-class support for operators that do not support it.

      Outcomes:

      • As a customer of OpenShift layered products, I need to be able to fluidly, reliably and consistently install and use OpenShift layered product Kubernetes Operators in my clusters, while leveraging short-lived token authentication throughout my deployment.
      • As a customer of OpenShift on AWS, Azure or GCP, I expect OpenShift as a platform to function as well with tokenized authentication as it does with static, long-lived credentials. I expect the same from the Kubernetes Operators under the Red Hat brand that need to reach cloud APIs: tokenized workflows should be equally integrated and workable.
      • As the managed services teams, including Hypershift, offering a downstream opinionated, supported and managed lifecycle of OpenShift (in the form of ROSA, ARO, OSD on GCP, Hypershift, etc.), we need the OpenShift platform to integrate with core platform operators as natively as possible when clusters use tokenized cloud auth, driving the use of layered products.
      • As the Hypershift team, where the only credential mode for clusters/customers is short-lived token authentication, we need the Red Hat branded Operators that must reach the cloud provider API to be enabled to work with short-lived credentials in a consistent and automated fashion that allows customers to use those operators as easily as possible, driving the use of layered products.

      Current Situation:

      • OLM-managed operators today are unable to request cloud credentials via OpenShift's Credential API when installed on a cluster with short-lived authentication enabled (currently AWS STS only, see OCPBU-8 for Azure related work). The CloudCredentialOperator component that would be used for this currently has no support for OLM-managed operators.
      • On managed OpenShift (ARO, ROSA, OSD), customers who want to enable one of these OLM-managed operators on a managed cluster must register the operator with OCM (OpenShift Cluster Manager) before installing it via OperatorHub or directly on the cluster.
        • Users are unaware of which operators request credentials
        • Users are not warned that operator installation will fail 
        • Users subsequently are unaware why the installation failed
        • Operators timeout waiting for credentials
        • Users should be informed of steps required for a successful installation
      • CloudCredentialOperator (CCO) doesn’t exist in HyperShift as of today
      • Any configuration for short-lived token support for OLM-managed operator installation is currently command-line only

      Execution Plan:

      Some of the workstreams below will run in parallel. Proper product documentation and QE are part of all of them.

      Workstream 1 - CloudCredentialOperator-based flow for OLM-managed operators and AWS STS (OCPBU-559)

      • CCO gets a new mode in which it can reconcile STS credential requests for OLM-managed operators
      • A standardized flow is leveraged to guide users in preparing their AWS IAM policies and roles with permissions that are required for OLM-managed operators 
      • A standardized flow is defined in which users can configure OLM-managed operators to leverage AWS STS
      • An example operator is used to demonstrate the end-to-end functionality
      • This will not be backported

      Workstream 2 - CloudCredentialOperator-based flow for OLM-managed operators and Azure Identity (OCPBU-560)

      • CCO gets a new mode in which it can reconcile Azure Workload Identity credential requests for OLM-managed operators
      • A standardized flow is leveraged to guide users in preparing their Azure IAM policies and roles with permissions that are required for OLM-managed operators 
      • A standardized flow is defined in which users can configure OLM-managed operators to leverage Azure Identity
      • An example operator is used to demonstrate the end-to-end functionality
      • This will not be backported
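      The Azure counterpart relies on workload identity federation, where Azure SDK clients read the identity's client ID, tenant ID and the path of a federated token file from the environment. The sketch below assumes the standard AZURE_CLIENT_ID, AZURE_TENANT_ID and AZURE_FEDERATED_TOKEN_FILE variables used by Azure workload identity; the WorkloadIdentity struct and the default token path are hypothetical and only illustrate what a standardized flow would need to collect from the user.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// WorkloadIdentity holds the values a user prepares for an operator's
// Azure managed identity (hypothetical struct for illustration).
type WorkloadIdentity struct {
	ClientID  string
	TenantID  string
	TokenFile string // path of the projected service account token
}

// envFor returns the environment variables that Azure SDK clients using
// workload identity federation read at runtime.
func envFor(id WorkloadIdentity) (map[string]string, error) {
	if id.ClientID == "" || id.TenantID == "" {
		return nil, errors.New("client ID and tenant ID are required")
	}
	tokenFile := id.TokenFile
	if tokenFile == "" {
		tokenFile = "/var/run/secrets/openshift/serviceaccount/token" // assumed path
	}
	return map[string]string{
		"AZURE_CLIENT_ID":            id.ClientID,
		"AZURE_TENANT_ID":            id.TenantID,
		"AZURE_FEDERATED_TOKEN_FILE": tokenFile,
	}, nil
}

func main() {
	env, err := envFor(WorkloadIdentity{
		ClientID: "00000000-0000-0000-0000-000000000001", // placeholder
		TenantID: "00000000-0000-0000-0000-000000000002", // placeholder
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	keys := make([]string, 0, len(env))
	for k := range env {
		keys = append(keys, k)
	}
	sort.Strings(keys) // print in a stable order
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, env[k])
	}
}
```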

      Workstream 3 - STS enablement for critical OLM-managed operators (OCPBU-563)

      • based on Workstream 1, the following operators will be enabled to support the standard configuration flow for STS:
        • ALB Operator
        • EFS Operator
        • OADP
        • Cluster Logging
      • these operators only support this flow on OCP 4.14 or newer

      Workstream 4 - Azure Identity enablement for critical OLM-managed operators (OCPBU-564)

      • based on Workstream 2, the following operators will be enabled to support the standard configuration flow for Azure Identity:
        • AFS Operator
        • OADP
        • Cluster Logging
      • these operators only support this flow on OCP 4.14 or newer

      Workstream 5 - OCP Console support for short-lived authentication flows (OCPSTRAT-70, OCPSTRAT-961, OCPSTRAT-962)

      • based on Workstreams 1, 2 and 9, the installation and configuration experience for any OLM-managed operator using short-lived token authentication is streamlined using the OCP console
      • the OCP console helps in detecting when the cluster itself is already using short-lived tokens for core functionality
      • the OCP console helps discover operators capable of short-lived token authentication and their IAM permission requirements
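      The two console checks above can be sketched as plain predicates. This minimal example assumes the console treats a cluster as token-based when its credentials mode is Manual and a service account issuer is configured, and that operators advertise support through per-cloud annotations such as features.operators.openshift.io/token-auth-aws. ClusterAuthInfo is a hypothetical summary struct; a real implementation would read these values from the cluster API rather than from literals.

```go
package main

import "fmt"

// ClusterAuthInfo is a hypothetical summary of cluster configuration the
// console could read (e.g. from cluster-scoped credential and auth config).
type ClusterAuthInfo struct {
	Platform             string // "AWS", "Azure" or "GCP"
	CredentialsMode      string // e.g. "Manual"
	ServiceAccountIssuer string // non-empty when an external OIDC issuer is configured
}

// UsesShortLivedTokens reports whether the cluster appears to run its core
// components with short-lived token authentication.
func UsesShortLivedTokens(c ClusterAuthInfo) bool {
	return c.CredentialsMode == "Manual" && c.ServiceAccountIssuer != ""
}

// tokenAuthAnnotation maps a platform to the operator annotation that
// advertises token-auth support (annotation names assumed here).
var tokenAuthAnnotation = map[string]string{
	"AWS":   "features.operators.openshift.io/token-auth-aws",
	"Azure": "features.operators.openshift.io/token-auth-azure",
	"GCP":   "features.operators.openshift.io/token-auth-gcp",
}

// SupportsTokenAuth checks an operator's annotations for the given platform.
func SupportsTokenAuth(platform string, annotations map[string]string) bool {
	key, ok := tokenAuthAnnotation[platform]
	return ok && annotations[key] == "true"
}

func main() {
	cluster := ClusterAuthInfo{
		Platform:             "AWS",
		CredentialsMode:      "Manual",
		ServiceAccountIssuer: "https://oidc.example.com/cluster", // placeholder
	}
	operator := map[string]string{
		"features.operators.openshift.io/token-auth-aws": "true",
	}
	switch {
	case UsesShortLivedTokens(cluster) && SupportsTokenAuth(cluster.Platform, operator):
		fmt.Println("show the token-auth setup form (e.g. ask for a role ARN)")
	case UsesShortLivedTokens(cluster):
		fmt.Println("warn: operator does not advertise token-auth support")
	default:
		fmt.Println("cluster uses long-lived credentials; no extra input needed")
	}
}
```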

      Workstream 6 - Continued STS enablement for OLM-managed operators (OCPBU-568)

      • Short-lived token authentication using AWS STS for: 3Scale, RHODS, RHODA, EFA and ACK operators, ODF, ACM, Ansible Automation Platform
      • these operators only support this flow on OCP 4.14 or newer

      Workstream 7 - Continued Azure Identity enablement for OLM-managed operators (OCPBU-569)

      • Short-lived token authentication using Azure Identity for: 3Scale, RHODS, RHODA, ODF, ACM, Ansible Automation Platform
      • these operators only support this flow on OCP 4.14 or newer

      Workstream 8 - Standardized update flow for OLM-managed operators leveraging short-lived token authentication (OCPBU-570)

      • A standardized flow is defined to warn the user when, during or after an update, the IAM permissions required by an OLM-managed operator configured with short-lived token authentication change
      • this flow is implemented in the OCP console as well as in the OLM APIs
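      Detecting that an update changes an operator's IAM requirements reduces to a set difference between the permissions required by the installed version and by the target version. The sketch below assumes each version's requirements are available as a flat list of permission strings (for example IAM actions); diffPermissions and the example actions are hypothetical. Additions are the case that warrants a warning before the update, since the existing role would lack them; removals only mean the role could be narrowed.

```go
package main

import (
	"fmt"
	"sort"
)

// PermissionDiff describes how an operator's required cloud permissions
// change between two versions.
type PermissionDiff struct {
	Added   []string // needed by the new version but missing from the old one
	Removed []string // needed by the old version only; the role could be narrowed
}

// setOf turns a permission list into a set, dropping duplicates.
func setOf(perms []string) map[string]bool {
	s := make(map[string]bool, len(perms))
	for _, p := range perms {
		s[p] = true
	}
	return s
}

// diffPermissions compares the permissions of the installed version (prev)
// with those of the update target (next).
func diffPermissions(prev, next []string) PermissionDiff {
	before, after := setOf(prev), setOf(next)
	var d PermissionDiff
	for p := range after {
		if !before[p] {
			d.Added = append(d.Added, p)
		}
	}
	for p := range before {
		if !after[p] {
			d.Removed = append(d.Removed, p)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(d.Added)
	sort.Strings(d.Removed)
	return d
}

func main() {
	// Hypothetical IAM actions required by two versions of an operator.
	installed := []string{"s3:GetObject", "s3:PutObject"}
	target := []string{"s3:GetObject", "s3:PutObject", "s3:DeleteObject"}
	d := diffPermissions(installed, target)
	if len(d.Added) > 0 {
		fmt.Println("warning: update requires new permissions:", d.Added)
	}
	if len(d.Removed) > 0 {
		fmt.Println("note: permissions no longer required:", d.Removed)
	}
}
```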

      Workstream 9 - CloudCredentialOperator-based flow for OLM-managed operators and GCP WIF (OCPSTRAT-922)

      • CCO gets a new mode in which it can reconcile a GCP credential request for OLM-managed operators
      • A standardized flow is leveraged to guide users in preparing their GCP IAM policies and roles with permissions that are required for OLM-managed operators 
      • A standardized flow is defined in which users can configure OLM-managed operators to leverage GCP WIF
      • The OCP console will be enabled to support this

      Workstream 10 - GCP WIF enablement for OLM-managed operators

      • Short-lived token authentication using GCP WIF for: 3Scale, RHODS, RHODA, Red Hat Quay, ODF, ACM and Ansible Automation Platform
      • these operators only support this flow on OCP 4.15 or newer

      Workstream 11 - Hypershift-enablement for short-lived token authentication flows with OLM-managed operators (OCPBU-571)

      • Hypershift-based clusters are capable of supporting the same flows for short-lived token authentication for OLM-managed operators as ROSA or self-managed OpenShift on AWS using AWS STS

      Workstream 12 - Native OLM support for Tokenized Operators for a better User Experience (OCPSTRAT-675)

      • provide a UX that requires fewer manual preparation steps by the user and also requires fewer code changes in operators supporting this flow

      Open Questions:

      • Can we expect CCO to come to HyperShift?

      Definition of done:

      1. Main success scenario - high-level user story (using STS as an example)
        1. customer creates a ROSA STS or Hypershift cluster (AWS)
        2. customer wants basic (table-stakes) features such as AWS EFS, OADP or Logging
        3. customer discovers the cluster is in STS mode and the desired operators are STS-capable
        4. customer sees necessary tasks for preparing for the operator in OperatorHub from their cluster
        5. customer prepares AWS IAM/STS roles/policies in anticipation of the Operator they want, using what they get from OperatorHub
        6. customer provides a very minimal set of parameters (AWS ARN of role(s) with policy) on the Operator's OperatorHub page
        7. The cluster automatically sets up the Operator using the provided tokenized credentials, and the Operator functions as expected
        8. Cluster and Operator upgrades are taken into account and automated
        9. The above steps 1-7 should apply similarly for Google Cloud and Microsoft Azure Cloud, with their respective token-based workload identity systems.
      2. Managed OpenShift scenarios - high-level user story
        1. The same as above, but the ROSA CLI would assist with AWS role/policy creation
        2. The same as above, but the oc CLI would assist with cloud role/policy management (per respective cloud provider for the cluster)
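      Step 6 keeps user input down to a role ARN, so the page collecting it can reject malformed values before anything is installed. A minimal validation sketch, assuming IAM role ARNs of the form arn:<partition>:iam::<12-digit account>:role/<path/>name and only the aws, aws-cn and aws-us-gov partitions; the regular expression is deliberately simple and is not an exhaustive implementation of AWS naming rules.

```go
package main

import (
	"fmt"
	"regexp"
)

// roleARNPattern is a simple check for IAM role ARNs of the form
// arn:<partition>:iam::<12-digit account>:role/<optional path/>name.
var roleARNPattern = regexp.MustCompile(`^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/([\w+=,.@/-]+)$`)

// RoleARN holds the parts of a role ARN a setup page might display back.
type RoleARN struct {
	Partition string
	AccountID string
	RoleName  string // may include a path, e.g. "team/efs-operator"
}

// parseRoleARN validates s and splits it into its components.
func parseRoleARN(s string) (RoleARN, error) {
	m := roleARNPattern.FindStringSubmatch(s)
	if m == nil {
		return RoleARN{}, fmt.Errorf("%q is not a valid IAM role ARN", s)
	}
	return RoleARN{Partition: m[1], AccountID: m[2], RoleName: m[3]}, nil
}

func main() {
	for _, in := range []string{
		"arn:aws:iam::123456789012:role/efs-csi-operator", // placeholder account
		"arn:aws:iam::123:role/too-short-account",
	} {
		if r, err := parseRoleARN(in); err != nil {
			fmt.Println("rejected:", err)
		} else {
			fmt.Printf("ok: account=%s role=%s\n", r.AccountID, r.RoleName)
		}
	}
}
```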

      Desired effect:

      • Growth is the acquisition of net new usage of the platform. This can be new workloads not previously able to be supported, new markets not previously considered, or new end users not previously served.
      • Retention is maintaining and expanding existing use of the platform. This can be more effective use of tools, competitive pressures, and ease of use improvements.
      • Both growth and retention are effects of this effort.
        • Customers have strict requirements around using only token-based cloud credential systems for workloads in their cloud accounts, which include OpenShift clusters in all forms.
          • We gain new customers, both those who have waited for token-based authentication and authorization from OpenShift and those who are new to OpenShift and have strict requirements around cloud account access
          • We retain customers going through both cloud-native and hybrid-cloud journeys, all of whom inevitably see security requirements driving them towards token-based authentication and authorization.

            People

              Daniel Messer
              Aaren de Jong
              Scott Dodson
              Aaren de Jong, Andrew Butcher, Brett Tofel, Daniel Geoffroy, Daniel Messer, Eric Fried, James Harrington, Ju Lim, Lance Galletti, Mike Worthington, Scott Dodson, Shreyans Mulkutkar, Victor Kareh, Will Gordon
              Tushar Katarki