Features and capabilities for installing, managing, upgrading and removing Operators via the Operator Lifecycle Manager and OperatorHub

Goal: Provide OpenShift users with an easy-to-understand, repeatable way to add capabilities and features to their OpenShift clusters.

Background on Operators: All OpenShift-layered products are implemented as operators. On top of that, there is an ecosystem of third-party operators from well-known ISVs and the open source community. Operators resemble cloud services running directly on clusters. They extend the basic concepts of container-based applications on Kubernetes with higher-level services that are often required to run these applications: databases, message queues, service meshes, build pipelines, security and network policies. They do so in a way that is native to Kubernetes, by introducing new object types and APIs that can be used with existing tooling like kubectl.

Customer benefit from Operators: Customers get the additional capabilities delivered by an operator in a standardized and highly automated fashion. Many day 2 operations that are typically implemented manually are handled by operators and can be triggered via GitOps by directly manipulating Kubernetes-level objects. Compared to standard packaging and templating technologies, no external tools are needed. Complex orchestration and sequencing steps are automated. OpenShift provides a distinct advantage in using operators over standard templating technologies to bring additional workloads to the cluster: the workloads are offered like a cloud service, independent of the underlying infrastructure and of whether the environment is connected to the internet. Their consumption can therefore easily be standardized from the first to the 100th cluster with a high degree of day 1 and day 2 automation.

Background on Operator Lifecycle Management: While operators offer a powerful way to deliver and consume services like a cloud service, Kubernetes lacks some of the controls required to use them at scale across many clusters and across multiple tenants on a single cluster. Out of the box, Kubernetes does not provide a native way to list and maintain a version matrix of all installed operators. Regular cluster users, who lack the required privileges, get no indication from Kubernetes that an operator is installed and available to them. Installation, permission management and updates are manual and left for the user to assemble. The lifecycle of Custom Resource Definitions (CRDs), a core component of the operator pattern, is especially complex and requires knowledge of how the Kubernetes API server and API schemas work. There is no tooling in the CNCF ecosystem that addresses this. This is why the Operator Lifecycle Manager (OLM) exists.

Customer benefit from Operator Lifecycle Manager: With OLM, customers don't need to be experts in the intricacies of installing, updating, and managing API extensions. These subject areas are abstracted away behind a package management concept that they are likely familiar with from operating system package managers. For cluster administrators, OLM provides easy-to-understand, high-level controls and flows to add, remove and update operators while they are running on the cluster. These activities are highly automated and declarative, because OLM itself is also an operator. OLM prevents admins from putting their cluster into a non-functioning state by avoiding conflicts between operators and data loss when changing CRDs. For cluster users, OLM provides an easy way to find out whether an operator is available to them and how to use it. In short, OLM makes operators more consumable for tenants by lowering the learning curve, and makes it possible for administrators to run many multi-tenant clusters with many operators installed.

Why is this important now:

Over the past three years we have learned a lot about how users ideally want to manage and consume operator-based services on the cluster. The Kubernetes ecosystem has also evolved in the way these operators can extend the cluster. What started as simple object definitions known as Third Party Resources (TPRs), which just introduced a new object name to the cluster without an API spec, versioning or schema validation, evolved into a fully featured API lifecycle system called Custom Resource Definitions (CRDs).

When initially conceived, OLM made some assumptions about TPRs that no longer hold for modern CRDs, for example that TPRs would only have a name but no schema that could create conflicts. When CRDs and schemas were introduced, the community anticipated that CRDs would in turn not remain global entities but become namespaced. These changes never materialized in the upstream Kubernetes community: the Kubernetes control plane is still not namespaced. This is the heritage of OLM's current namespace-scoping model. With today's CRDs, this model allows users to run into situations where they are blocked from installing or updating another operator. It also does not provide sufficient control over operator permissions and access management, because these concepts are entangled with the namespace scoping of the operator. As a result, users are forced to install operators with a namespace scope, which down the line creates the blocking situations mentioned above.
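
The blocking situation described above can be illustrated with a toy model. This is a hedged sketch in Python, not OLM's actual logic: the `Crd` and `Cluster` types, the `install` function, and the operator and CRD names are all hypothetical. It relies only on what the text states, namely that CRDs are cluster-wide while operator installations are namespace-scoped.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Crd:
    """Hypothetical stand-in for a CRD: a cluster-wide name plus a schema label."""
    name: str
    schema: str


@dataclass
class Cluster:
    """Toy cluster: CRDs live in one global map, operators live in namespaces."""
    crds: dict = field(default_factory=dict)       # CRD name -> Crd (global)
    operators: dict = field(default_factory=dict)  # namespace -> operator label


def install(cluster: Cluster, namespace: str, operator: str, crd: Crd) -> bool:
    """Install an operator into a namespace; return False if it is blocked.

    The operator is namespace-scoped, but the CRD it ships is not, so a
    differing schema already registered under the same name blocks the install.
    """
    existing = cluster.crds.get(crd.name)
    if existing is not None and existing.schema != crd.schema:
        return False  # conflict on the shared, cluster-wide CRD
    cluster.crds[crd.name] = crd
    cluster.operators[namespace] = operator
    return True


cluster = Cluster()
first = install(cluster, "team-a", "db-operator v1", Crd("databases.example.com", "v1"))
second = install(cluster, "team-b", "db-operator v2", Crd("databases.example.com", "v2"))
print(first, second)
```

In this sketch the second tenant is blocked even though it installs into its own namespace, because both installations compete for the same global CRD name.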

Also, OLM initially strove to be very opinionated in the way it guides cluster administrators towards updating operators: the newest version should always be applied. In practice, users are usually somewhat behind the latest versions due to their own testing and rollout cycles. OLM lacks first-class support for choosing which versions to install and update to along the update path.
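
To make the idea of choosing a version along the update path concrete, here is a minimal sketch in Python. It does not reflect how OLM resolves updates: the `UPDATE_GRAPH` data, the version numbers, and the `update_path` helper are hypothetical and exist only to show the difference between "always go to the newest version" and "stop at a version the user has validated".

```python
from collections import deque

# Hypothetical update graph: each version maps to the versions it can update to.
UPDATE_GRAPH = {
    "1.0.0": ["1.1.0"],
    "1.1.0": ["1.2.0"],
    "1.2.0": ["2.0.0"],
    "2.0.0": [],
}


def update_path(graph, installed, target):
    """Return the list of versions from `installed` to `target`, or None.

    Uses breadth-first search, so the returned path has the fewest update steps.
    """
    queue = deque([[installed]])
    seen = {installed}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no update path exists, e.g. for a downgrade


# A user on 1.0.0 who has only validated 1.2.0 stops there instead of at 2.0.0.
pinned = update_path(UPDATE_GRAPH, "1.0.0", "1.2.0")
print(pinned)
```

The point of the sketch is that the target version is an explicit input chosen by the user, rather than being implied as the newest entry in the graph.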

Finally, across large fleets of clusters, OLM APIs show some gaps and leaky abstractions. Not all of them are fully declarative yet, which creates problems in GitOps-style provisioning and management workflows. The catalog maintenance workflow that operator authors use to publish new versions of their product is also largely imperative today and requires sophisticated pipelines to release properly.

Combined, these issues lead to problems with the adoption and management of operators in OpenShift, which may prevent customers from realizing the benefits associated with this concept. We need to address these issues.

Execution Plan: In order to close the above gaps, align OLM's model with the current capabilities of CRDs, and improve the UX for cluster admins, users and operator authors, two major initiatives have been formed:

conceive a completely new version, OLM v1, with a redesigned lifecycle model and APIs (PRD)
conceive a new way of maintaining operator catalogs declaratively: File-Based Catalogs

This ticket serves as a high-level tracker of all activities associated with delivering the two initiatives above.

is related to

OCPSTRAT-268 [GA release] OLM 1.0 Behavioral requirements - Simplified API control surface