• Product / Portfolio Work
    • OCPSTRAT-2841 OpenShift.Next Applied/Agentic AI Experience

      Feature Overview (aka. Goal Summary)  

      An elevator pitch (value statement) that describes the Feature in a clear, concise way.  Complete during New status.

      The MCP Lifecycle Operator provides a robust, declarative, Custom Resource Definition (CRD)-based API on Kubernetes to manage the full lifecycle of Model Context Protocol (MCP) Servers. It automates deployment and version upgrades and ensures high availability through safe rollout strategies, replacing custom scripts and Helm charts for this critical AI/ML infrastructure component.

      Goals (aka. expected user outcomes)

      The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.

      • As an SRE / Cluster Administrator, I want a reliable and automated way to manage (CRUD) and update MCP Servers.
      • As a Platform Engineer, I want to deploy, monitor, and manage the lifecycle of production MCP Servers.
      • The SRE / Cluster Administrator / Platform Engineer can define, deploy, and manage an MCP Server using a single, declarative MCPServer resource instead of writing multiple complex YAML manifests.
      • The SRE / Cluster Administrator / Platform Engineer expects that new versions of MCP Servers can be safely rolled out using automated strategies like rolling updates.
      • The SRE / Cluster Administrator / Platform Engineer will have clear status and health reporting for all MCP Server deployments.
      • A controller automatically handles complex integrations such as setting up network policies, registering with an MCP Gateway, and creating Prometheus service monitors.
      • The project will support Code Mode by integrating with the Agent Sandbox project to provide MCP Servers with network-isolated code sandboxes for running LLM-generated code.
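
      To illustrate the "single declarative resource instead of multiple manifests" goal, here is a minimal Python sketch of how one MCPServer spec might be expanded into the child objects a controller would own. The spec fields (image, replicas, port, metrics) and the function name expand_mcp_server are hypothetical and are not taken from the proposal; the assumption that metrics are served on the same port is also illustrative only.

```python
from typing import Any


def expand_mcp_server(name: str, spec: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand one hypothetical MCPServer spec into the objects a controller might create.

    All spec field names here are assumptions made for illustration.
    """
    labels = {"app.kubernetes.io/name": name}
    port = spec.get("port", 8080)

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": spec.get("replicas", 1),
            "selector": {"matchLabels": labels},
            "strategy": {"type": "RollingUpdate"},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "mcp-server",
                            "image": spec["image"],
                            "ports": [{"name": "http", "containerPort": port}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "selector": labels,
            "ports": [{"name": "http", "port": port, "targetPort": port}],
        },
    }
    # Only allow ingress traffic to the server port.
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "podSelector": {"matchLabels": labels},
            "policyTypes": ["Ingress"],
            "ingress": [{"ports": [{"protocol": "TCP", "port": port}]}],
        },
    }

    objects = [deployment, service, network_policy]
    if spec.get("metrics", False):
        # Assumes metrics are exposed on the same named Service port.
        objects.append(
            {
                "apiVersion": "monitoring.coreos.com/v1",
                "kind": "ServiceMonitor",
                "metadata": {"name": name, "labels": labels},
                "spec": {
                    "selector": {"matchLabels": labels},
                    "endpoints": [{"port": "http"}],
                },
            }
        )
    return objects
```

      Under these assumptions, a user would supply only the short spec (for example an image and a replica count), and the controller would derive the Deployment, Service, NetworkPolicy, and optional ServiceMonitor from it.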

      Requirements (aka. Acceptance Criteria):

      A list of specific needs or objectives that a feature must deliver in order to be considered complete.  Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc.  Initial completion during Refinement status.

      Functional Objectives:

      • Provide a standardized, CRD-based API (e.g. MCPServer) for defining, deploying, and managing MCP Servers.
      • Implement safe, automated rollout strategies for version upgrades.
      • Provide clear status and health reporting on all deployments.

      Nonfunctional Requirements (Security, Reliability, Maintainability, Focus):

      • Reliability & Robustness: The core reconciliation loop must be based on Kubernetes controller best practices to ensure a robust and reliable process.
      • Security: The operator must set up network policies for the MCP Servers it manages. Code Mode support must ensure safe execution of LLM-generated code via network-isolated sandboxes and explore VM-based security.
      • Maintainability & Community: Established as a simple, community-owned project, e.g. under the kubernetes-sigs organization or similar.
      • Focused Scope: Must remain tightly focused on the lifecycle management of the MCP Server application itself.
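
      The reconciliation requirement above can be sketched in Python as a level-triggered, idempotent comparison of desired and observed state. This is a simplified model under stated assumptions, not the operator's implementation: objects are plain dictionaries keyed by a "kind/name" string, and the function names reconcile and apply_actions are hypothetical.

```python
def reconcile(desired: dict[str, dict], observed: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the actions needed to move observed state to desired state.

    The function only reads its inputs, so calling it repeatedly with the
    same state always yields the same plan (idempotent planning).
    """
    actions: list[tuple[str, str]] = []
    for key in sorted(desired):
        if key not in observed:
            actions.append(("create", key))
        elif observed[key] != desired[key]:
            actions.append(("update", key))
    for key in sorted(observed):
        if key not in desired:
            actions.append(("delete", key))
    return actions


def apply_actions(
    actions: list[tuple[str, str]],
    desired: dict[str, dict],
    observed: dict[str, dict],
) -> dict[str, dict]:
    """Apply a plan to a copy of the observed state and return the new state."""
    new_state = dict(observed)
    for verb, key in actions:
        if verb == "delete":
            new_state.pop(key, None)
        else:  # "create" and "update" both converge on the desired object
            new_state[key] = desired[key]
    return new_state
```

      In this model a second reconcile pass after apply_actions produces an empty plan, which is the property a robust controller loop relies on when events are duplicated or replayed.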

       

      Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT (for the future to-be-supported configuration) as well.

      Deployment considerations | List applicable specific needs (N/A = not applicable)
      Self-managed, managed, or both | Both
      Classic (standalone cluster) | Y
      Hosted control planes | Y
      Multi node, Compact (three node), or Single node (SNO), or all | All
      Connected / Restricted Network | All
      Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All
      Operator compatibility | Y
      Backport needed (list applicable versions) | N/A
      UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD
      Other (please specify) |

      Use Cases (Optional):

      Include use case diagrams, main success scenarios, alternative flow scenarios.  Initial completion during Refinement status.

      • As a Platform Engineer, I want to define and deploy an MCP Server, including its tools, networking, and resource limits, using a single, simplified custom resource definition (CRD) instead of multiple YAML manifests.
      • As a Platform Engineer, I want to safely upgrade a production MCP Server to a new version, relying on the MCP lifecycle operator to handle automated rollout strategies (like rolling updates) to prevent service interruption for dependent LLMs.
      • As a Platform Engineer / SRE / Cluster Administrator, I want to deploy an MCP Server with built-in support for "Code Mode," automatically integrating with projects like Agent Sandbox to provide network-isolated environments for running LLM-generated code.
      • As an SRE / Cluster Administrator, I want centralized status and health reporting with clear insights on all MCP Server deployments within the cluster, enabling reliable monitoring of a critical component of the Applied/Agentic AI infrastructure.
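
      The rolling-update use case can be illustrated with a small Python simulation of how old and new replicas might be exchanged within surge and unavailability budgets. This is a simplified sketch: it assumes every new replica becomes ready immediately, and the function name rolling_update_steps and its parameters are hypothetical rather than part of the operator's API.

```python
def rolling_update_steps(
    replicas: int, max_surge: int = 1, max_unavailable: int = 0
) -> list[tuple[int, int]]:
    """Return the (old, new) replica counts after each scaling step.

    Simplified model: new replicas are assumed ready as soon as they are
    created, so availability equals old + new at every step.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if max_surge < 0 or max_unavailable < 0:
        raise ValueError("budgets must be non-negative")
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be 0")

    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up new replicas without exceeding replicas + max_surge in total.
        scale_up = max(0, min(replicas - new, replicas + max_surge - (old + new)))
        if scale_up:
            new += scale_up
            steps.append((old, new))
        # Scale down old replicas while keeping replicas - max_unavailable running.
        scale_down = max(0, min(old, (old + new) - (replicas - max_unavailable)))
        if scale_down:
            old -= scale_down
            steps.append((old, new))
    return steps
```

      With the default budgets, the total number of replicas never drops below the requested count, which is the behavior that prevents service interruption for dependent LLMs in this simplified model.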

      Questions to Answer (Optional):

      Include a list of refinement / architectural questions that may need to be answered before coding can begin.  Initial completion during Refinement status.

      • What is the strategy for integrating "Code Mode," specifically regarding the exploration and implementation of an "additional layer of security via VMs"?
      • What is the integration plan with MCP Gateway?
      • What are the plans with "AI-related WGs" and the "MCP SDKs community" to ensure the controller is a complementary tool and to facilitate the correlation of code execution requests to a sandbox?
      • Which Operator Lifecycle Tier should the operator align with?
      • MCP Servers lifecycle

      Out of Scope

      High-level list of items that are out of scope.  Initial completion during Refinement status.

      • Building an MCP Gateway: It will not implement a Model Context Protocol (MCP) Gateway or similar functionality. It will focus on integration patterns with existing or proposed networking projects.
      • Tool Implementation: It will not implement the actual tools that the MCP Server exposes to LLMs. Its scope is limited to managing the MCP server's lifecycle.
      • Model Serving Infrastructure: It will not provision or manage LLM serving runtimes, e.g., KServe, Seldon, Triton.
      • Artifact Storage: It will not function as a model registry, MCP registry, or an OCI repository. It consumes container images from existing registries.

      Background

      Provide any additional context that is needed to frame the feature.  Initial completion during Refinement status.

      MCP Lifecycle Operator Proposal

      Customer Considerations

      Provide any additional customer-specific considerations that must be made when designing and delivering the Feature.  Initial completion during Refinement status.

      • Moving beyond custom scripts or Helm charts to provide an automated, production-grade mechanism for managing MCP Servers. Existing methods lack the reliability needed for lifecycle management (deployment, version upgrades, high availability).
      • Implementing Automated Rollouts with strategies like rolling updates to ensure new versions of MCP Servers can be deployed without breaking the capabilities of the LLMs that depend on them. Manual or script-based upgrades carry a high risk of service disruption for LLM applications.
      • A Declarative API via a single CRD abstracts away complex Kubernetes primitives, so the user does not have to write multiple YAML manifests for Deployments, Services, and other CRDs. This avoids excessive boilerplate configuration and operational complexity.
      • Securely enabling the emerging Code Mode pattern by providing code sandboxes with "network isolation" and exploring security via VMs. This mitigates the inherent security risk of running LLM-generated code in a cluster environment.
      • Providing clear and reliable Status and Health reporting on the progress and running version of each deployment, including integration for Prometheus service monitors, so SREs and Platform Engineers can monitor the operational state of the MCP Servers.

      Documentation Considerations

      Provide information that needs to be considered and planned so that documentation will meet customer needs.  If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.

      • This will be included in the Release Notes
      • Documentation should include a guide for deploying the operator and the first sample CRD resource using the provided YAML spec.
      • Documentation should include detailed, step-by-step instructions for performing a safe server version upgrade, demonstrating the automated rollout feature and how to monitor its progress.
      • A complete, authoritative reference for the Custom Resource Definition should be published, detailing every field in the spec (e.g., image, transport, resources), their valid values, and required settings.
      • Documentation should cover the Code Mode (Agent Sandbox) integration, with a guide and configuration examples for enabling and securing the "Code Mode" feature.
      • Instructions for configuring Prometheus service monitors and consuming the operator's metrics for cluster administrators and SREs (where applicable).
      • A troubleshooting guide for diagnosing deployment failures, rollout stalls, and connectivity issues, linking specific status conditions to potential root causes.

      Interoperability Considerations

      Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact?  What interoperability test scenarios should be factored by the layered products?  Initial completion during Refinement status.

      <your text here>

              julim Ju Lim
              julim Ju Lim
              None
              Ali Ok, Gaurav Singh, Gordon Sim, Lukas Berk, Marc Curry, Matthias Wessendorf, Mrunal Patel, Ramona Sidharta, Shane Utt
              Mrunal Patel Mrunal Patel
              None
              Avani Bhatt Avani Bhatt
              Eric Rich Eric Rich