  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-2811

Integrate Model Context Protocol for Agentic AI-driven Ingress and DNS Troubleshooting

    • Product / Portfolio Work
    • 100% To Do, 0% In Progress, 0% Done

      Feature Overview (aka. Goal Summary)  

      This feature introduces Model Context Protocol (MCP) integration into OpenShift Networking to enable agentic, AI-driven troubleshooting for Ingress, Gateway API, and DNS-related failures.

      By exposing structured, real-time networking context (configuration, topology, policies, and telemetry) through MCP servers, OpenShift enables AI agents to analyze a live cluster's networking state, correlate signals across layers, and guide operators through faster root-cause analysis and remediation.

      The goal is to transform OpenShift Ingress and DNS troubleshooting from a manual, log-driven, expert-only process into an assisted, explainable, and repeatable workflow powered by agentic AI, without replacing existing observability tools or requiring Service Mesh adoption.

      Goals (aka. expected user outcomes)

      • Enable AI agents to reason over OpenShift Ingress, Gateway API, and DNS state using MCP as a standardized context interface
      • Reduce mean time to detection (MTTD) and mean time to resolution (MTTR) for networking-related application outages
      • Provide explainable, step-by-step troubleshooting guidance instead of opaque “black-box” recommendations
      • Leverage existing OpenShift Networking components (Ingress Operator, DNS Operator, Gateway API) rather than introducing parallel systems
      • Align with upstream MCP to avoid vendor lock-in and encourage extensibility

      Requirements (aka. Acceptance Criteria):

      Functional Requirements

      • Expose Ingress, Gateway API, and DNS networking context via one or more MCP servers, including:
        • Resource configuration (Ingress, Gateway, HTTPRoute, DNS, Services, Endpoints)
        • Traffic status and failure signals (health checks, routing status, error codes)
      • Support read-only MCP access for troubleshooting and diagnostics (no direct mutation)
      • Enable correlation across layers (DNS → Ingress/Gateway → Service → Pod)
      • Provide RBAC-aware context exposure, ensuring agents only see data permitted to the requesting user
      • Integrate cleanly with existing OpenShift AI / agent frameworks (e.g., OpenShift AI, external MCP-compatible agents)
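
      The read-only and RBAC-aware requirements above can be illustrated with a minimal sketch. It is not an MCP server and does not use any MCP SDK; `ContextProvider`, `Resource`, `READ_ONLY_VERBS`, and the permission map are hypothetical names introduced only for illustration. A real implementation would presumably delegate the permission decision to the cluster (for example through a Kubernetes SubjectAccessReview) instead of an in-memory map.

```python
from dataclasses import dataclass

# Assumption: only non-mutating verbs are ever served to an agent.
READ_ONLY_VERBS = frozenset({"get", "list"})


@dataclass(frozen=True)
class Resource:
    kind: str
    namespace: str
    name: str


class ContextProvider:
    """Serves networking context filtered by what the requesting user may read."""

    def __init__(self, resources, permissions):
        # permissions maps user -> set of (namespace, kind) pairs the user may read.
        # This in-memory map is a stand-in for a real RBAC check.
        self._resources = list(resources)
        self._permissions = permissions

    def handle(self, user, verb, kind):
        # Reject anything that could mutate cluster state.
        if verb not in READ_ONLY_VERBS:
            raise PermissionError(
                f"verb {verb!r} is not allowed: context access is read-only"
            )
        allowed = self._permissions.get(user, set())
        # Return only resources the requesting user is permitted to see.
        return [
            r for r in self._resources
            if r.kind == kind and (r.namespace, r.kind) in allowed
        ]


provider = ContextProvider(
    resources=[
        Resource("HTTPRoute", "shop", "checkout"),
        Resource("HTTPRoute", "billing", "invoices"),
    ],
    permissions={"alice": {("shop", "HTTPRoute")}},
)
visible = provider.handle("alice", "list", "HTTPRoute")
```

      In this sketch the agent acting for `alice` sees only the route in the `shop` namespace, a user with no entry in the map sees nothing, and any verb outside `get` and `list` is refused before data is touched.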

      Non-Functional Requirements

      • Minimal performance overhead on control plane and data plane components
      • Secure, auditable access to MCP endpoints
      • No requirement for Service Mesh, LLM deployment, or external SaaS
      • Compatible with disconnected and regulated environments

       

      Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed.  Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you also provide the OCPSTRAT that tracks future support for that configuration.

      Deployment considerations List applicable specific needs (N/A = not applicable)
      Self-managed, managed, or both  
      Classic (standalone cluster)  
      Hosted control planes  
      Multi node, Compact (three node), or Single node (SNO), or all  
      Connected / Restricted Network  
      Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)  
      Operator compatibility  
      Backport needed (list applicable versions)  
      UI need (e.g. OpenShift Console, dynamic plugin, OCM)  
      Other (please specify)  

      Use Cases:

      **TO BE REFINED**

      1. Ingress Traffic Failure Diagnosis

      Example:  An operator asks an AI agent why external traffic is failing for an application.
      The agent uses MCP to:

      • Inspect Ingress/Gateway configuration
      • Validate DNS records and resolution paths
      • Check backend service endpoints and readiness
      • Identify misconfigurations (TLS mismatch, route conflict, policy block)
      • Explain the failure and recommend corrective actions
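
      The layer-by-layer correlation in this use case (DNS → Ingress/Gateway → Service → Pod) could be sketched roughly as follows. The `snapshot` dictionary and its keys are hypothetical placeholders for context an agent might have gathered through MCP; they are not an existing OpenShift or MCP schema, and the checks are deliberately simplistic.

```python
def diagnose(snapshot):
    """Walk DNS -> Ingress/Gateway -> Service -> Pod and stop at the first failing layer."""
    # Each entry: (layer name, predicate over the snapshot, explanation if it fails).
    checks = [
        ("DNS", lambda s: bool(s["dns_records"]),
         "no DNS record resolves the hostname"),
        ("Ingress/Gateway", lambda s: bool(s["route_admitted"]),
         "route is not admitted by the router or gateway"),
        ("Service", lambda s: bool(s["service_endpoints"]),
         "service has no endpoints"),
        ("Pod", lambda s: any(s["pods_ready"]),
         "no backend pod is ready"),
    ]
    steps = []
    for layer, ok, problem in checks:
        if ok(snapshot):
            steps.append(f"{layer}: ok")
        else:
            # Record the evidence and stop: later layers depend on this one.
            steps.append(f"{layer}: FAILED ({problem})")
            return {"failed_layer": layer, "steps": steps}
    return {"failed_layer": None, "steps": steps}


# Hypothetical snapshot: DNS and routing look fine, but the Service has no endpoints.
result = diagnose({
    "dns_records": ["203.0.113.10"],
    "route_admitted": True,
    "service_endpoints": [],
    "pods_ready": [False],
})
```

      Returning the ordered `steps` list alongside the failing layer keeps the reasoning path visible, which matches the explainability goal stated earlier.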

      2. Intermittent DNS Resolution Issues

      Example:  An application experiences sporadic name resolution failures.
      The agent correlates:

      • DNS Operator state and logs
      • Pod-level DNS configuration
      • Network policy impacts
      • Node-local DNS cache behavior
        and produces a human-readable diagnosis with supporting evidence.

      3. Gateway API Route Debugging

      Example:  A platform engineer investigates why an HTTPRoute is not receiving traffic.
      The agent:

      • Traverses Gateway → Listener → Route → Service bindings
      • Identifies invalid references or unmet constraints
      • Explains why the route is rejected or inactive
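
      As a rough sketch of the traversal described above, the function below walks an HTTPRoute's `parentRefs` and `backendRefs` and reports references that do not resolve. The resources are plain dictionaries shaped like Gateway API and Service objects; `explain_route` and the sample data are hypothetical. The sketch is simplified on purpose: it ignores `allowedRoutes`, hostname matching, and ReferenceGrant checks for cross-namespace references, all of which a real diagnosis would need to consider.

```python
def explain_route(route, gateways, services):
    """Return human-readable problems for one HTTPRoute; an empty list means none found."""
    problems = []
    route_ns = route["metadata"]["namespace"]

    # Gateway -> Listener: every parentRef must resolve to a Gateway and, if a
    # sectionName is given, to a listener with that name.
    for ref in route["spec"].get("parentRefs", []):
        gw_ns, gw_name = ref.get("namespace", route_ns), ref["name"]
        gateway = gateways.get((gw_ns, gw_name))
        if gateway is None:
            problems.append(f"parentRef points at missing Gateway {gw_ns}/{gw_name}")
            continue
        section = ref.get("sectionName")
        listeners = {l["name"] for l in gateway["spec"].get("listeners", [])}
        if section is not None and section not in listeners:
            problems.append(f"Gateway {gw_name} has no listener named {section!r}")

    # Route -> Service: every backendRef must resolve to a Service exposing the port.
    for rule in route["spec"].get("rules", []):
        for backend in rule.get("backendRefs", []):
            svc_ns, svc_name = backend.get("namespace", route_ns), backend["name"]
            service = services.get((svc_ns, svc_name))
            if service is None:
                problems.append(f"backendRef points at missing Service {svc_ns}/{svc_name}")
                continue
            ports = {p["port"] for p in service["spec"].get("ports", [])}
            port = backend.get("port")
            if port not in ports:
                problems.append(f"Service {svc_name} does not expose port {port}")
    return problems


# Hypothetical sample data with two deliberate mistakes.
route = {
    "metadata": {"namespace": "shop", "name": "checkout"},
    "spec": {
        "parentRefs": [{"name": "public-gw", "namespace": "ingress", "sectionName": "https"}],
        "rules": [{"backendRefs": [{"name": "checkout", "port": 8080}]}],
    },
}
gateways = {("ingress", "public-gw"): {"spec": {"listeners": [{"name": "http"}]}}}
services = {("shop", "checkout"): {"spec": {"ports": [{"port": 80}]}}}
problems = explain_route(route, gateways, services)
```

      Each problem string names the exact reference that failed to resolve, so the agent can explain why the route is rejected or inactive instead of only stating that it is.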

      Out of Scope

      •  

      Background

      Troubleshooting OpenShift networking, particularly Ingress, Gateway API, and DNS, requires deep domain knowledge and manual correlation across multiple APIs, operators, logs, and metrics. While OpenShift provides strong observability and diagnostics, the cognitive burden remains high and worsens as clusters grow.

      Model Context Protocol (MCP) provides a standardized way to expose structured system context to AI agents, enabling reasoning over live infrastructure state. By integrating MCP into OpenShift Networking, customers and internal users can enable agentic workflows that understand Kubernetes networking semantics natively, rather than relying on less sophisticated tooling or procedures.

      Customer Considerations

      • Security & Trust:
        Customers must retain full control over what data is exposed to AI agents. MCP endpoints must respect OpenShift RBAC, tenancy, and audit requirements.
      • Explainability:
        Customers expect AI-assisted troubleshooting to be transparent and actionable, not a “magic answer.” Clear reasoning paths and evidence are critical.
      • Operational Simplicity:
        The feature should work out-of-the-box with existing OpenShift installations, without requiring customers to deploy or manage LLM infrastructure.
      • Regulated & Disconnected Environments:
        The solution must function in air-gapped, on-prem, and regulated environments, with no dependency on external AI services.
      • Incremental Adoption:
        Customers should be able to adopt MCP-based troubleshooting alongside existing tools (must-gather, logs, metrics) rather than replacing them.

      Documentation Considerations

      •  

      Interoperability Considerations

      •  

              mcurry@redhat.com Marc Curry
              mcurry@redhat.com Marc Curry
              Ben Bennett
              Aniket Bhat
              Avani Bhatt
              Eric Rich