Type: Feature
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Networking, ocp-mcp-server
Labels:

Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-2841 OpenShift.Next Applied/Agentic AI Experience
Hierarchy Progress Bar:

33% To Do, 67% In Progress, 0% Done
Blocked:
False
Blocked Reason:

None
Ready:
False
Size:
None

Target Version:
None
Release Blocker:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
None
PX Priority Data:
None
PX Impact Score:
PX Technical Impact:
None
PX Impact Range:
None
PX Scheduling Request:
None
PX Technical Impact Notes:
None

Intelligence Requested:
Market:

Feature Overview (aka. Goal Summary)

This feature introduces Model Context Protocol (MCP) integration into OpenShift Networking to enable agentic, AI-driven troubleshooting for Ingress, Gateway API, and DNS-related failures.

By exposing structured, real-time networking context (configuration, topology, policies, and telemetry) through MCP servers, OpenShift enables AI agents to analyze a live cluster's networking state, correlate signals across layers, and guide operators through faster root-cause analysis and remediation.

The goal is to transform OpenShift Ingress and DNS troubleshooting from a manual, log-driven, expert-only process into an assisted, explainable, and repeatable workflow powered by agentic AI, without replacing existing observability tools or requiring Service Mesh adoption.

Goals (aka. expected user outcomes)

Enable AI agents to reason over OpenShift Ingress, Gateway API, and DNS state using MCP as a standardized context interface
Reduce mean time to detection (MTTD) and mean time to resolution (MTTR) for networking-related application outages
Provide explainable, step-by-step troubleshooting guidance instead of opaque “black-box” recommendations
Leverage existing OpenShift Networking components (Ingress Operator, DNS Operator, Gateway API) rather than introducing parallel systems
Align with upstream MCP to avoid vendor lock-in and encourage extensibility

Requirements (aka. Acceptance Criteria):

Functional Requirements

Expose Ingress, Gateway API, and DNS networking context via one or more MCP servers, including:
- Resource configuration (Ingress, Gateway, HTTPRoute, DNS, Services, Endpoints)
- Traffic status and failure signals (health checks, routing status, error codes)
Support read-only MCP access for troubleshooting and diagnostics (no direct mutation)
Enable correlation across layers (DNS → Ingress/Gateway → Service → Pod)
Provide RBAC-aware context exposure, ensuring agents only see data permitted to the requesting user
Integrate cleanly with existing OpenShift AI / agent frameworks (e.g., OpenShift AI, external MCP-compatible agents)
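
The read-only and RBAC-aware requirements above can be sketched as follows. This is a minimal illustration, not a proposed implementation: the tool names, the `Resource` type, and the `allowed_namespaces` set are all hypothetical. In a real server the set of readable namespaces would have to come from the cluster's authorization layer (for example a Kubernetes SubjectAccessReview), which this sketch does not call.

```python
from dataclasses import dataclass

# Hypothetical mapping of read-only tool names to the resource kind they return.
# The actual MCP tool set is not defined by this Feature.
READ_ONLY_TOOLS = {
    "get_ingress": "Ingress",
    "get_gateway": "Gateway",
    "get_httproute": "HTTPRoute",
}


@dataclass(frozen=True)
class Resource:
    kind: str
    namespace: str
    name: str


def handle_tool_call(tool, resources, allowed_namespaces):
    """Serve a read-only tool call, returning only what the user may see."""
    kind = READ_ONLY_TOOLS.get(tool)
    if kind is None:
        # Anything outside the read-only set (e.g. a mutating call) is refused.
        raise PermissionError(f"tool {tool!r} is not exposed: access is read-only")
    return [
        r for r in resources
        if r.kind == kind and r.namespace in allowed_namespaces
    ]


# Example data (assumed): two Ingress objects in different namespaces.
EXAMPLE_RESOURCES = [
    Resource("Ingress", "team-a", "web"),
    Resource("Ingress", "team-b", "api"),
]
```

A user who can read only `team-a` would receive the `web` Ingress and nothing from `team-b`; a call to an unlisted tool raises `PermissionError` instead of reaching the cluster.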

Non-Functional Requirements

Minimal performance overhead on control plane and data plane components
Secure, auditable access to MCP endpoints
No requirement for Service Mesh, LLM deployment, or external SaaS
Compatible with disconnected and regulated environments

Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out of scope for a given release, ensure you provide the OCPSTRAT for the configuration to be supported in the future as well.

Deployment considerations	List applicable specific needs (N/A = not applicable)
Self-managed, managed, or both
Classic (standalone cluster)
Hosted control planes
Multi node, Compact (three node), or Single node (SNO), or all
Connected / Restricted Network
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x)
Operator compatibility
Backport needed (list applicable versions)
UI need (e.g. OpenShift Console, dynamic plugin, OCM)
Other (please specify)

Use Cases:

**TO BE REFINED**

1. Ingress Traffic Failure Diagnosis

Example: An operator asks an AI agent why external traffic is failing for an application.
The agent uses MCP to:

Inspect Ingress/Gateway configuration
Validate DNS records and resolution paths
Check backend service endpoints and readiness
Identify misconfigurations (TLS mismatch, route conflict, policy block)
Explain the failure and recommend corrective actions
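
The layered check in this use case (DNS → Ingress/Gateway → Service → Pod) can be sketched as below. This is an illustration under stated assumptions: the `snapshot` dictionary and its keys are a hypothetical, simplified stand-in for context an MCP server might return, not an actual API shape.

```python
def diagnose(hostname, snapshot):
    """Walk DNS -> Ingress/Gateway -> Service -> Pod.

    Returns a list of (layer, ok, detail) steps and stops at the first
    failing layer, so the result doubles as an evidence trail.
    """
    steps = []

    def record(layer, ok, detail):
        steps.append((layer, ok, detail))
        return ok

    address = snapshot["dns_records"].get(hostname)
    if not record("DNS", address is not None, f"{hostname} -> {address}"):
        return steps

    service = snapshot["routes"].get(hostname)
    if not record("Ingress/Gateway", service is not None,
                  f"{hostname} routed to {service}"):
        return steps

    endpoints = snapshot["endpoints"].get(service, [])
    if not record("Service", bool(endpoints),
                  f"{service} has {len(endpoints)} endpoint(s)"):
        return steps

    ready = [e["pod"] for e in endpoints if e["ready"]]
    record("Pod", bool(ready), f"{len(ready)} of {len(endpoints)} pod(s) ready")
    return steps


# Hypothetical snapshot: DNS and routing are fine, but the only pod is not ready.
EXAMPLE_SNAPSHOT = {
    "dns_records": {"app.example.com": "203.0.113.10"},
    "routes": {"app.example.com": "web-svc"},
    "endpoints": {"web-svc": [{"pod": "web-1", "ready": False}]},
}
```

For the example snapshot, the walk passes the DNS, Ingress/Gateway, and Service layers and fails at the Pod layer, which is the kind of step-by-step explanation the Goals section asks for.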

2. Intermittent DNS Resolution Issues

Example: An application experiences sporadic name resolution failures.
The agent correlates:

DNS Operator state and logs
Pod-level DNS configuration
Network policy impacts
Node-local DNS cache behavior
It then produces a human-readable diagnosis with supporting evidence.
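
One way such a correlation could start is by checking whether failures cluster on particular nodes, which would point toward node-local DNS cache behavior rather than a cluster-wide problem. The sketch below assumes a hypothetical sample format of `(node, lookup_succeeded)` pairs and an arbitrary 20% threshold; neither comes from this Feature.

```python
from collections import defaultdict


def failure_rate_by_node(samples):
    """Map each node to the fraction of its DNS lookups that failed."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for node, ok in samples:
        totals[node] += 1
        if not ok:
            failures[node] += 1
    return {node: failures[node] / totals[node] for node in totals}


def suspect_nodes(samples, threshold=0.2):
    """Return nodes whose failure rate exceeds the (assumed) threshold."""
    rates = failure_rate_by_node(samples)
    return sorted(node for node, rate in rates.items() if rate > threshold)


# Hypothetical lookup samples: (node name, whether the lookup succeeded).
EXAMPLE_SAMPLES = [
    ("node-a", True),
    ("node-a", True),
    ("node-b", False),
    ("node-b", True),
    ("node-a", True),
]
```

With the example samples, `node-b` fails one of two lookups and is flagged, while `node-a` is not. The per-node rates serve as the supporting evidence attached to the diagnosis.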

3. Gateway API Route Debugging

Example: A platform engineer investigates why an HTTPRoute is not receiving traffic.
The agent:

Traverses Gateway → Listener → Route → Service bindings
Identifies invalid references or unmet constraints
Explains why the route is rejected or inactive
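
The traversal above can be sketched as a reference check. The field names `parentRefs`, `sectionName`, `listeners`, `rules`, and `backendRefs` follow the Gateway API, but the objects here are flattened, hypothetical dictionaries (no `spec` nesting, Services reduced to a set of ports), and the check ignores other constraints such as allowed routes, hostnames, and cross-namespace references.

```python
def check_route(route, gateways, services):
    """Return human-readable problems; an empty list means references resolve."""
    problems = []

    # Gateway -> Listener: every parentRef must name an existing Gateway
    # and, if sectionName is set, an existing listener on that Gateway.
    for ref in route["parentRefs"]:
        gateway = gateways.get(ref["name"])
        if gateway is None:
            problems.append(f"parentRef {ref['name']!r}: Gateway not found")
            continue
        listener_names = {listener["name"] for listener in gateway["listeners"]}
        section = ref.get("sectionName")
        if section is not None and section not in listener_names:
            problems.append(
                f"parentRef {ref['name']!r}: no listener named {section!r}"
            )

    # Route -> Service: every backendRef must name a Service exposing the port.
    for rule in route["rules"]:
        for backend in rule["backendRefs"]:
            ports = services.get(backend["name"])
            if ports is None:
                problems.append(f"backendRef {backend['name']!r}: Service not found")
            elif backend["port"] not in ports:
                problems.append(
                    f"backendRef {backend['name']!r}: port {backend['port']} not exposed"
                )
    return problems


# Hypothetical objects: the route targets a listener and a port that do not exist.
EXAMPLE_GATEWAYS = {"edge": {"listeners": [{"name": "https"}]}}
EXAMPLE_SERVICES = {"web": {8080}}
EXAMPLE_ROUTE = {
    "parentRefs": [{"name": "edge", "sectionName": "http"}],
    "rules": [{"backendRefs": [{"name": "web", "port": 80}]}],
}
```

For the example route, the check reports both the missing `http` listener and the unexposed port 80, giving the agent concrete text to use when explaining why the route is inactive.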

Out of Scope

Background

Troubleshooting OpenShift networking, particularly Ingress, Gateway API, and DNS, requires deep domain knowledge and manual correlation across multiple APIs, operators, logs, and metrics. While OpenShift provides strong observability and diagnostics, the cognitive burden remains high and grows with cluster size.

Model Context Protocol (MCP) provides a standardized way to expose structured system context to AI agents, enabling reasoning over live infrastructure state. By integrating MCP into OpenShift Networking, customers and internal users can enable agentic workflows that understand Kubernetes networking semantics natively, rather than relying on less sophisticated tooling or manual procedures.

Customer Considerations

Security & Trust:
Customers must retain full control over what data is exposed to AI agents. MCP endpoints must respect OpenShift RBAC, tenancy, and audit requirements.
Explainability:
Customers expect AI-assisted troubleshooting to be transparent and actionable, not a “magic answer.” Clear reasoning paths and evidence are critical.
Operational Simplicity:
The feature should work out-of-the-box with existing OpenShift installations, without requiring customers to deploy or manage LLM infrastructure.
Regulated & Disconnected Environments:
The solution must function in air-gapped, on-prem, and regulated environments, with no dependency on external AI services.
Incremental Adoption:
Customers should be able to adopt MCP-based troubleshooting alongside existing tools (must-gather, logs, metrics) rather than replacing them.