OpenShift Request For Enhancement / RFE-8562

llm-d: Enable Advanced Traffic Management & DR for llm-d via OpenShift GIE (Rollouts, Failover, A/B)


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Critical
    • service-mesh
    • Product / Portfolio Work

      Summary:

      Enable Advanced Traffic Management & DR for llm-d via OpenShift GIE (Rollouts, Failover, A/B)

      Description:

      The AI Product team requires advanced traffic shaping capabilities for the llm-d inference platform. We request the OpenShift team to configure the Global Ingress Engine (GIE) to support robust operations, moving us away from basic service exposure to a managed traffic architecture.

      The goal is to leverage GIE to achieve four key operational capabilities:

      1. Progressive Rollout Management: Ability to perform Canary deployments (e.g., shift 1%, then 5%, then 20% of traffic) rather than "Big Bang" updates.
      2. Automated Failover: Intelligent detection of unhealthy llm-d pods or clusters with instant traffic rerouting.
      3. A/B Testing Integration: Header-based routing to support experimenting with new model versions against control groups.
      4. Disaster Recovery (DR): A defined GIE policy for multi-region or multi-cluster failover in the event of a total site outage.

      User Story:

      As the llm-d Technical Architect,

      I want the OpenShift GIE layer to manage ingress traffic dynamically based on health metrics, headers, and weight rules,

      So that I can safely deploy new models, test experimental features on a subset of users, and guarantee service continuity during infrastructure outages.

      Acceptance Criteria:

      1. Intelligent Rollout (Canary/Blue-Green)

      • [ ] GIE is configured to support weighted traffic splitting between two distinct llm-d release channels (e.g., stable vs. canary).
      • [ ] Traffic weights can be adjusted dynamically via configuration/API without downtime (e.g., shifting the stable/canary split from 90/10 to 50/50).
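
      To make the intended behavior concrete, the weighted split above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not how GIE implements it: the class name WeightedSplitter and the channel names "stable" and "canary" are hypothetical, and the lock stands in for whatever atomic configuration reload the ingress layer provides.

```python
import random
import threading


class WeightedSplitter:
    """Picks a release channel in proportion to its configured weight."""

    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = {}
        self.set_weights(weights)

    def set_weights(self, weights):
        # Validate first, then swap the whole table under the lock so that
        # no request ever sees a half-updated configuration.
        if not weights or any(w < 0 for w in weights.values()):
            raise ValueError("weights must be non-empty and non-negative")
        if sum(weights.values()) <= 0:
            raise ValueError("at least one weight must be positive")
        with self._lock:
            self._weights = dict(weights)

    def pick(self, rng=random):
        # Copy the current table, then draw one channel by weight.
        with self._lock:
            channels = list(self._weights)
            weights = list(self._weights.values())
        return rng.choices(channels, weights=weights, k=1)[0]
```

      Calling set_weights({"stable": 50, "canary": 50}) on a live splitter changes the ratio for subsequent requests without recreating it, which is the "no downtime" property the criterion asks for.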

      2. A/B Testing (Header-Based Routing)

      • [ ] GIE inspects incoming HTTP/gRPC headers.
      • [ ] Requests containing a specific header (e.g., x-model-variant: experiment-v2) are deterministically routed to a specific backend service, bypassing the default load balancer logic.
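
      The header rule above amounts to an exact-match lookup that runs before the default load balancing. A minimal Python sketch of that decision follows. Only the header x-model-variant: experiment-v2 comes from this ticket; the backend names llm-d-stable and llm-d-experiment-v2 and the function route are assumptions made for illustration.

```python
# Hypothetical service names; only the header name and value are from the ticket.
DEFAULT_BACKEND = "llm-d-stable"
VARIANT_HEADER = "x-model-variant"
VARIANT_BACKENDS = {"experiment-v2": "llm-d-experiment-v2"}


def route(headers, default_picker=lambda: DEFAULT_BACKEND):
    """Return the backend for a request, given its headers as a dict."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    variant = normalised.get(VARIANT_HEADER)
    if variant in VARIANT_BACKENDS:
        # Exact match: deterministic routing that skips the default logic.
        return VARIANT_BACKENDS[variant]
    # No header, or an unknown variant: fall back to normal load balancing.
    return default_picker()
```

      Unknown variant values fall through to the default path in this sketch; whether they should instead be rejected is a policy question the ticket does not settle.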

      3. Automated Failover (Health-Aware)

      • [ ] GIE is integrated with llm-d health probes (Liveness/Readiness).
      • [ ] Upon detecting a 5xx error rate spike or latency degradation > 500ms on the primary backend, GIE automatically reroutes traffic to the healthy standby pool/cluster.
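
      The failover trigger above could be evaluated over a sliding window of recent responses, roughly as in the Python sketch below. The 500 ms latency limit is from this ticket. Everything else is an assumption chosen for illustration: the 5% error-rate threshold, the 100-request window, the 20-sample minimum, the use of the 95th percentile for latency, and the names BackendHealth and choose_pool.

```python
from collections import deque


class BackendHealth:
    """Tracks recent responses from one backend and judges its health."""

    def __init__(self, window=100, max_error_rate=0.05,
                 max_latency_ms=500.0, min_samples=20):
        self.samples = deque(maxlen=window)  # oldest samples drop off
        self.max_error_rate = max_error_rate
        self.max_latency_ms = max_latency_ms
        self.min_samples = min_samples

    def record(self, status_code, latency_ms):
        self.samples.append((status_code, latency_ms))

    def healthy(self):
        if len(self.samples) < self.min_samples:
            return True  # too little data to justify a failover
        errors = sum(1 for status, _ in self.samples if 500 <= status <= 599)
        error_rate = errors / len(self.samples)
        latencies = sorted(latency for _, latency in self.samples)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        return error_rate <= self.max_error_rate and p95 <= self.max_latency_ms


def choose_pool(primary, primary_name="primary", standby_name="standby"):
    """Send traffic to the standby pool while the primary is unhealthy."""
    return primary_name if primary.healthy() else standby_name
```

      The sketch does not include hysteresis, so a backend hovering near a threshold would flap between pools; a real policy would likely need a cool-down before failing back.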

      4. Disaster Recovery (Multi-Site)

      • [ ] Defined Failover Policy: In the event of a "Site A" outage, GIE automatically redirects 100% of global traffic to "Site B" (or the DR cluster).
      • [ ] RTO (Recovery Time Objective) for ingress switching is verified to be under 2 minutes.
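
      One way to reason about the 2-minute RTO before testing it is a simple worst-case budget, sketched below in Python. The 120-second limit is from this ticket. The site names, the function names, and the timing model are assumptions: the model supposes that the outage begins just after a passing probe, that a site is marked down after a fixed number of consecutive probe timeouts, and that the probe timeout is no longer than the probe interval.

```python
RTO_SECONDS = 120  # "under 2 minutes" from the acceptance criteria


def pick_site(site_health, priority=("site-a", "site-b")):
    """Return the first site in priority order that is reported healthy."""
    for site in priority:
        if site_health.get(site, False):
            return site
    raise RuntimeError("no healthy site available")


def worst_case_switch_seconds(probe_interval, probe_timeout,
                              failure_threshold, propagation):
    """Rough upper bound on the time from outage to traffic on the DR site."""
    # The outage starts right after a passing probe. The next
    # `failure_threshold` probes must each fail, and the last one only
    # fails once its timeout expires.
    detection = failure_threshold * probe_interval + probe_timeout
    # The routing change then needs `propagation` seconds to take effect.
    return detection + propagation
```

      For example, a 10-second probe interval, 5-second timeout, 3-failure threshold, and 30 seconds of propagation give 3 * 10 + 5 + 30 = 65 seconds, inside the budget. The same threshold with a 30-second interval, 10-second timeout, and 60 seconds of propagation gives 160 seconds and would miss it. The budget only narrows the choice of settings; the criterion still calls for verification by an actual failover test.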

      Technical Constraints & Notes for OpenShift Team:

      • Protocols: llm-d utilizes both HTTP/REST and gRPC. GIE must support gRPC load balancing and trailer headers.
      • Sticky Sessions: If possible, configure session affinity based on user-id to maximize local KV-cache hit rates on backend nodes (optional, but preferred).
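
      As an illustration of the session-affinity note above, the Python sketch below maps a user-id to a backend with rendezvous (highest-random-weight) hashing. This technique is one possible choice and is not named in the ticket; the function name pick_backend and the pod names in the usage note are hypothetical. Its appeal for KV-cache locality is that removing a backend only remaps the users who were pinned to that backend.

```python
import hashlib


def pick_backend(user_id, backends):
    """Rendezvous hashing: the backend with the highest score wins."""
    if not backends:
        raise ValueError("no backends available")

    def score(backend):
        # Hash the (backend, user) pair; the score is stable across calls
        # and independent of which other backends are in the list.
        digest = hashlib.sha256(f"{backend}|{user_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)
```

      With backends ["pod-a", "pod-b", "pod-c"], repeated calls for the same user-id return the same pod, and dropping a pod the user was not assigned to leaves that user's assignment unchanged.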

       

              jlongmui@redhat.com Jamie Longmuir
              naisingh@redhat.com Naina Singh