OpenShift Container Platform (OCP) Strategy
OCPSTRAT-1740

[GA] Enabling AI Workloads with LeaderWorkerSet (LWS) API in OpenShift


    • Product / Portfolio Work
    • OCPSTRAT-1692: AI Workloads for OpenShift
    • 0% To Do, 0% In Progress, 100% Done
    • Date: 9/2/25
      Status Summary: Green
      GA 9/18 on track

      Feature Summary:
      The LeaderWorkerSet (LWS) API is designed for deploying and managing groups of pods as a unified replication unit, known as a "super pod." This capability is especially suited for AI/ML inference workloads, where large language models (LLMs) and multi-host inference workflows require sharded models across multiple devices and nodes. The LWS API allows OpenShift to manage distributed inference workloads, where a single leader pod coordinates multiple worker pods, enabling streamlined orchestration for complex AI tasks with high compute and memory demands.
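
      For illustration, the sketch below creates a minimal LeaderWorkerSet with the Kubernetes Python dynamic client. The namespace, resource name, images, replica count, and group size are placeholder assumptions; the field layout follows the upstream leaderworkerset.x-k8s.io/v1 API and should be verified against the CRD version shipped with the operator.

      # Minimal sketch, assuming the LWS CRD is already installed in the cluster.
      # All names, images, and sizes are illustrative placeholders.
      from kubernetes import config, dynamic
      from kubernetes.client import api_client

      lws_manifest = {
          "apiVersion": "leaderworkerset.x-k8s.io/v1",
          "kind": "LeaderWorkerSet",
          "metadata": {"name": "sharded-inference", "namespace": "demo-lws"},
          "spec": {
              # Each replica is one "super pod": a leader plus (size - 1) workers.
              "replicas": 2,
              "leaderWorkerTemplate": {
                  "size": 4,  # total pods per group, typically spread across nodes
                  "leaderTemplate": {
                      "spec": {
                          "containers": [
                              {"name": "leader", "image": "example.com/inference-server:latest"}
                          ]
                      }
                  },
                  "workerTemplate": {
                      "spec": {
                          "containers": [
                              {"name": "worker", "image": "example.com/inference-worker:latest"}
                          ]
                      }
                  },
              },
          },
      }

      # Apply the manifest using kubeconfig-based authentication.
      client = dynamic.DynamicClient(api_client.ApiClient(configuration=config.load_kube_config()))
      lws_api = client.resources.get(api_version="leaderworkerset.x-k8s.io/v1", kind="LeaderWorkerSet")
      lws_api.create(body=lws_manifest, namespace="demo-lws")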

      Use Case:
      For AI workloads that require distributed inference—such as LLMs or deep learning models with sharding across devices—LWS provides a structured way to orchestrate model replicas with both leaders and workers in a defined topology. This feature enables OpenShift users to deploy sharded AI workloads where models are divided across multiple nodes, providing the flexibility, scalability, and fault tolerance necessary to process large-scale inference requests efficiently.
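
      A scaling sketch under the same assumptions as the previous example: each LeaderWorkerSet replica is one complete leader+worker group, so patching spec.replicas adds or removes whole sharded model instances rather than individual pods. The resource name and namespace reuse the hypothetical "sharded-inference" example above.

      # Sketch: scale the number of leader+worker groups of the hypothetical
      # "sharded-inference" LeaderWorkerSet created in the previous example.
      from kubernetes import config, dynamic
      from kubernetes.client import api_client

      client = dynamic.DynamicClient(api_client.ApiClient(configuration=config.load_kube_config()))
      lws_api = client.resources.get(api_version="leaderworkerset.x-k8s.io/v1", kind="LeaderWorkerSet")

      # Merge-patch only spec.replicas; each added replica brings up a full
      # leader + worker group, and each removed replica tears one down.
      lws_api.patch(
          name="sharded-inference",
          namespace="demo-lws",
          body={"spec": {"replicas": 3}},
          content_type="application/merge-patch+json",
      )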

      https://github.com/kubernetes-sigs/lws 

      https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/llamacpp 

      https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm/GPU 

       

      Requirements for the operator

        • 1) Disconnected environment support
        • 2) FIPS compliance
        • 3) Multi-arch -> ARM
        • 4) HCP -> ability to run the operator on infra/worker nodes
        • 5) Konflux (build and test pipeline)
        • 6) Ability to deploy the operator in a non-openshift namespace
        • 7) Read-only root filesystem = true (see the sketch after this list)
        • 8) Network policy to prevent leaks (see the comments section; a generic sketch follows this list)
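
      The sketch below illustrates items 7 and 8 in generic form only: a read-only root filesystem setting for the operator container and a default-deny NetworkPolicy for the operator namespace. It does not reproduce the specific policy discussed in the comments; the namespace and names are placeholder assumptions.

      # Hedged sketch for requirements 7 and 8; namespace/names are placeholders.
      from kubernetes import config, dynamic
      from kubernetes.client import api_client

      # Requirement 7: set under containers[*].securityContext in the operator Deployment.
      read_only_root_fs = {"readOnlyRootFilesystem": True}

      # Requirement 8 (generic form): deny all ingress/egress for pods in the operator
      # namespace; explicit allow rules (API server, metrics, webhooks) would be layered on top.
      deny_all_policy = {
          "apiVersion": "networking.k8s.io/v1",
          "kind": "NetworkPolicy",
          "metadata": {"name": "default-deny", "namespace": "lws-operator"},
          "spec": {
              "podSelector": {},  # selects every pod in the namespace
              "policyTypes": ["Ingress", "Egress"],
          },
      }

      client = dynamic.DynamicClient(api_client.ApiClient(configuration=config.load_kube_config()))
      np_api = client.resources.get(api_version="networking.k8s.io/v1", kind="NetworkPolicy")
      np_api.create(body=deny_all_policy, namespace="lws-operator")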

       

      HyperShift ROSA/ARO/OSD requirements (apply to all operators)

       

      1. The operator can run on infra/worker nodes (a hedged install sketch follows this list)
      2. It does not modify MachineConfig
      3. It can be installed in a non-openshift namespace
      4. It is built and tested via Konflux
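
      A hedged sketch of how requirements 1 and 3 could look at install time: an OperatorGroup plus a Subscription in a non-openshift namespace, with Subscription.spec.config used to pin the operator pods to infra nodes. The package name, channel, and namespace are placeholder assumptions and should be checked against the actual catalog entry.

      # Hedged OLM install sketch; package, channel, and namespace are placeholders.
      from kubernetes import config, dynamic
      from kubernetes.client import api_client

      namespace = "lws-operator"  # assumption: a plain, non "openshift-*" namespace

      operator_group = {
          "apiVersion": "operators.coreos.com/v1",
          "kind": "OperatorGroup",
          "metadata": {"name": "lws-operator-group", "namespace": namespace},
          "spec": {"targetNamespaces": [namespace]},
      }

      subscription = {
          "apiVersion": "operators.coreos.com/v1alpha1",
          "kind": "Subscription",
          "metadata": {"name": "lws-operator", "namespace": namespace},
          "spec": {
              "name": "leader-worker-set",        # placeholder package name
              "channel": "stable",                # placeholder channel
              "source": "redhat-operators",
              "sourceNamespace": "openshift-marketplace",
              # Requirement 1: schedule operator pods onto infra nodes.
              "config": {
                  "nodeSelector": {"node-role.kubernetes.io/infra": ""},
                  "tolerations": [
                      {"key": "node-role.kubernetes.io/infra", "operator": "Exists", "effect": "NoSchedule"}
                  ],
              },
          },
      }

      client = dynamic.DynamicClient(api_client.ApiClient(configuration=config.load_kube_config()))
      for manifest in (operator_group, subscription):
          api = client.resources.get(api_version=manifest["apiVersion"], kind=manifest["kind"])
          api.create(body=manifest, namespace=namespace)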

       
