OpenShift Container Platform (OCP) Strategy / OCPSTRAT-1740

Enabling AI Workloads with LeaderWorkerSet (LWS) API in OpenShift


    • Strategic Portfolio Work
    • OCPSTRAT-1692 AI Workloads for OpenShift
    • 25% To Do, 75% In Progress, 0% Done

      Feature Summary:
      The LeaderWorkerSet (LWS) API is designed for deploying and managing a group of pods as a single replication unit, known as a "super pod." This capability is especially suited to AI/ML inference workloads, in which large language models (LLMs) and multi-host inference workflows require models to be sharded across multiple devices and nodes. With the LWS API, OpenShift can manage distributed inference workloads in which a single leader pod coordinates multiple worker pods, streamlining orchestration for complex AI tasks with high compute and memory demands.
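
The leader/worker grouping described above can be sketched as a manifest. The following is a minimal illustration, assuming the upstream `leaderworkerset.x-k8s.io/v1` API from the kubernetes-sigs/lws project; the resource name, container names, and image references are hypothetical placeholders, and the helper functions are illustrative rather than part of any library.

```python
import json


def build_lws_manifest(name, replicas, group_size, leader_image, worker_image):
    """Return a LeaderWorkerSet manifest as a plain dict.

    `replicas` is the number of leader/worker groups ("super pods");
    `group_size` is the number of pods per group, leader included.
    """
    def pod_template(container_name, image):
        # Minimal pod template: a single container, no resources or ports.
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "leaderWorkerTemplate": {
                "size": group_size,
                "leaderTemplate": pod_template("leader", leader_image),
                "workerTemplate": pod_template("worker", worker_image),
            },
        },
    }


def total_pods(manifest):
    """Total pods created: one group of `size` pods per replica."""
    spec = manifest["spec"]
    return spec["replicas"] * spec["leaderWorkerTemplate"]["size"]


# Hypothetical values: 2 model replicas, each sharded over 1 leader + 3 workers.
manifest = build_lws_manifest(
    name="example-inference",
    replicas=2,
    group_size=4,
    leader_image="registry.example.com/inference-leader:latest",
    worker_image="registry.example.com/inference-worker:latest",
)
print(json.dumps(manifest, indent=2))
print("total pods:", total_pods(manifest))
```

Scaling `replicas` adds or removes whole groups, which is what makes the group, rather than the individual pod, the unit of replication.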

      Use Case:
      For AI workloads that require distributed inference, such as LLMs or deep learning models sharded across devices, LWS provides a structured way to orchestrate model replicas, each consisting of a leader and its workers in a defined topology. This feature lets OpenShift users deploy sharded AI workloads in which a model is divided across multiple nodes, providing the flexibility, scalability, and fault tolerance needed to process large-scale inference requests efficiently.

      https://github.com/kubernetes-sigs/lws 

      https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/llamacpp 

      https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm/GPU 

       

      Requirements for the operator:

        • 1) Disconnected environments
        • 2) FIPS
        • 3) Multi-arch (Arm)
        • 4) HCP: ability to run the operator on infra/worker nodes
        • 5) Konflux
        • 6) Ability to deploy the operator in a non-openshift namespace

       

      HyperShift ROSA/ARO/OSD requirements (apply to all operators):

      1. The operator can run on infra/worker nodes.
      2. The operator does not modify MachineConfig.
      3. The operator can be installed in a non *openshift namespace.
      4. The operator is built and tested via Konflux.

       

              Gaurav Singh (gausingh@redhat.com)
              Ying Zhou
              Andrea Hoffer