Type: Feature
Resolution: Done
Priority: Critical
Fix Version/s: openshift-4.20
Affects Version/s: None
Component/s: Applications & Workloads
Labels:

Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-1692 AI Workloads for OpenShift
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Status Summary:

Date: 9/2/25

Status Summary: Green

GA 9/18 on track

Blocked:
False
Blocked Reason:

None

Ready:
False
Size:
None

Target Version:

openshift-4.20
Release Blocker:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
None
PX Impact Score:
PX Technical Impact:
None
PX Impact Range:
None
PX Scheduling Request:
None
PX Technical Impact Notes:
None

Intelligence Requested:
Market:

Feature Summary:
The LeaderWorkerSet (LWS) API deploys and manages a group of pods as a single unit of replication, known as a "super pod." This capability is especially suited to AI/ML inference workloads, where large language models (LLMs) and multi-host inference workflows require a model to be sharded across multiple devices and nodes. With the LWS API, OpenShift can manage distributed inference workloads in which a single leader pod coordinates multiple worker pods, which simplifies orchestration for AI tasks with high compute and memory demands.

Use Case:
For AI workloads that require distributed inference, such as LLMs or deep learning models sharded across devices, LWS provides a structured way to orchestrate model replicas, each made up of a leader and its workers in a defined topology. This feature lets OpenShift users deploy sharded AI workloads in which a model is divided across multiple nodes, providing the flexibility, scalability, and fault tolerance needed to serve large-scale inference requests efficiently.

https://github.com/kubernetes-sigs/lws

https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/llamacpp

https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm/GPU
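
To make the leader/worker grouping described above concrete, here is a minimal sketch that builds a LeaderWorkerSet manifest as a Python dict and prints it as JSON (which `kubectl apply -f -` also accepts). The resource name, the image URL, and the `role` labels are placeholders assumed for illustration, not values from this feature; the field names follow the upstream `leaderworkerset.x-k8s.io/v1` API as I understand it, so check them against the upstream examples linked above before relying on them.

```python
import json


def build_leader_worker_set(name, groups, group_size, image):
    """Return a LeaderWorkerSet manifest as a plain dict.

    groups     -> spec.replicas: number of leader+worker groups ("super pods")
    group_size -> spec.leaderWorkerTemplate.size: pods per group, leader included
    """
    def pod_template(role):
        # Hypothetical minimal pod template; a real workload would add
        # resources (e.g. GPUs), ports, and the inference server command.
        return {
            "metadata": {"labels": {"role": role}},
            "spec": {"containers": [{"name": role, "image": image}]},
        }

    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": groups,
            "leaderWorkerTemplate": {
                "size": group_size,
                "leaderTemplate": pod_template("leader"),
                "workerTemplate": pod_template("worker"),
            },
        },
    }


# Placeholder name and image, assumed for illustration only.
manifest = build_leader_worker_set(
    "inference-demo", groups=2, group_size=4,
    image="registry.example.com/inference:latest",
)
# Each group is one leader plus (size - 1) workers.
total_pods = manifest["spec"]["replicas"] * manifest["spec"]["leaderWorkerTemplate"]["size"]
print(json.dumps(manifest, indent=2))
```

With two groups of size four, this describes eight pods in total: two leaders, each coordinating three workers. Scaling `replicas` adds or removes whole groups rather than individual pods.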

Requirements for the operator

- 1) Disconnected environments
- 2) FIPS
- 3) Multi-arch -> Arm
- 4) HCP -> ability to run the operator on infra/worker nodes
- 5) Konflux
- 6) Ability to deploy the operator in a non-openshift namespace
- 7) Read-only file system = true
- ~~8) Network policy to prevent leak~~ (see the comments section for this)

Hypershift ROSA/ARO/OSD requirements -> for all operators

- The operator can run on infra/worker nodes
- It does not modify the machine config
- It can be installed in a non-openshift namespace
- It is built and tested via Konflux

is related to

OCPSTRAT-1741 Kubernetes JobSet in OpenShift - Tech Preview

Closed

links to

openshift/kubernetes-sigs-jobset#20: <CARRY>: CNTRLPLANE-120: Add Prow and Konflux compatible Dockerfiles

openshift/kubernetes-sigs-jobset#21: <CARRY>: CNTRLPLANE-211: Revert Dockerfile to the upstream original

openshift/kubernetes-sigs-jobset#29: <CARRY>: CNTRLPLANE-211: Add renovate.json to disable go mod auto update PRs

openshift/kubernetes-sigs-lws#30: CARRY>: CNTRLPLANE-120: Add Prow Dockerfile and Update Konflux Dockerfile accordingly

openshift/kubernetes-sigs-lws#31: CNTRLPLANE-211: Add .snyk file to exclude unit tests and vendor directory

openshift/kubernetes-sigs-lws#32: CNTRLPLANE-211: Use rhel9 base image without ocp version tag

openshift/kubernetes-sigs-lws#33: CNTRLPLANE-211: Fix incorrect base image url in Dockerfile.ci

openshift/kubernetes-sigs-lws#36: CNTRLPLANE-211: Sync with upstream

openshift/kubernetes-sigs-lws#38: CNTRLPLANE-211: Realign Dockerfiles with upstream Dockerfile

openshift/kubernetes-sigs-lws#42: CNTRLPLANE-309: Sync downstream with the new changes from upstream

openshift/kubernetes-sigs-lws#43: CNTRLPLANE-309: <CARRY>: Use make build and add controller-gen binary

openshift/kubernetes-sigs-lws#44: Revert "CNTRLPLANE-309: Sync downstream with the new changes from upstream"

openshift/kubernetes-sigs-lws#45: CNTRLPLANE-309: Sync downstream with new

openshift/kubernetes-sigs-lws#46: CNTRLPLANE-234: Get new changes in downstream

openshift/kubernetes-sigs-lws#47: CNTRLPLANE-234: <CARRY> Enable upstream e2e tests by implementing ocp specific e2e-test-ocp.sh

openshift/kubernetes-sigs-lws#51: CNTRLPLANE-309: Bring upstream changes in downstream

openshift/kubernetes-sigs-lws#54: CNTRLPLANE-118: Add new lws-main Konflux application tekton files

openshift/kubernetes-sigs-lws#55: CNTRLPLANE-309: Bring upstream changes in downstream

openshift/kubernetes-sigs-lws#56: CNTRLPLANE-118: Re-enable coverity checks

openshift/lws-operator#5: CNTRLPLANE-115: Implement operator client interfaces

openshift/lws-operator#6: CNTRLPLANE-115: Use operator's namespace instead of hardcoded

openshift/lws-operator#7: CNTRLPLANE-115: Add lease cluster role for leader election

openshift/lws-operator#9: CNTRLPLANE-115: Add initial implementation of LWS Operator

openshift/lws-operator#10: CNTRLPLANE-120: Add verification script for generated manifests

openshift/lws-operator#11: CNTRLPLANE-211: Update Dockerfile.ci to align with Konflux Dockerfile

openshift/lws-operator#12: CNTRLPLANE-211: Modify GOFLAGS to use -mod=readonly instead of -mod=vendor

openshift/lws-operator#15: CNTRLPLANE-329: Introduce e2e testing in LWS Operator

openshift/lws-operator#16: CNTRLPLANE-196: Add bundle image and related manifests configurations

openshift/lws-operator#18: CNTRLPLANE-309: Update generated manifests with the latest LWS

openshift/lws-operator#27: CNTRLPLANE-118: Add tekton configurations

openshift/lws-operator#28: CNTRLPLANE-118: [release-4.19] Add tekton configurations

openshift/lws-operator#29: CNTRLPLANE-118: Update version and channels to 1.0.0 and stable

openshift/lws-operator#30: [release-4.19] CNTRLPLANE-309: Update generated manifests with the latest LWS

openshift/lws-operator#31: [release-4.19] CNTRLPLANE-118: Update version and channels to 1.0.0 and stable

openshift/openshift-docs#98450: OSDOCS#15493: New AI workloads book and LWS docs

openshift/openshift-docs#98846: OSDOCS#15493: Adding LWS to RN list

openshift/openshift-docs#98847: OSDOCS#15493: Adding LWS to RN list

openshift/openshift-docs#98848: OSDOCS#15493: Adding LWS to RN list

openshift/release#60793: CNTRLPLANE-115: Add required configurations for LWS operator

openshift/release#61443: WIP: CNTRLPLANE-211: Add configuration for kubernetes-sigs-lws repository

openshift/release#61493: CNTRLPLANE-211: Use correct rhel9 base image tag

openshift/release#61550: CNTRLPLANE-211: Fix incorrect base image url of lws-operator

openshift/release#61576: CNTRLPLANE-211: Fix base image name

openshift/release#62941: CNTRLPLANE-309: Add verify-ocp execution in verify step in lws

openshift/release#62979: CNTRLPLANE-234: Add e2e testing in kubernetes-sigs-lws repository

openshift/release#63512: WIP: CNTRLPLANE-329: Add e2e-aws-operator test in lws-operator

openshift/release#63993: CNTRLPLANE-329: Add e2e-aws-operand e2e test in lws-operator

(43 links to)

Assignee: Gaurav Singh

Reporter: Gaurav Singh

Need Info From: None

Contributors: None

Architect: Dave Gordon

QA Contact: Wen Wang

Doc Contact: Andrea Hoffer

Product Operations Engineering Contact: Derrick Ornelas

Votes: 1

Watchers: 24

Created: 2024/10/30 5:57 PM

Updated: 2025/10/21 9:51 PM

Resolved: 2025/09/18 7:19 PM

Target end: 2025/07/24