OCPNODE-4051: CRI-O Additional Storage Support (Layer, Artifact, and Image Stores)

    • Project: OpenShift Node
    • Issue Type: Epic
    • Status: In Progress
    • Resolution: Unresolved
    • Priority: Normal
    • Work Type: Product / Portfolio Work
    • Parent Feature: OCPSTRAT-2623 - Additional Artifact Store - 4.22 - TP
    • Epic Progress: 50% To Do, 33% In Progress, 17% Done

      Goal

      Enable configuration of additional storage locations in CRI-O (layer stores, artifact stores, and image stores) to support lazy image pulling, high-performance artifact storage, and shared image caches. This addresses performance problems where large AI/ML workload images cause significant delays in container boot time (image pull operations account for ~70% of container startup time) and inefficient storage utilization across cluster nodes.
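
      A minimal sketch of what such a configuration could look like is shown below. It assumes the proposed additionalLayerStores, additionalArtifactStores, and additionalImageStores fields (named in the Acceptance Criteria) live under spec.containerRuntimeConfig; the exact schema, the store paths, and the pool selector here are illustrative assumptions, not the final API.

        apiVersion: machineconfiguration.openshift.io/v1
        kind: ContainerRuntimeConfig
        metadata:
          name: additional-stores
        spec:
          machineConfigPoolSelector:
            matchLabels:
              pools.operator.machineconfiguration.openshift.io/worker: ""
          containerRuntimeConfig:
            # Lazy-pull layer stores (max 5 entries); the path is assumed to point at a
            # FUSE-backed store such as stargz-store (hypothetical path).
            additionalLayerStores:
              - path: /var/lib/stargz-store/store
            # High-performance artifact stores (max 10 entries); an optional filter field
            # also exists per the acceptance criteria, but its syntax is not settled here.
            additionalArtifactStores:
              - path: /mnt/ssd/artifacts
            # Read-only shared image caches (max 10 entries), e.g. a pre-populated NFS mount.
            additionalImageStores:
              - path: /mnt/images

      The MCO would then render these settings into /etc/containers/storage.conf on nodes in the matching pool, as described under Acceptance Criteria below.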

      Why is this important?

      • Large AI/ML model images (multi-GB) take minutes to pull completely, delaying container startup
      • Containers cannot start until 100% of the image is downloaded, impacting application availability and autoscaling responsiveness
      • Large AI/ML models need to be stored on high-performance storage (SSD) for faster access
      • Multiple cluster nodes redundantly pull identical container images from external registries, wasting network bandwidth
      • CRI-O currently lacks flexibility in storage configuration, preventing use of dedicated storage or shared caches
      • Cannot pre-populate shared caches across cluster nodes for air-gapped or edge deployments
      • Competitors (AWS Fargate/SOCI, AWS ECS) already offer lazy pulling capabilities
      • Root filesystem space consumed by large artifacts and duplicate images that should be on separate or shared storage

      Scenarios

      1. As an AI/ML Platform Operator, I want containers with large model images to start immediately without waiting for full image download, so that my applications are available faster and autoscaling is more responsive
      2. As a RHOAI Platform Operator, I want to store large ML models on SSD storage, so that model loading is faster and doesn't consume root filesystem space
      3. As a Cluster Admin, I want to pre-populate a read-only image cache on shared network storage (NFS), so that multiple nodes can share images without redundant pulls from external registries (a mount sketch follows this list)
      4. As an OpenShift Administrator, I want to configure lazy pulling and storage locations for specific workloads using declarative APIs, so that I can optimize pod startup times without manual node configuration
      5. As a Cluster Admin in an air-gapped environment, I want to pre-populate artifact caches and complete container images on nodes, so that pods can start without pulling from external registries
      6. As an Edge Deployment Operator, I want to deliver artifacts via removable media (USB) and use SSD-backed storage for frequently-used images, so that edge nodes can operate offline with fast container startup
      7. As an Application Developer, I want my containers to start quickly even with large images, so that my development iteration cycle is faster
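
      For scenario 3, one hedged sketch of how a shared, read-only cache could be attached to worker nodes is a MachineConfig that ships a systemd mount unit; the NFS server, export path, and mount point below are placeholders, and the mount point would still need to be referenced from the CRI-O additional image store configuration.

        apiVersion: machineconfiguration.openshift.io/v1
        kind: MachineConfig
        metadata:
          labels:
            machineconfiguration.openshift.io/role: worker
          name: 99-worker-shared-image-store
        spec:
          config:
            ignition:
              version: 3.2.0
            systemd:
              units:
                # The unit name must be the systemd-escaped form of the mount point.
                - name: mnt-images.mount
                  enabled: true
                  contents: |
                    [Unit]
                    Description=Read-only shared image store (example)
                    [Mount]
                    What=nfs.example.com:/exports/images
                    Where=/mnt/images
                    Type=nfs
                    Options=ro,noatime
                    [Install]
                    WantedBy=multi-user.target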

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents
      • ContainerRuntimeConfig API accepts additionalLayerStores, additionalArtifactStores, and additionalImageStores configuration with path-based settings, FUSE filesystem interface, and graceful fallback
        • additionalLayerStores: path field for lazy pulling configuration (max 5 entries)
        • additionalArtifactStores: path and optional filter fields for artifact storage (max 10 entries)
        • additionalImageStores: path field for shared image cache configuration (max 10 entries)
      • MCO generates correct CRI-O configuration files (storage.conf) with all additional storage settings via MachineConfig
        • Single ContainerRuntimeConfig per pool (configurations merged to avoid overrides)
        • MachineConfig applied to matching node pools (requires node reboot)
      • CRI-O resolves image layers, artifacts, and images from additional stores in configured order
      • Lazy pulling works: containers start before full image download (requires registry HTTP range request support)
      • Measurable container startup time improvement for large images (>5GB)
      • RHOAI team validates measurable performance improvement with SSD storage for artifacts
      • Performance validation shows benefits of shared image caches reducing redundant pulls
      • Feature works with shared storage (NFS) and high-performance storage (SSD)
      • No regressions for standard image pulling and artifact storage behavior
      • Works with registries supporting HTTP range requests (Docker Hub, Quay, GitHub Container Registry)
      • Feature behind TechPreviewNoUpgrade feature gate for OpenShift 4.22 (see the FeatureGate example after this list)
      • Clear documentation on setup, compatible storage plugins, customer installation procedures (BYOS approach), and storage configuration strategies
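
      As noted above, the feature is expected to sit behind the TechPreviewNoUpgrade feature set for 4.22. Enabling that set is done through the existing cluster-scoped FeatureGate resource (nothing new to this epic); note that TechPreviewNoUpgrade cannot be undone and blocks minor-version upgrades:

        apiVersion: config.openshift.io/v1
        kind: FeatureGate
        metadata:
          name: cluster
        spec:
          featureSet: TechPreviewNoUpgrade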

      Dependencies (internal and external)

      1. Critical Path: CRI-O v1.36 release (April 2026) with PR #9702 merged (for artifact stores)
      2. Upstream: container-libs/storage - Stabilize Additional Layer Store API (currently experimental, can change without major version bump)
      3. Upstream: container-libs/storage - additionalimagestores feature (already GA and stable)
      4. Upstream: CRI-O PR #9702 (https://github.com/cri-o/cri-o/pull/9702) - adds additional_artifact_stores configuration
      5. OpenShift: Enhancement proposal ✓ MERGED - PR #1934 (https://github.com/openshift/enhancements/pull/1934) - unified proposal covering all three storage types
      6. OpenShift: API merge in openshift/api (blocks MCO implementation)
      7. External: Registry HTTP range request support required for lazy pulling to function
      8. External: RHOAI team (Luca Burgazzoli) for performance validation
      9. Storage: Pre-populated image cache setup procedures for NFS/SSD storage
      10. Documentation: OSDOCS-10167 - Customer documentation for storage plugin installation (stargz-store, nydus-store)
      11. Documentation: OSDOCS-17312 - Tech Preview feature documentation
      12. Backport consideration: If API work lands for 4.22, may need to backport CRI-O v1.36 PR to v1.35

      Previous Work (Optional):

      1. OCPNODE-2204 - Previous attempt at lazy pulling via stargz-snapshotter (Stale/Abandoned)
      2. RHEL-66490 - Related bug: Image IDs inconsistent when using zstd:chunked images
      3. RFE-8441 - Related feature request for artifact pre-loading (separate feature, out of scope)
      4. Upstream stargz-store from containerd/stargz-snapshotter project (proven, mature)
      5. Proven pattern: containers/storage additionalimagestores approach (already GA and stable)

      Open questions:

      1. Can we ship Tech Preview on experimental API? (Additional Layer Store API is currently experimental - can change without major version bump)
      2. Timeline: Will CRI-O v1.36 be available in time for OpenShift 4.22? (April 2026 upstream release) - Do we need to backport PR #9702 to CRI-O v1.35?
      3. Customer support model: Should we provide community container images for storage plugins (stargz-store) to reduce customer burden with BYOS approach?
      4. Validation criteria: What are the specific performance benchmarks we should target with RHOAI validation for SSD storage?

      Done Checklist

      People

              Assignee: Sascha Grunert (sgrunert@redhat.com)
              Reporter: Sascha Grunert (sgrunert@redhat.com)