OCPNODE-4051: CRI-O Additional Storage Support (Layer, Artifact, and Image Stores)

    • Project: OpenShift Node
    • Issue Type: Epic
    • Status: In Progress
    • Resolution: Unresolved
    • Priority: Normal
    • Work Type: Product / Portfolio Work
    • Parent Feature: OCPSTRAT-2623 - Additional Artifact Store - 4.22 - TP
    • Epic Progress: 50% To Do, 33% In Progress, 17% Done

      Goal

      Enable configuration of additional storage locations in CRI-O (layer stores, artifact stores, and image stores) to support lazy image pulling, high-performance artifact storage, and shared image caches. This addresses performance problems where large AI/ML workload images cause significant delays in container boot time (image pull operations account for ~70% of container startup time) and inefficient storage utilization across cluster nodes.
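
      A minimal sketch of what such a configuration could look like is shown below. It assumes the proposed additionalLayerStores, additionalArtifactStores, and additionalImageStores fields (named in the Acceptance Criteria) live under spec.containerRuntimeConfig; the exact schema, the store paths, and the pool selector here are illustrative assumptions, not the final API.

        apiVersion: machineconfiguration.openshift.io/v1
        kind: ContainerRuntimeConfig
        metadata:
          name: additional-stores
        spec:
          machineConfigPoolSelector:
            matchLabels:
              pools.operator.machineconfiguration.openshift.io/worker: ""
          containerRuntimeConfig:
            # Lazy-pull layer stores (max 5 entries); the path is assumed to point at a
            # FUSE-backed store such as stargz-store (hypothetical path).
            additionalLayerStores:
              - path: /var/lib/stargz-store/store
            # High-performance artifact stores (max 10 entries); an optional filter field
            # also exists per the acceptance criteria, but its syntax is not settled here.
            additionalArtifactStores:
              - path: /mnt/ssd/artifacts
            # Read-only shared image caches (max 10 entries), e.g. a pre-populated NFS mount.
            additionalImageStores:
              - path: /mnt/images

      The MCO would then render these settings into /etc/containers/storage.conf on nodes in the matching pool, as described under Acceptance Criteria below.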

      Why is this important?

      • Large AI/ML model images (multi-GB) take minutes to pull completely, delaying container startup
      • Containers cannot start until 100% of the image is downloaded, impacting application availability and autoscaling responsiveness
      • Large AI/ML models need to be stored on high-performance storage (SSD) for faster access
      • Multiple cluster nodes redundantly pull identical container images from external registries, wasting network bandwidth
      • CRI-O currently lacks flexibility in storage configuration, preventing use of dedicated storage or shared caches
      • Cannot pre-populate shared caches across cluster nodes for air-gapped or edge deployments
      • Competitors (AWS Fargate/SOCI, AWS ECS) already offer lazy pulling capabilities
      • Root filesystem space consumed by large artifacts and duplicate images that should be on separate or shared storage

      Scenarios

      1. As an AI/ML Platform Operator, I want containers with large model images to start immediately without waiting for full image download, so that my applications are available faster and autoscaling is more responsive
      2. As a RHOAI Platform Operator, I want to store large ML models on SSD storage, so that model loading is faster and doesn't consume root filesystem space
      3. As a Cluster Admin, I want to pre-populate a read-only image cache on shared network storage (NFS), so that multiple nodes can share images without redundant pulls from external registries (a mount sketch follows this list)
      4. As an OpenShift Administrator, I want to configure lazy pulling and storage locations for specific workloads using declarative APIs, so that I can optimize pod startup times without manual node configuration
      5. As a Cluster Admin in an air-gapped environment, I want to pre-populate artifact caches and complete container images on nodes, so that pods can start without pulling from external registries
      6. As an Edge Deployment Operator, I want to deliver artifacts via removable media (USB) and use SSD-backed storage for frequently-used images, so that edge nodes can operate offline with fast container startup
      7. As an Application Developer, I want my containers to start quickly even with large images, so that my development iteration cycle is faster
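
      For scenario 3, one hedged sketch of how a shared, read-only cache could be attached to worker nodes is a MachineConfig that ships a systemd mount unit; the NFS server, export path, and mount point below are placeholders, and the mount point would still need to be referenced from the CRI-O additional image store configuration.

        apiVersion: machineconfiguration.openshift.io/v1
        kind: MachineConfig
        metadata:
          labels:
            machineconfiguration.openshift.io/role: worker
          name: 99-worker-shared-image-store
        spec:
          config:
            ignition:
              version: 3.2.0
            systemd:
              units:
                # The unit name must be the systemd-escaped form of the mount point.
                - name: mnt-images.mount
                  enabled: true
                  contents: |
                    [Unit]
                    Description=Read-only shared image store (example)
                    [Mount]
                    What=nfs.example.com:/exports/images
                    Where=/mnt/images
                    Type=nfs
                    Options=ro,noatime
                    [Install]
                    WantedBy=multi-user.target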

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents
      • ContainerRuntimeConfig API accepts additionalLayerStores, additionalArtifactStores, and additionalImageStores configuration with path-based settings, FUSE filesystem interface, and graceful fallback
        • additionalLayerStores: path field for lazy pulling configuration (max 5 entries)
        • additionalArtifactStores: path and optional filter fields for artifact storage (max 10 entries)
        • additionalImageStores: path field for shared image cache configuration (max 10 entries)
      • MCO generates correct CRI-O configuration files (storage.conf) with all additional storage settings via MachineConfig
        • Single ContainerRuntimeConfig per pool (configurations merged to avoid overrides)
        • MachineConfig applied to matching node pools (requires node reboot)
      • CRI-O resolves image layers, artifacts, and images from additional stores in configured order
      • Lazy pulling works: containers start before full image download (requires registry HTTP range request support)
      • Measurable container startup time improvement for large images (>5GB)
      • RHOAI team validates measurable performance improvement with SSD storage for artifacts
      • Performance validation shows benefits of shared image caches reducing redundant pulls
      • Feature works with shared storage (NFS) and high-performance storage (SSD)
      • No regressions for standard image pulling and artifact storage behavior
      • Works with registries supporting HTTP range requests (Docker Hub, Quay, GitHub Container Registry)
      • Feature behind TechPreviewNoUpgrade feature gate for OpenShift 4.22 (see the FeatureGate example after this list)
      • Clear documentation on setup, compatible storage plugins, customer installation procedures (BYOS approach), and storage configuration strategies
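
      As noted above, the feature is expected to sit behind the TechPreviewNoUpgrade feature set for 4.22. Enabling that set is done through the existing cluster-scoped FeatureGate resource (nothing new to this epic); note that TechPreviewNoUpgrade cannot be undone and blocks minor-version upgrades:

        apiVersion: config.openshift.io/v1
        kind: FeatureGate
        metadata:
          name: cluster
        spec:
          featureSet: TechPreviewNoUpgrade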

      Dependencies (internal and external)

      1. Critical Path: CRI-O v1.36 release (April 2026) with PR #9702 merged (for artifact stores)
      2. Upstream: container-libs/storage - Stabilize Additional Layer Store API (currently experimental, can change without major version bump)
      3. Upstream: container-libs/storage - additionalimagestores feature (already GA and stable)
      4. Upstream: CRI-O PR #9702 (https://github.com/cri-o/cri-o/pull/9702) - adds additional_artifact_stores configuration
      5. OpenShift: Enhancement proposal ✓ MERGED - PR #1934 (https://github.com/openshift/enhancements/pull/1934) - unified proposal covering all three storage types
      6. OpenShift: API merge in openshift/api (blocks MCO implementation)
      7. External: Registry HTTP range request support required for lazy pulling to function
      8. External: RHOAI team (Luca Burgazzoli) for performance validation
      9. Storage: Pre-populated image cache setup procedures for NFS/SSD storage
      10. Documentation: OSDOCS-10167 - Customer documentation for storage plugin installation (stargz-store, nydus-store)
      11. Documentation: OSDOCS-17312 - Tech Preview feature documentation
      12. Backport consideration: If API work lands for 4.22, may need to backport CRI-O v1.36 PR to v1.35

      Previous Work (Optional):

      1. OCPNODE-2204 - Previous attempt at lazy pulling via stargz-snapshotter (Stale/Abandoned)
      2. RHEL-66490 - Related bug: Image IDs inconsistent when using zstd:chunked images
      3. RFE-8441 - Related feature request for artifact pre-loading (separate feature, out of scope)
      4. Upstream stargz-store from containerd/stargz-snapshotter project (proven, mature)
      5. Proven pattern: containers/storage additionalimagestores approach (already GA and stable)

      Open questions:

      1. Can we ship Tech Preview on experimental API? (Additional Layer Store API is currently experimental - can change without major version bump)
      2. Timeline: Will CRI-O v1.36 be available in time for OpenShift 4.22? (April 2026 upstream release) - Do we need to backport PR #9702 to CRI-O v1.35?
      3. Customer support model: Should we provide community container images for storage plugins (stargz-store) to reduce customer burden with BYOS approach?
      4. Validation criteria: What are the specific performance benchmarks we should target with RHOAI validation for SSD storage?

      Done Checklist

      People

              Assignee: Sascha Grunert (sgrunert@redhat.com)
              Reporter: Sascha Grunert (sgrunert@redhat.com)