
    • Product / Portfolio Work
    • 100% To Do, 0% In Progress, 0% Done
    • Tech Preview

      Feature Overview (a.k.a. Goal Summary)

      Enable users to choose where OCI volume images are pulled from by allowing additional artifact storage locations to be configured in CRI-O.

      Target: OpenShift 4.22 (Tech Preview)
      Primary Use Case: RHOAI - SSD-backed storage for large ML models
      Upstream Foundation: CRI-O v1.36 (April 2026) - PR #9702


      Problem Statement

      Large AI/ML models need to be stored on high-performance storage (SSD), separate from OS and container storage. CRI-O currently stores all artifacts under the hardcoded path /var/lib/containers/storage/artifacts/, which prevents:

      • Using dedicated high-speed storage for frequently accessed artifacts
      • Pre-populating shared artifact caches across cluster nodes
      • Separating large artifacts from root filesystem space

      Impact: Slow pod startup times for AI/ML workloads, inefficient storage utilization.


      Solution Overview

      Enable configuration of additional artifact stores in CRI-O through the OpenShift API. Users can:

      1. Configure multiple read-only artifact storage locations at node level
      2. Specify artifact stores backed by different storage media (SSD, NFS)
      3. Use OpenShift APIs to manage artifact storage declaratively
      4. Pre-populate artifact caches for faster pod initialization using tools like Podman

      Pattern: Follows the proven additionalimagestores approach from containers/storage.


      Implementation Components

      1. Upstream CRI-O (Foundation)

      • PR #9702 in progress for CRI-O v1.36 (April 2026)
      • Adds additional_artifact_stores configuration field
      • Provides read-only additional stores support
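
To make the read-only lookup behavior concrete, here is a minimal sketch of first-match resolution across an ordered list of stores. It is not CRI-O's implementation: the function name resolve_artifact, the /mnt/ssd/artifacts path, and the idea that an artifact is simply an entry named after it inside the store directory are all assumptions made for illustration. Where the default store sits in the search order is left to the caller, since this ticket does not specify it.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_artifact(name: str, stores: Iterable[Path]) -> Optional[Path]:
    """Return the path of `name` in the first store that contains it, else None.

    Assumption: an artifact is present if `<store>/<name>` exists. The real
    on-disk layout is defined by CRI-O and is not modeled here.
    """
    for store in stores:
        candidate = store / name
        # Read-only lookup: nothing is ever written to any store.
        if candidate.exists():
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical ordering: an SSD-backed store, then the default location.
    search_order = [
        Path("/mnt/ssd/artifacts"),
        Path("/var/lib/containers/storage/artifacts"),
    ]
    print(resolve_artifact("example-model", search_order))
```

Because the first match wins, the order in which stores are listed determines which copy of an artifact is used when several stores contain it.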

      2. OpenShift Enhancement Proposal

      • Repository: openshift/enhancements
      • Deliverable: Approved enhancement document

      3. OpenShift API Extension

      • Repository: openshift/api
      • Extend ContainerRuntimeConfig with AdditionalArtifactStores field
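
As a rough illustration of what the extended resource and its validation might look like, the sketch below models a ContainerRuntimeConfig as a Python dictionary and checks the new field. The field shape (a list of absolute path strings), the validation rules, the function name, and the example paths are assumptions; the actual API design is subject to the enhancement proposal.

```python
from pathlib import PurePosixPath
from typing import List


def validate_additional_artifact_stores(paths: List[str]) -> List[str]:
    """Return human-readable validation errors; an empty list means valid.

    Assumed rules (not confirmed by the API design): every entry must be an
    absolute path, and no path may be listed twice.
    """
    errors: List[str] = []
    seen = set()
    for index, raw in enumerate(paths):
        path = PurePosixPath(raw)
        if not path.is_absolute():
            errors.append(
                f"additionalArtifactStores[{index}]: {raw!r} is not an absolute path"
            )
        normalized = str(path)  # drops a trailing slash, e.g. "/mnt/ssd/" -> "/mnt/ssd"
        if normalized in seen:
            errors.append(
                f"additionalArtifactStores[{index}]: {raw!r} is listed more than once"
            )
        seen.add(normalized)
    return errors


# Hypothetical resource; only the field name additionalArtifactStores comes from this ticket.
example_config = {
    "apiVersion": "machineconfiguration.openshift.io/v1",
    "kind": "ContainerRuntimeConfig",
    "metadata": {"name": "artifact-stores-example"},
    "spec": {
        "machineConfigPoolSelector": {
            "matchLabels": {"pools.operator.machineconfiguration.openshift.io/worker": ""}
        },
        "containerRuntimeConfig": {
            "additionalArtifactStores": ["/mnt/ssd/artifacts", "/mnt/nfs/artifacts"],
        },
    },
}

if __name__ == "__main__":
    stores = example_config["spec"]["containerRuntimeConfig"]["additionalArtifactStores"]
    print(validate_additional_artifact_stores(stores))
```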

      4. Machine Config Operator Implementation

      • Repository: openshift/machine-config-operator
      • Translate API config → CRI-O TOML config → MachineConfig
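
The translation chain above can be sketched as two small steps: render the store list as a CRI-O TOML drop-in, then embed that file in a MachineConfig as a base64 data URL. This is an illustration, not the Machine Config Operator's code. The [crio.image] table, the drop-in file name, and the function names are assumptions; only the additional_artifact_stores key comes from this ticket, and its final location in the CRI-O configuration depends on the upstream PR.

```python
import base64
import json
from typing import Any, Dict, List


def render_crio_dropin(stores: List[str], table: str = "crio.image") -> str:
    """Render a TOML drop-in listing the additional artifact stores.

    Assumption: the key lives in the [crio.image] table. json.dumps produces
    double-quoted strings that are valid TOML for plain ASCII paths.
    """
    quoted = ", ".join(json.dumps(store) for store in stores)
    return f"[{table}]\nadditional_artifact_stores = [{quoted}]\n"


def render_machine_config(name: str, role: str, path: str, contents: str) -> Dict[str, Any]:
    """Wrap a file in a MachineConfig using an Ignition data URL."""
    encoded = base64.b64encode(contents.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": name,
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {
            "config": {
                "ignition": {"version": "3.2.0"},
                "storage": {
                    "files": [
                        {
                            "path": path,
                            "mode": 0o644,  # serialized as decimal 420
                            "overwrite": True,
                            "contents": {
                                "source": f"data:text/plain;charset=utf-8;base64,{encoded}"
                            },
                        }
                    ]
                },
            }
        },
    }


if __name__ == "__main__":
    dropin = render_crio_dropin(["/mnt/ssd/artifacts"])
    machine_config = render_machine_config(
        name="99-worker-additional-artifact-stores",  # hypothetical name
        role="worker",
        path="/etc/crio/crio.conf.d/99-additional-artifact-stores.conf",  # hypothetical file
        contents=dropin,
    )
    print(json.dumps(machine_config, indent=2))
```

Writing a drop-in under /etc/crio/crio.conf.d/ rather than replacing the main configuration file keeps the change isolated, so the default artifact behavior is untouched when the list is empty or the feature gate is off.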

      5. Documentation

      • Issue: OSDOCS-17312
      • Tech Preview feature documentation with examples

      User Stories

      1. RHOAI Platform Operator: "I want to store large ML models on SSD storage so that model loading is faster and doesn't consume root filesystem space."
      2. Cluster Admin (Air-gapped): "I want to pre-populate artifact caches on nodes so that pods can start without pulling from external registries."
      3. Edge Deployment Operator: "I want to deliver artifacts via removable media (USB) so that edge nodes can operate offline."

      Acceptance Criteria

      • ContainerRuntimeConfig accepts additionalArtifactStores configuration
      • MCO generates correct CRI-O config from API
      • CRI-O resolves artifacts from the additional stores in the order they are configured
      • RHOAI team validates performance improvement with SSD storage
      • Documentation available for Tech Preview users
      • Feature behind TechPreviewNoUpgrade feature gate
      • No regressions for existing artifact storage behavior

      Dependencies

      Critical Path:

      • CRI-O v1.36 release (April 2026) with PR #9702 merged
      • Enhancement approval before API work begins
      • API merge before MCO implementation
      • RHOAI team for performance validation

      External:

      • Upstream CRI-O PR #9702
      • CRI-O v1.36 release schedule
      • Backport: If the API work lands for 4.22, backport the CRI-O v1.36 change (PR #9702) to v1.35

      Success Metrics

      1. Performance: Measurable improvement in ML model loading time with SSD storage
      2. Adoption: RHOAI team successfully deploys feature in testing
      3. Stability: No regressions in default artifact behavior
      4. Documentation: Clear examples for common scenarios

      Risks & Mitigation

      • Risk: CRI-O v1.36 delayed. Mitigation: PR #9702 in progress; maintain upstream communication.
      • Risk: Performance doesn't meet expectations. Mitigation: Early testing with RHOAI; benchmark vs baseline.
      • Risk: API design requires upstream KEP. Mitigation: Use OpenShift-specific API for TP; defer KEP.
      • Risk: Configuration complexity. Mitigation: Simple API design; comprehensive documentation.

      References

      • Upstream Issue: cri-o/cri-o#9570
      • Upstream PR: cri-o/cri-o#9702
      • Documentation: OSDOCS-17312
      • Related: RFE-8441 (artifact pre-loading - separate feature)
      • Stakeholders: RHOAI team (Luca Burgazzoli), Node team (Sascha Grunert)

      Out of Scope

      • Artifact pre-loading mechanisms (RFE-8441)
      • Write-capable artifact stores
      • Dynamic artifact mirroring
      • Upstream Kubernetes KEP (deferred post-TP)

              gausingh@redhat.com Gaurav Singh
              Qi Wang, Sascha Grunert
              Ryan Phillips
              Aruna Naik
              Matthew Werner
              Derrick Ornelas