Uploaded image for project: 'AI Platform Core Components'
  1. AI Platform Core Components
  2. AIPCC-8512

[PyTorch][Upstream CI] Implement Test Infrastructure with Self-Hosted Runner

    • Icon: Task Task
    • Resolution: Done
    • Icon: Undefined Undefined
    • None
    • None
    • PyTorch
    • False
    • Hide

      None

      Show
      None
    • False

      Objective

      Set up test infrastructure using self-hosted runner with H200 GPU.

      Work Completed

      • Configured self-hosted runner: test-runner-git-109-vpc
      • Set up Podman container runtime (Docker emulation)
      • Configured GPU passthrough for CUDA tests
      • Implemented test sharding (initially 3 shards, optimized to 5)
      • Added test artifact collection and reporting
      • Set up --keep-going flag for comprehensive coverage

      Test Execution

      • Full PyTorch test suite: ~20,400 tests
      • Test shards: 5 (optimized from 3)
      • Parallel execution within shards
      • Comprehensive logging and artifact collection

      Deliverables

      • [x] Self-hosted runner operational
      • [x] GPU passthrough working
      • [x] Test execution reliable
      • [x] Artifact collection functional

      References

      • Workflow: .github/workflows/rhel-build-test.yml
      • Test script: .ci/pytorch/test.sh

              rh-ee-sugeorge Subin George
              rh-ee-sugeorge Subin George
              PyTorch Infrastructure
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated:
                Resolved: