  AI Platform Core Components / AIPCC-8514

[PyTorch][Upstream CI] Test Coverage Analysis and Failure Documentation

    • Type: Task
    • Resolution: Done
    • PyTorch

      Objective

      Analyze test results, identify failures, and document all issues for the RHEL 9.6 PyTorch build.

      Work Completed

      • Analyzed 5-shard test results (~20,400 tests)
      • Identified and categorized all 215 test failures
      • Created comprehensive failure analysis reports
      • Compared RHEL vs Ubuntu CI performance and pass rates
      • Documented root causes for each failure category
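      The ticket does not say how the per-shard results were stored or tallied. As a minimal sketch, assuming each shard produced a JUnit-style XML report, the counts could be aggregated along these lines. The report contents and test names below are synthetic placeholders, not data from the actual run.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_junit(xml_text: str) -> Counter:
    """Count testcase outcomes in one JUnit-style XML report."""
    counts = Counter()
    root = ET.fromstring(xml_text)
    # iter() also finds testcases nested under <testsuites>/<testsuite>.
    for case in root.iter("testcase"):
        counts["total"] += 1
        if case.find("failure") is not None or case.find("error") is not None:
            counts["failed"] += 1
        elif case.find("skipped") is not None:
            counts["skipped"] += 1
        else:
            counts["passed"] += 1
    return counts


# Synthetic stand-ins for two shard reports (hypothetical test names).
SHARD_1 = """<testsuite>
  <testcase classname="TestA" name="test_ok"/>
  <testcase classname="TestA" name="test_bad"><failure message="boom"/></testcase>
</testsuite>"""
SHARD_2 = """<testsuites><testsuite>
  <testcase classname="TestB" name="test_skip"><skipped/></testcase>
  <testcase classname="TestB" name="test_ok"/>
</testsuite></testsuites>"""

# Sum the per-shard counters into one overall tally.
combined = Counter()
for report in (SHARD_1, SHARD_2):
    combined += summarize_junit(report)

print(dict(combined))
```

      In practice the XML strings would be read from each shard's report files; the same counters can then be summed across all five shards.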

      Test Coverage Results

      Total tests: ~20,400
      Passed: ~20,185
      Failed: ~215
      Pass rate: ~98.9%
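      The pass rate follows directly from the totals above. A short sketch of the arithmetic, using the approximate figures as reported (the exact counts are not given in this ticket), puts it just under 99%:

```python
# Approximate totals as reported in this ticket.
total_tests = 20_400
failed_tests = 215

passed_tests = total_tests - failed_tests
pass_rate = 100 * passed_tests / total_tests

print(f"Passed: ~{passed_tests:,}")
print(f"Pass rate: ~{pass_rate:.1f}%")
```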

      Failure Breakdown

      • CUTLASS Backend: ~189 tests (missing library)
      • Flex Attention: ~24 tests (H200 float16 alignment)
      • cuDNN JIT: 1 test (compilation issue)
      • RNN Flat Weights: 1 test (parameter handling)
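      As a quick consistency check, the category counts can be summed and expressed as shares of all failures. This is a sketch that treats the approximate counts above as exact, which is an assumption:

```python
# Failure counts per category, taken from the breakdown above
# (the "~" figures are treated as exact for this check).
failures = {
    "CUTLASS Backend": 189,
    "Flex Attention": 24,
    "cuDNN JIT": 1,
    "RNN Flat Weights": 1,
}

total_failed = sum(failures.values())
shares = {name: 100 * count / total_failed for name, count in failures.items()}

for name, count in failures.items():
    print(f"{name}: {count} ({shares[name]:.1f}%)")
print(f"Total: {total_failed}")
```

      The four categories add up to the 215 failures reported, with the CUTLASS Backend group accounting for the large majority.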

      Documentation Created

      • 5_SHARD_FAILURE_SUMMARY.md - Detailed failure analysis
      • 5_SHARD_ANALYSIS_REPORT.md - Performance analysis
      • TEST_FAILURE_REPORT.md - Initial failure investigation
      • RHEL_VS_UBUNTU_COMPARISON.md - Platform comparison

      Deliverables

      • [x] Complete test failure analysis
      • [x] Root cause identification for all failures
      • [x] Comprehensive documentation
      • [x] Comparison with Ubuntu baseline

              rh-ee-sugeorge Subin George
              PyTorch Infrastructure