AI Platform Core Components / AIPCC-8516

[PyTorch][Upstream CI] Fix Flex Attention Float16 Alignment Issues on H200

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Component: PyTorch

      Problem

      24 flex attention tests with float16 strided inputs fail on H200 (Hopper architecture) due to CUDA memory misalignment errors.

      Root Cause

The flex attention implementation uses strided memory access patterns that trigger misaligned-address errors on the H200's Hopper architecture (sm_90) when the inputs use the float16 data type.
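
      A minimal repro sketch (the shapes and the slicing trick are illustrative assumptions, not taken from the failing tests; assumes PyTorch >= 2.5 with flex_attention and a CUDA device):

      {code:python}
      import torch
      from torch.nn.attention.flex_attention import flex_attention

      B, H, S, D = 2, 4, 512, 64

      def make_strided():
          # Over-allocate the head dimension and slice at an odd offset, so the
          # view is non-contiguous and its base address is only 2-byte aligned.
          full = torch.randn(B, H, S, 2 * D, device="cuda", dtype=torch.float16)
          return full[..., 1 : D + 1]

      q, k, v = make_strided(), make_strided(), make_strided()

      # The fused Triton kernel is only exercised under torch.compile; this is
      # where the CUDA misaligned-address error would surface on sm_90.
      compiled = torch.compile(flex_attention)
      out = compiled(q, k, v)
      {code}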

      Impact

      • Tests failing: ~24
      • Severity: Medium - Affects float16 precision training
      • Production impact: Low - bfloat16 and float32 work fine
      • Pass rate impact: Accounts for 11% of all failures (24/215)

      Technical Details

      • Different memory alignment requirements on Hopper architecture
• Strided tensor access patterns incompatible with sm_90 float16 (see the alignment sketch after this list)
      • Issue specific to float16 (bfloat16 and float32 variants pass)
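
      Hopper's fused attention kernels rely on vectorized 128-bit loads, which require 16-byte-aligned addresses; with 2-byte float16 elements, a storage offset or stride that is not a multiple of 8 elements breaks that guarantee. A rough illustration of the constraint (a sketch, not the kernel's actual check):

      {code:python}
      import torch

      def rows_are_16b_aligned(t: torch.Tensor) -> bool:
          # Sketch of the alignment constraint, not PyTorch's actual logic:
          # every row's base address must be a multiple of 16 bytes, i.e. the
          # data pointer and all non-unit strides (in bytes) divisible by 16.
          if t.data_ptr() % 16 != 0:
              return False
          return all((s * t.element_size()) % 16 == 0 for s in t.stride() if s != 1)
      {code}

      Note that byte alignment alone would not distinguish float16 from bfloat16 (both are 2 bytes wide), so the dtype-specific failure presumably comes from how the compiled kernel is configured per dtype; that is part of what the upstream report should clarify.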

      Current Workaround

The affected tests are currently excluded in the CI workflow configuration.
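
      The exclusion itself lives in the workflow configuration; purely for illustration, an equivalent in-test guard (hypothetical test class, not the actual mechanism) could look like:

      {code:python}
      import unittest

      import torch

      # Hypothetical guard mirroring the CI exclusion; the real exclusion is
      # done in the workflow configuration, not in the test files.
      IS_SM90 = torch.cuda.is_available() and torch.cuda.get_device_capability() == (9, 0)

      @unittest.skipIf(IS_SM90, "float16 strided flex attention misaligned on sm_90 (AIPCC-8516)")
      class TestFlexAttentionFloat16Strided(unittest.TestCase):
          pass
      {code}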

      Acceptance Criteria

      • [ ] Report issue to PyTorch upstream team
• [ ] Fix memory alignment for strided float16 access on sm_90 (one possible direction is sketched after this list)
      • [ ] All 24 flex attention float16 tests pass on H200
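
      One possible direction for the alignment fix, assuming the failure is a base-address alignment problem (a mitigation sketch, not the planned upstream change):

      {code:python}
      import torch

      def align_for_sm90(t: torch.Tensor) -> torch.Tensor:
          # Hypothetical mitigation: copy misaligned float16 inputs into a
          # fresh dense allocation (CUDA allocations are at least 16-byte
          # aligned), trading one copy for a safe vectorized load path.
          if t.dtype == torch.float16 and t.data_ptr() % 16 != 0:
              return t.clone(memory_format=torch.contiguous_format)
          return t
      {code}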

    • Assignee: Subin George (rh-ee-sugeorge)
    • Reporter: Subin George (rh-ee-sugeorge)
    • Team: PyTorch Infrastructure