Type: Bug
Resolution: Unresolved
Problem
24 flex attention tests with float16 strided inputs fail on H200 (Hopper architecture) due to CUDA memory misalignment errors.
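A minimal repro sketch of the input pattern described above, assuming the strided inputs come from slicing a wider buffer; the shapes, slicing, and use of torch.compile are illustrative guesses, not taken from the failing tests:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 128, 64  # illustrative shapes (assumption)

# Slice one wide buffer so q/k/v are strided (non-contiguous) views
# rather than freshly allocated contiguous tensors.
buf = torch.randn(B, H, S, 3 * D, device="cuda", dtype=torch.float16)
q, k, v = buf[..., :D], buf[..., D:2 * D], buf[..., 2 * D:]

compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v)  # reportedly hits a CUDA misalignment error on sm_90
```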
Root Cause
The flex attention implementation uses strided memory access patterns that trigger misaligned accesses on the H200's Hopper architecture (sm_90) when using the float16 data type.
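The alignment gap is easy to observe directly: a strided float16 view can start at an address that is not a multiple of the vector-load width, even though the underlying allocation is aligned. A small probe, assuming a 16-byte alignment requirement (a common constraint for vectorized loads; the exact Hopper requirement is an assumption here):

```python
import torch

t = torch.randn(2, 8, 128, 130, device="cuda", dtype=torch.float16)
view = t[..., 1:65]  # storage offset of one float16 element (2 bytes)

# The base allocation is aligned, so the vector-load width divides its
# address; the strided view starts 2 bytes in and no longer does.
print(t.data_ptr() % 16)     # 0
print(view.data_ptr() % 16)  # 2
```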
Impact
- Tests failing: 24
- Severity: Medium - Affects float16 precision training
- Production impact: Low - bfloat16 and float32 work fine
- Pass rate impact: Accounts for 11% of all failures (24/215)
Technical Details
- Different memory alignment requirements on Hopper architecture
- Strided tensor access patterns incompatible with sm_90 float16 (see the mitigation sketch after this list)
- Issue specific to float16 (bfloat16 and float32 variants pass)
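Consistent with the details above, copying strided float16 inputs into contiguous storage sidesteps the misaligned path. This hypothetical wrapper is a user-side mitigation only, not the upstream fix called for below:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def flex_attention_fp16_safe(q, k, v, **kwargs):
    # Copy strided float16 inputs into contiguous (hence aligned) storage
    # before dispatch; trades extra memory traffic for avoiding the
    # misaligned-access path on sm_90.
    if q.dtype == torch.float16:
        q, k, v = q.contiguous(), k.contiguous(), v.contiguous()
    return flex_attention(q, k, v, **kwargs)
```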
Current Workaround
The affected tests are excluded in the CI workflow configuration.
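For reference, an in-test equivalent of that exclusion might look like the following; the decorator name and skip condition are assumptions, since the real exclusion lives in the workflow configuration rather than in test code:

```python
import unittest
import torch

IS_SM90 = torch.cuda.is_available() and torch.cuda.get_device_capability() == (9, 0)

# Skip the float16 strided variants only on Hopper (sm_90) devices such as H200.
skip_fp16_strided_on_sm90 = unittest.skipIf(
    IS_SM90,
    "flex attention float16 strided inputs hit CUDA misalignment on sm_90",
)
```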
Acceptance Criteria
- [ ] Report the issue to the PyTorch upstream team
- [ ] Fix memory alignment for strided float16 access on sm_90
- [ ] All 24 flex attention float16 tests pass on H200
References
- Workflow run: https://github.com/subinz1/pytorch/actions/runs/20745368086
- Documentation: 5_SHARD_FAILURE_SUMMARY.md