Bug
Resolution: Unresolved
Test Class: distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShard1DTrainingCore
Number of Failing Tests: 1
Platform: Distributed
Test Type: Unit Test
Error Pattern: single_issue
Summary:
A test in TestFullyShard1DTrainingCore is failing with a segmentation fault in a distributed multi-GPU training environment.
Version Information:
- PyTorch Commit: 6bdd8c9
- Branch: main
- Test Date: 2026-01-13
- Python Version: 3.12
- Sprint: Sprint 24
Failure Pattern:
The test fails with a segmentation fault in libgomp during distributed training
Common Error:
RuntimeError: Process 3 exited with error code 10 and exception:
Segmentation fault (Address not mapped to object [(nil)])
Exception raised from ncclCommInitRank at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:7353
Failing Tests:
1. test_train_parity_multi_group
Steps to Reproduce:
1. Run test:
TEST_CONFIG=distributed python3 test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training
2. Observe segmentation fault in process 3
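If only the failing test needs to run, a narrower invocation is possible. The sketch below is an assumption rather than the official reproduction path: it presumes pytest is available, the working directory is the PyTorch repository root, and enough GPUs are present for the test to spawn its worker processes.
import pytest

# Hypothetical single-test reproduction; -k selects the one failing test
# listed above, and -x stops at the first failure so the crashing worker
# process is easier to identify in the output.
raise SystemExit(pytest.main([
    "test/distributed/_composable/fsdp/test_fully_shard_training.py",
    "-k", "test_train_parity_multi_group",
    "-x",
]))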
Expected Result:
Test should complete successfully across all distributed processes
Actual Result:
Process 3 crashes with a segmentation fault, likely in the OpenMP library (libgomp), during NCCL communicator initialization
Root Cause Analysis:
The segmentation fault occurs during NCCL communicator initialization (ncclCommInitRank). Two observations point to a likely cause:
1. The error message indicates a null pointer dereference in libgomp (Address not mapped to object: (nil))
2. The multi-group training setup may be exposing a race condition or memory issue during initialization
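For reference, a minimal sketch of a multi-group FSDP setup, assuming "multi group" refers to applying fully_shard per layer plus at the root; the import path, model shape, and launch details are illustrative and not taken from the failing test.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard  # import path assumed from the test location

def main() -> None:
    # Launched with torchrun; LOCAL_RANK is set by the launcher.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # NCCL communicators (where ncclCommInitRank is reported to crash)
    # are created for this process group.
    dist.init_process_group(backend="nccl")

    model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(4)]).cuda()
    for layer in model:
        fully_shard(layer)   # one FSDP parameter group per layer
    fully_shard(model)       # root group for any remaining parameters

    loss = model(torch.randn(8, 64, device="cuda")).sum()
    loss.backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
Run with something like torchrun --nproc_per_node=4 repro_multi_group.py (filename hypothetical) to mirror the multi-process environment of the original test.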
Potential Solutions:
1. Investigate NCCL and OpenMP compatibility issues
2. Check for race conditions in multi-group FSDP initialization
3. Verify proper memory allocation for NCCL communicators
4. Test with different OpenMP thread settings (a sketch follows this list)
5. Review recent changes to FSDP multi-group support
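As a starting point for solution 4 (and the NCCL/OpenMP compatibility check in solution 1), a hedged debugging sketch: it constrains OpenMP to a single thread and turns on NCCL's initialization logging before re-running the suite. OMP_NUM_THREADS and NCCL_DEBUG are standard OpenMP/NCCL knobs; whether they change the crash is an open question, not a known fix.
import os
import subprocess

env = dict(os.environ)
env["OMP_NUM_THREADS"] = "1"   # single OpenMP thread, to rule out a libgomp threading race
env["NCCL_DEBUG"] = "INFO"     # log communicator initialization per rank
env["TEST_CONFIG"] = "distributed"

# Re-run the reproduction command from Steps to Reproduce under the adjusted environment.
subprocess.run(
    ["python3", "test/run_test.py", "-i",
     "distributed/_composable/fsdp/test_fully_shard_training"],
    env=env,
    check=True,
)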
Priority: P2