AI Platform Core Components
AIPCC-8931

[QA][PyTorch UT][Distributed] distributed/_composable/fsdp/test_fully_shard_training.py - TestFullyShard1DTrainingCore failures


    • Type: Bug
    • Resolution: Unresolved
    • Component: PyTorch

      Test Class: distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShard1DTrainingCore
      Number of Failing Tests: 1
      Platform: Distributed
      Test Type: Unit Test
      Error Pattern: single_issue

      Summary:
      A test in TestFullyShard1DTrainingCore fails with a segmentation fault in a distributed multi-GPU training environment.

      Version Information:

      • PyTorch Commit: 6bdd8c9
      • Branch: main
      • Test Date: 2026-01-13
      • Python Version: 3.12
      • Sprint: Sprint 24

      Failure Pattern:
      The test fails with a segmentation fault in libgomp during distributed training.

      Common Error:

      RuntimeError: Process 3 exited with error code 10 and exception:
      Segmentation fault (Address not mapped to object [(nil)])
      
      Exception raised from ncclCommInitRank at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:7353
      

      Failing Tests:
      1. test_train_parity_multi_group

      Steps to Reproduce:
      1. Run test:

         TEST_CONFIG=distributed python3 test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training
         

      2. Observe the segmentation fault in process 3 (an environment sanity check sketch follows below)
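
      Before reproducing, it can help to confirm that the multi-GPU NCCL environment itself is healthy. The snippet below is a hypothetical sanity check (not part of the test suite) that prints CUDA, GPU-count, and NCCL build information for the node under test.

         # Hypothetical environment sanity check (not part of the test suite).
         import torch
         import torch.distributed as dist

         print("CUDA available:", torch.cuda.is_available())
         print("GPU count:", torch.cuda.device_count())
         print("NCCL available:", dist.is_nccl_available())
         print("NCCL version:", torch.cuda.nccl.version())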

      Expected Result:
      The test should complete successfully across all distributed processes.

      Actual Result:
      Process 3 crashes with a segmentation fault, likely in the OpenMP runtime (libgomp), during NCCL communicator initialization.

      Root Cause Analysis:
      The segmentation fault:
      1. Occurs during NCCL communicator initialization (ncclCommInitRank)
      2. Points to a null-pointer dereference in libgomp (Address not mapped to object [(nil)])
      3. May be exposed by the multi-group training setup through a race condition or memory issue (a minimal multi-group sketch follows this list)
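
      For context, the sketch below illustrates the kind of multi-group setup that test_train_parity_multi_group exercises: applying fully_shard to individual sub-modules and then to the root creates several parameter groups, and each group's first collective triggers NCCL communicator creation, which is where the crash is reported. This is not the actual test body; the model and sizes are placeholders, the import path is assumed from the test module location, and it would be launched with torchrun on a multi-GPU node.

         # Minimal multi-group fully_shard sketch (hypothetical, not the real test).
         import os
         import torch
         import torch.nn as nn
         import torch.distributed as dist
         from torch.distributed._composable.fsdp import fully_shard

         def main():
             dist.init_process_group(backend="nccl")
             torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

             model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16)).cuda()
             for layer in model:
                 fully_shard(layer)   # one FSDP parameter group per layer
             fully_shard(model)       # root group wrapping the remainder

             # The first forward/backward triggers lazy NCCL communicator
             # initialization (ncclCommInitRank) for each group.
             out = model(torch.randn(4, 16, device="cuda"))
             out.sum().backward()
             dist.destroy_process_group()

         if __name__ == "__main__":
             main()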

      Potential Solutions:
      1. Investigate NCCL and OpenMP compatibility issues
      2. Check for race conditions in multi-group FSDP initialization
      3. Verify proper memory allocation for NCCL communicators
      4. Test with different OpenMP thread settings (see the sketch after this list)
      5. Review recent changes to FSDP multi-group support
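
      As a quick experiment for item 4, the harness below (hypothetical, not part of the suite) re-runs the reproduction command with OMP_NUM_THREADS pinned to 1 and NCCL debug logging enabled; if the libgomp frame disappears with a single OpenMP thread, that points at an OpenMP/NCCL interaction. OMP_NUM_THREADS must be set before the child processes import torch, so it is injected via the subprocess environment.

         # Hypothetical re-run with restricted OpenMP threads and verbose NCCL logs.
         import os
         import subprocess

         env = dict(os.environ)
         env.update({
             "OMP_NUM_THREADS": "1",   # single OpenMP thread per rank
             "NCCL_DEBUG": "INFO",     # verbose NCCL logs around ncclCommInitRank
             "TEST_CONFIG": "distributed",
         })

         subprocess.run(
             ["python3", "test/run_test.py", "-i",
              "distributed/_composable/fsdp/test_fully_shard_training"],
             env=env,
             check=False,
         )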

      Priority: P2

              Assignee: Unassigned
              Team: pytorch-engineering (PyTorch Engineering)