AI Platform Core Components
AIPCC-8931

[QA][PyTorch UT][Distributed] distributed/_composable/fsdp/test_fully_shard_training.py - TestFullyShard1DTrainingCore failures


    • Type: Bug
    • Resolution: Unresolved
    • Component: PyTorch

      Test Class: distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShard1DTrainingCore
      Number of Failing Tests: 1
      Platform: Distributed
      Test Type: Unit Test
      Error Pattern: single_issue

      Summary:
      A test in TestFullyShard1DTrainingCore fails with a segmentation fault in a distributed multi-GPU training environment.

      Version Information:

      • PyTorch Commit: 6bdd8c9
      • Branch: main
      • Test Date: 2026-01-13
      • Python Version: 3.12
      • Sprint: Sprint 24

      Failure Pattern:
      The test fails with a segmentation fault in libgomp during distributed training.

      Common Error:

      RuntimeError: Process 3 exited with error code 10 and exception:
      Segmentation fault (Address not mapped to object [(nil)])
      
      Exception raised from ncclCommInitRank at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:7353
      

      Failing Tests:
      1. test_train_parity_multi_group

      Steps to Reproduce:
      1. Run test:

         TEST_CONFIG=distributed python3 test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training
         

      2. Observe the segmentation fault in process 3 (an environment sanity check sketch follows below)
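
      Before reproducing, it can help to confirm that the multi-GPU NCCL environment itself is healthy. The snippet below is a hypothetical sanity check (not part of the test suite) that prints CUDA, GPU-count, and NCCL build information for the node under test.

         # Hypothetical environment sanity check (not part of the test suite).
         import torch
         import torch.distributed as dist

         print("CUDA available:", torch.cuda.is_available())
         print("GPU count:", torch.cuda.device_count())
         print("NCCL available:", dist.is_nccl_available())
         print("NCCL version:", torch.cuda.nccl.version())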

      Expected Result:
      The test should complete successfully across all distributed processes.

      Actual Result:
      Process 3 crashes with a segmentation fault, likely in the OpenMP runtime (libgomp), during NCCL communicator initialization.

      Root Cause Analysis:
      The segmentation fault:
      1. Occurs during NCCL communicator initialization (ncclCommInitRank)
      2. Points to a null-pointer dereference in libgomp (Address not mapped to object [(nil)])
      3. May be exposed by the multi-group training setup through a race condition or memory issue (a minimal multi-group sketch follows this list)
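
      For context, the sketch below illustrates the kind of multi-group setup that test_train_parity_multi_group exercises: applying fully_shard to individual sub-modules and then to the root creates several parameter groups, and each group's first collective triggers NCCL communicator creation, which is where the crash is reported. This is not the actual test body; the model and sizes are placeholders, the import path is assumed from the test module location, and it would be launched with torchrun on a multi-GPU node.

         # Minimal multi-group fully_shard sketch (hypothetical, not the real test).
         import os
         import torch
         import torch.nn as nn
         import torch.distributed as dist
         from torch.distributed._composable.fsdp import fully_shard

         def main():
             dist.init_process_group(backend="nccl")
             torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

             model = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16)).cuda()
             for layer in model:
                 fully_shard(layer)   # one FSDP parameter group per layer
             fully_shard(model)       # root group wrapping the remainder

             # The first forward/backward triggers lazy NCCL communicator
             # initialization (ncclCommInitRank) for each group.
             out = model(torch.randn(4, 16, device="cuda"))
             out.sum().backward()
             dist.destroy_process_group()

         if __name__ == "__main__":
             main()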

      Potential Solutions:
      1. Investigate NCCL and OpenMP compatibility issues
      2. Check for race conditions in multi-group FSDP initialization
      3. Verify proper memory allocation for NCCL communicators
      4. Test with different OpenMP thread settings (see the sketch after this list)
      5. Review recent changes to FSDP multi-group support
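
      As a quick experiment for item 4, the harness below (hypothetical, not part of the suite) re-runs the reproduction command with OMP_NUM_THREADS pinned to 1 and NCCL debug logging enabled; if the libgomp frame disappears with a single OpenMP thread, that points at an OpenMP/NCCL interaction. OMP_NUM_THREADS must be set before the child processes import torch, so it is injected via the subprocess environment.

         # Hypothetical re-run with restricted OpenMP threads and verbose NCCL logs.
         import os
         import subprocess

         env = dict(os.environ)
         env.update({
             "OMP_NUM_THREADS": "1",   # single OpenMP thread per rank
             "NCCL_DEBUG": "INFO",     # verbose NCCL logs around ncclCommInitRank
             "TEST_CONFIG": "distributed",
         })

         subprocess.run(
             ["python3", "test/run_test.py", "-i",
              "distributed/_composable/fsdp/test_fully_shard_training"],
             env=env,
             check=False,
         )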

      Priority: P2

              Assignee: Unassigned
              Team: pytorch-engineering (PyTorch Engineering)