AI Platform Core Components / AIPCC-8929

[QA][PyTorch UT][Distributed] distributed/tensor/test_dtensor_testbase.py - DTensorTestBaseUtilCPUTest failures


    • Type: Bug
    • Resolution: Unresolved
    • PyTorch

      Test Class: distributed/tensor/test_dtensor_testbase.py::DTensorTestBaseUtilCPUTest
      Number of Failing Tests: 1
      Platform: Distributed
      Test Type: Unit Test
      Error Pattern: single_issue

      Summary:
      One test in DTensorTestBaseUtilCPUTest, test_dtensor_testbase_destroy_pg, fails in the distributed environment: the first attempt hits a process timeout, and the retry fails with a resource unavailability error.

      Version Information:

      • PyTorch Commit: 6bdd8c9
      • Branch: main
      • Test Date: 2026-01-13
      • Python Version: 3.12
      • Sprint: Sprint 24

      Failure Pattern:
      The test times out after 300 seconds. The automatic retry then fails with "Resource temporarily unavailable" while constructing ProcessGroupGloo.

      Common Error:

      RuntimeError: Process 0 terminated or timed out after 300.08277130126953 seconds
      
      On retry:
      RuntimeError: Resource temporarily unavailable
        backend_class = ProcessGroupGloo(...)
      

      Failing Tests:
      1. test_dtensor_testbase_destroy_pg

      Steps to Reproduce:
      1. Run test:

         TEST_CONFIG=cpu python3 test/run_test.py -i distributed/tensor/test_dtensor_testbase
         TEST_CONFIG=distributed python3 test/run_test.py -i distributed/tensor/test_dtensor_testbase
         

      2. Observe process timeout followed by resource unavailability on retry

      Expected Result:
      The test should complete within the timeout, and its process groups should be destroyed cleanly.

      Actual Result:
      Process 0 times out after 300 seconds. The subsequent retry fails with a resource unavailability error when initializing ProcessGroupGloo.

      Root Cause Analysis:
      The cause is not yet confirmed. The errors point to one or more of the following:
      1. Process group destruction taking too long or hanging
      2. Resource exhaustion (possibly file descriptors or network ports) preventing ProcessGroupGloo initialization
      3. Improper cleanup of previous test processes
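To help narrow down the resource exhaustion hypothesis, the sketch below prints the per-process limits that most often sit behind this kind of error. It rests on an assumption: that the "Resource temporarily unavailable" text in the RuntimeError is the operating system's message for errno EAGAIN (it is the usual Linux message for that errno), which the traceback above does not confirm. The helper name `describe_limits` is hypothetical and not part of PyTorch.

```python
import errno
import os
import resource


def describe_limits():
    """Return (soft, hard) limits for resources that commonly lead to EAGAIN."""
    limits = {}
    for name in ("RLIMIT_NOFILE", "RLIMIT_NPROC"):
        # RLIMIT_NPROC is not defined on every platform, so look it up defensively.
        const = getattr(resource, name, None)
        if const is not None:
            limits[name] = resource.getrlimit(const)
    return limits


if __name__ == "__main__":
    # Assumption: the RuntimeError text mirrors the OS message for EAGAIN.
    print("EAGAIN message on this host:", os.strerror(errno.EAGAIN))
    for name, (soft, hard) in describe_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

Running this on the CI host before the test starts gives a baseline. A low RLIMIT_NPROC or RLIMIT_NOFILE soft limit would make items 2 and 3 above more plausible, but it would not prove either.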

      Potential Solutions:
      1. Increase timeout for process group operations
      2. Ensure proper cleanup of resources between test runs
      3. Investigate why ProcessGroupGloo initialization fails with resource unavailability
      4. Check for leaked file descriptors or network connections
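For solution 4, one low-cost check is to compare the number of open file descriptors before and after a step that is suspected of leaking. The following is a minimal sketch using only the standard library. It assumes a Linux or macOS host where the descriptor directory is listable, and the helper names `open_fd_count` and `fd_delta` are hypothetical. It counts sockets as well as files, since both are file descriptors.

```python
import os

# /proc/self/fd exists on Linux; /dev/fd is the fallback on macOS.
_FD_DIR = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"


def open_fd_count():
    """Count file descriptors currently open in this process."""
    # Listing the directory briefly opens one descriptor itself; that extra
    # entry appears in every measurement, so it cancels out in a delta.
    return len(os.listdir(_FD_DIR))


def fd_delta(fn):
    """Run fn() and return (result, change in the open descriptor count)."""
    before = open_fd_count()
    result = fn()
    after = open_fd_count()
    return result, after - before


if __name__ == "__main__":
    handle, delta = fd_delta(lambda: open(os.devnull))
    print("descriptors gained while the file is held open:", delta)
    handle.close()
```

In the context of this issue, `fn` could wrap the process group setup and teardown inside a single rank. A positive delta after teardown would suggest a leak in that rank, though it says nothing about descriptors held by other ranks or by processes left over from the timed-out run.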

      Priority: P2

              Assignee: Unassigned
              Reporter: PyTorch Engineering