Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: PyTorch
Labels:
- pytorch_qa

Blocked: False
Blocked Reason: None
Ready: False

Test Class: distributed/tensor/test_dtensor_testbase.py::DTensorTestBaseUtilCPUTest
Number of Failing Tests: 1
Platform: Distributed
Test Type: Unit Test
Error Pattern: single_issue

Summary:
test_dtensor_testbase_destroy_pg in DTensorTestBaseUtilCPUTest fails in the distributed environment: the process times out, and the retry then fails with a "Resource temporarily unavailable" error.

Version Information:

PyTorch Commit: 6bdd8c9
Branch: main
Test Date: 2026-01-13
Python Version: 3.12
Sprint: Sprint 24

Failure Pattern:
The test times out after about 300 seconds, then fails on retry with "Resource temporarily unavailable".

Common Error:

RuntimeError: Process 0 terminated or timed out after 300.08277130126953 seconds

On retry:
RuntimeError: Resource temporarily unavailable
  backend_class = ProcessGroupGloo(...)

Failing Tests:
1. test_dtensor_testbase_destroy_pg

Steps to Reproduce:
1. Run test:

   TEST_CONFIG=cpu python3 test/run_test.py -i distributed/tensor/test_dtensor_testbase
   TEST_CONFIG=distributed python3 test/run_test.py -i distributed/tensor/test_dtensor_testbase

2. Observe process timeout followed by resource unavailability on retry

Expected Result:
The test should complete within the timeout, and the process groups should be destroyed cleanly.

Actual Result:
Process 0 times out after about 300 seconds. The subsequent retry fails with "Resource temporarily unavailable" while initializing ProcessGroupGloo.

Root Cause Analysis:
The root cause has not been confirmed. The symptoms point to one or more of the following:
1. Process group destruction taking too long or hanging
2. Resource exhaustion (possibly file descriptors or network ports) preventing ProcessGroupGloo initialization
3. Improper cleanup of processes left over from the previous (timed-out) test run
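
Item 2 above names network ports as one possibly exhausted resource. The following is a minimal sketch, using only the Python standard library, for checking whether a TCP port can still be bound on the test host. The helper name is_port_free and the default host 127.0.0.1 are illustrative assumptions and are not part of the PyTorch test suite.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical diagnostic helper; not part of PyTorch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket still holds the port.
            return False
    return True
```

Calling this between the timed-out attempt and the retry, for whichever port the run uses for rendezvous (if any), may help tell a port still held by a leftover process apart from other kinds of resource exhaustion.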

Potential Solutions:
1. Increase timeout for process group operations
2. Ensure proper cleanup of resources between test runs
3. Investigate why ProcessGroupGloo initialization fails with resource unavailability
4. Check for leaked file descriptors or network connections
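
For solution 4, one possible way to look for leaked file descriptors is to count the descriptors open in the test process before and after the suspect step. This is a sketch under stated assumptions: count_open_fds is a hypothetical helper, it relies on /proc/self/fd (Linux) where available, and it otherwise falls back to probing descriptor numbers up to an arbitrary cap of 4096.

```python
import os
import resource


def count_open_fds() -> int:
    """Count file descriptors currently open in this process.

    Hypothetical diagnostic helper; not part of PyTorch.
    """
    fd_dir = "/proc/self/fd"
    if os.path.isdir(fd_dir):
        # Linux: one entry per open descriptor. The listing itself uses a
        # temporary descriptor, which adds the same offset to every call.
        return len(os.listdir(fd_dir))

    # Fallback: probe descriptor numbers up to the soft limit (capped).
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft > 4096:
        soft = 4096
    count = 0
    for fd in range(soft):
        try:
            os.fstat(fd)
            count += 1
        except OSError:
            pass  # Descriptor number is not in use.
    return count
```

A count that grows across repeated setup and teardown cycles, for example when logged before process group creation and again after destruction, would suggest a leak. A stable count would point away from file descriptors as the exhausted resource.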

Priority: P2

Mentioned on:
Issue - [AIPCC-8929] DTensorTestBaseUtilCPUTest - PyTorch Test Failure

Assignee: Unassigned
Reporter: PyTorch Engineering
Votes: 0
Watchers: 2
Created: 2026/01/20 8:08 AM
Updated: 2026/02/02 7:25 PM
