  1. AI Platform Core Components
  2. AIPCC-8932

[QA][PyTorch UT][CPU] test_foreach.py - TestForeachCUDA failure


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • PyTorch

      Description

      Summary:
      The TestForeachCUDA test class fails during PyTorch CPU unit test execution: one test raises a RuntimeError while comparing tensors.

      Test Class: test_foreach.py::TestForeachCUDA
      Number of Failing Tests: 1
      Platform: CPU
      Test Type: Unit Test

      Version Information:

      • PyTorch Commit: 6bdd8c9
      • Branch: main
      • Test Date: 2026-01-14
      • Python Version: 3.12.11
      • Sprint: Sprint 24

      Failure Pattern:
      Single root cause - tensor comparison failure

      Common Error:

      RuntimeError: Comparing
      
      TensorOrArrayPair(
          id=(),
          actual=tensor([1., 1., 1.,  ..., 1., 1., 1.], device="cuda:0"),
          expected=tensor([1., 1., 1.,  ..., 1., 1., 1.], device="cuda:0"),
          rtol=1.3e-06,
          atol=1e-05,
          equal_nan=True,
          check_device=False,
      
      resulted in the unexpected exception above. If you are a user and see this message during normal operation please file an issue at https://github.com/pytorch/pytorch/issues.
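
The wrapper message states that the real exception was printed above it, so a first triage step is to find that earlier exception in the log. The following is a minimal sketch, assuming the log is plain text containing standard Python tracebacks; the function name and the regular expression are illustrative and not part of PyTorch.

```python
import re

# Text appended by the comparison wrapper after the failing pair (taken from the error above).
WRAPPER_MARKER = "resulted in the unexpected exception above"

# Assumption: exception lines start with a name ending in "Error" or "Exception".
EXC_LINE = re.compile(r"^\s*([A-Za-z_][\w.]*(?:Error|Exception))\b:?\s*(.*)$")


def underlying_exceptions(log_text):
    """Return (name, message) of the exception line preceding each wrapper message."""
    lines = log_text.splitlines()
    found = []
    for idx, line in enumerate(lines):
        if WRAPPER_MARKER not in line:
            continue
        # Walk back to the "RuntimeError: Comparing" line that opens the wrapper.
        start = idx
        while start > 0 and "RuntimeError: Comparing" not in lines[start]:
            start -= 1
        # The original exception is the nearest exception line before the wrapper.
        for prev in range(start - 1, -1, -1):
            match = EXC_LINE.match(lines[prev])
            if match:
                found.append((match.group(1), match.group(2)))
                break
    return found
```

Calling `underlying_exceptions(open(path).read())` on the test log should list the exception that the comparison actually hit, which is more informative than the wrapper shown above.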
      

      Failing Tests:
      1. test_foreach_copy_with_multi_dtypes_large_input_cuda

      Steps to Reproduce:
      1. Set up PyTorch environment with Python 3.12
      2. Run the test_foreach module, which contains the failing test, under the CPU and CUDA configurations:

         TEST_CONFIG=cpu python3 test/run_test.py -i test_foreach
         TEST_CONFIG=cuda python3 test/run_test.py -i test_foreach
         

      3. Observe RuntimeError during tensor comparison
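
Running the whole test_foreach module is slow when only one test fails. The sketch below builds a command that runs just the failing test. It assumes the command is launched from the PyTorch repository root and that test/test_foreach.py forwards unittest's `-k` name filter, as PyTorch test files generally do; the helper names are hypothetical.

```python
import subprocess
import sys

# Test name taken from the "Failing Tests" section of this ticket.
FAILING_TEST = "test_foreach_copy_with_multi_dtypes_large_input_cuda"


def single_test_command(test_name, test_file="test/test_foreach.py"):
    """Build the argv that runs one test by name using unittest's -k filter."""
    return [sys.executable, test_file, "-k", test_name]


def run_single_test(test_name):
    """Run the test from the repository root and return the process exit code."""
    completed = subprocess.run(single_test_command(test_name), check=False)
    return completed.returncode
```

`run_single_test(FAILING_TEST)` returns a nonzero exit code while the failure reproduces.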

      Expected Result:
      Test should pass with foreach copy operations producing correct tensor values

      Actual Result:
      Test fails with RuntimeError during tensor comparison despite values appearing identical

      Root Cause Analysis:
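      The root cause is not yet confirmed. The failure indicates:

      • The tensor comparison raised an unexpected exception instead of reporting a value mismatch
      • The printed values look identical (the output is truncated), so the failure comes from an internal error rather than a visible difference
      • Multi-dtype handling in foreach copy operations may be involved (unverified)
      • A dtype conversion or comparison logic bug is possible (unverified)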

      Potential Solutions:
      1. Investigate tensor comparison framework for multi-dtype scenarios
      2. Check foreach copy implementation for large input handling
      3. Review dtype conversion logic in foreach operations
      4. Verify comparison tolerances are appropriate for mixed dtypes
      5. Check for edge cases in tensor comparison with equal_nan=True
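
For item 4, the tolerances in the error output can be checked by hand. The sketch below is a scalar illustration of the closeness rule documented for `torch.testing.assert_close`, namely |actual - expected| <= atol + rtol * |expected|; it is not PyTorch's implementation, and the function name is hypothetical.

```python
import math

# Tolerances copied from the error output in this ticket.
RTOL = 1.3e-06
ATOL = 1e-05


def is_close(actual, expected, rtol=RTOL, atol=ATOL, equal_nan=True):
    """Scalar version of the closeness rule: |a - e| <= atol + rtol * |e|."""
    if math.isnan(actual) or math.isnan(expected):
        # NaNs only match each other, and only when equal_nan is set.
        return equal_nan and math.isnan(actual) and math.isnan(expected)
    if math.isinf(actual) or math.isinf(expected):
        # Infinities must match exactly, including sign.
        return actual == expected
    return abs(actual - expected) <= atol + rtol * abs(expected)
```

With these tolerances, two values of exactly 1.0 are trivially close, so if the full tensors really are all ones, the tolerances alone would not explain the failure.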

      Logs:
      Test execution logs available in CPU test suite
      Log location: /home/ktanmay/Downloads/Run 1-20260120T060019Z-1-001/Run 1/20260114_024940_commit_6bdd8c9/cpu_tests.log

      Additional Context:

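      • The TestForeachCUDA class ran as part of the CPU test suite, and the error shows tensors on cuda:0, so a CUDA device was visible during the run (this may be a multi-device test)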
      • Test involves copying with multiple dtypes and large input sizes
      • Related ticket AIPCC-8265 exists for similar failure on sGPU platform
      • Error message suggests this is an internal comparison framework issue

      Severity: Medium
      Priority: P3

              Unassigned
              PyTorch Engineering (pytorch-engineering)