AI Platform Core Components / AIPCC-8265

[QA][PyTorch UT][sGPU] test/test_foreach.py - TestForeachCUDA failures

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • PyTorch

      Test Class: test/test_foreach.py::TestForeachCUDA
      Failing Tests: 1
      Error Pattern: related_issues

      Description

      Summary:
      1 test in TestForeachCUDA is failing during PyTorch unit test execution on the sGPU platform.

      Test Class: test/test_foreach.py::TestForeachCUDA
      Number of Failing Tests: 1
      Platform: sGPU
      Test Type: Unit Test

      Version Information:

      • PyTorch Commit: 4816fd9
      • Test Date: 2025-12-22
      • Pipeline ID: 2217097191
      • Platform: sGPU

      Failure Pattern:
      The failing test produces 3 related error patterns, which likely share a common root cause.

      Error Patterns:
      1.

      File "/miniconda/envs/cuda_torch_build/lib/python3.12/site-packages/torch/testing/_comparison.py", line 1298, in not_close_error_metas

      2.

      CUDA out of memory. Tried to allocate 8.00 GiB. GPU 0 has a total capacity of 139.80 GiB of which 96.00 GiB is free. Process 1190 has 524.00 MiB memory in use. Process 2903636 has 522.00 MiB memory in use. Process 2903703 has 1.92 GiB memory in use. Process 2921375 has 522.00 MiB memory in use. Including non-PyTorch memory, this process has 39.77 GiB memory in use. Process 2930046 has 522.00 MiB memory in use. 46.13 GiB allowed; Of the allocated memory 39.00 GiB is allocated by PyTorch, and 12.0

      3.

      CUDA out of memory. Tried to allocate 8.00 GiB. GPU 0 has a total capacity of 139.80 GiB of which 96.52 GiB is free. Process 1190 has 524.00 MiB memory in use. Process 2903636 has 522.00 MiB memory in use. Process 2903703 has 1.92 GiB memory in use. Process 2921375 has 522.00 MiB memory in use. Including non-PyTorch memory, this process has 39.77 GiB memory in use. 46.13 GiB allowed; Of the allocated memory 39.00 GiB is allocated by PyTorch, and 12.00 MiB is reserved by PyTorch but unallocated. 
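The figures in patterns 2 and 3 hint at why an 8.00 GiB request can fail while roughly 96 GiB of the device is free: the message also reports "46.13 GiB allowed", which reads like a per-process memory limit. That reading is an assumption, since the report does not say how or whether such a limit was configured. The following is a minimal sketch of the arithmetic, using only numbers quoted in pattern 3; the helper names `gib` and `summarize_oom` are hypothetical.

```python
import re

# Excerpt of error pattern 3 from this report (only the fields used below).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 8.00 GiB. "
    "GPU 0 has a total capacity of 139.80 GiB of which 96.52 GiB is free. "
    "Including non-PyTorch memory, this process has 39.77 GiB memory in use. "
    "46.13 GiB allowed; Of the allocated memory 39.00 GiB is allocated by PyTorch"
)


def gib(pattern: str, text: str) -> float:
    """Return the GiB figure captured by the first group of `pattern`."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern!r}")
    return float(match.group(1))


def summarize_oom(text: str) -> dict:
    """Compare the requested allocation with the free memory and the reported limit."""
    requested = gib(r"Tried to allocate ([\d.]+) GiB", text)
    total = gib(r"total capacity of ([\d.]+) GiB", text)
    free = gib(r"of which ([\d.]+) GiB is free", text)
    allowed = gib(r"([\d.]+) GiB allowed", text)
    allocated = gib(r"([\d.]+) GiB is allocated by PyTorch", text)
    return {
        # Lower bound on what the process would hold after the request.
        "needed": allocated + requested,
        "fits_on_device": requested <= free,
        # Assumption: "allowed" acts as a cap on this process's allocations.
        "fits_under_limit": allocated + requested <= allowed,
        "limit_fraction": allowed / total,
    }


summary = summarize_oom(OOM_MESSAGE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Under that assumption, the request fits on the device but not under the limit: 39.00 GiB already allocated plus 8.00 GiB requested exceeds 46.13 GiB, and the limit is about one third of the 139.80 GiB total capacity.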

      Failing Tests:
      1. test_foreach_copy_with_multi_dtypes_large_input_cuda

      Steps to Reproduce:
      1. Pull the PyTorch test image
      2. Run the failing test class:

         TEST_CONFIG=cuda python3 test/run_test.py -i test_foreach
         

      3. Observe test failures
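To narrow the reproduction to the single failing test, one option is to select it by name. The sketch below only builds and prints such a command; it assumes pytest is installed in the test image and that it is run from the root of the PyTorch checkout, neither of which the report confirms. The function name `single_test_command` is hypothetical.

```python
import shlex
import sys

TEST_FILE = "test/test_foreach.py"
FAILING_TEST = "test_foreach_copy_with_multi_dtypes_large_input_cuda"


def single_test_command(test_file: str = TEST_FILE, test_name: str = FAILING_TEST) -> list:
    """Build a pytest command that selects one test by name with -k."""
    return [sys.executable, "-m", "pytest", test_file, "-k", test_name, "-v"]


# Print the command so it can be copied into a shell inside the test image.
print(shlex.join(single_test_command()))
```

Running the test on its own, rather than through the full test_foreach run, may help show whether the out-of-memory error depends on other processes sharing the GPU.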

      Expected Result:
      All tests in TestForeachCUDA should pass

      Actual Result:
      1 test fails with the errors shown above

      Logs:
      Pipeline ID: 2217097191
      CI Artifacts: Available in pipeline artifacts

      Additional Context:
      Test failures identified in automated PyTorch CI run.

      Severity: Medium
      Priority: P3

              rh-ee-sugeorge Subin George
              rh-ee-ktanmay Kumar Tanmay