AI Platform Core Components / AIPCC-8937

[QA][PyTorch UT][CPU] inductor/test_torchinductor_strided_blocks.py - TritonTensorDescriptorTestCUDA failure


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • PyTorch

      Summary:
      One test in TritonTensorDescriptorTestCUDA, test_2d_reduction_multi_kernel_cuda, times out during PyTorch unit test execution on the CPU platform.

      Test Class: inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA
      Number of Failing Tests: 1
      Platform: CPU
      Test Type: Unit Test

      Version Information:

      • PyTorch Commit: 6bdd8c9
      • Branch: main
      • Test Date: 2026-01-14
      • Sprint: Sprint 24

      Failure Pattern:
      Single root cause - test timeout (command exceeded 30 minutes)

      Common Error:

      inductor/test_torchinductor_strided_blocks.py::TritonTensorDescriptorTestCUDA::test_2d_reduction_multi_kernel_cuda Command took >30min, returning 124
      Got exit code 124
      

      Failing Tests:
      1. test_2d_reduction_multi_kernel_cuda

      Steps to Reproduce:
      1. Run the test command (the failure reported here was observed in the CPU run):

         TEST_CONFIG=cpu python3 test/run_test.py -i inductor/test_torchinductor_strided_blocks
         TEST_CONFIG=cuda python3 test/run_test.py -i inductor/test_torchinductor_strided_blocks
         TEST_CONFIG=inductor python3 test/run_test.py -i inductor/test_torchinductor_strided_blocks
         

      2. Observe test timeout after 30 minutes

      Expected Result:
      The test should complete within the 30-minute timeout

      Actual Result:
      The test does not finish within 30 minutes; the command is terminated and returns exit code 124

      Root Cause Analysis:
      The test times out on the CPU platform. The cause has not been confirmed. Likely explanations:

      • The test is designed for CUDA/GPU but is being run on CPU
      • Triton tensor descriptor operations may not be optimized or supported on CPU
      • The test may be stuck in an infinite loop or in a very slow computation on CPU
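
To tell an infinite loop apart from a very slow computation, one option is to dump the Python stack while the test is still running. The sketch below uses the standard-library faulthandler module for that. The names trace_if_slow and slow_section are hypothetical stand-ins and are not part of the PyTorch test suite; slow_section only simulates a long-running test body.

```python
import faulthandler
import tempfile
import time


def slow_section(seconds):
    """Hypothetical stand-in for a test body that runs for a long time."""
    time.sleep(seconds)


def trace_if_slow(fn, watchdog_s, *args):
    """Run fn(*args); if it is still running after watchdog_s seconds,
    capture a traceback of all threads and return it as a string.
    Returns an empty string when fn finishes before the watchdog fires."""
    with tempfile.TemporaryFile(mode="w+") as log:
        # faulthandler writes the dump directly to the file descriptor.
        faulthandler.dump_traceback_later(watchdog_s, file=log)
        try:
            fn(*args)
        finally:
            faulthandler.cancel_dump_traceback_later()
        log.seek(0)
        return log.read()


if __name__ == "__main__":
    # The dump names the frame that was executing when the watchdog fired.
    print(trace_if_slow(slow_section, 0.5, 1.5))
```

If repeated dumps taken some time apart always show the same frame, a hang is more likely than slow progress. That is a heuristic, not proof.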

      Potential Solutions:
      1. Skip this test on the CPU platform (add a platform check)
      2. Investigate why a CUDA-specific test is running on CPU
      3. Add a shorter timeout on the CPU platform so the failure surfaces sooner than 30 minutes
      4. Fix the test to detect and handle a CPU-only environment properly
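
Solution 1 could take roughly the following shape. This is a minimal sketch, not the guard used in test_torchinductor_strided_blocks.py: the helper cuda_available and the class ExampleTensorDescriptorTest are hypothetical, and the test body is a placeholder. The only PyTorch call used is torch.cuda.is_available(), and the helper reports no CUDA when torch is not installed.

```python
import importlib.util
import unittest


def cuda_available() -> bool:
    """Hypothetical helper: True only if torch is importable and sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


class ExampleTensorDescriptorTest(unittest.TestCase):
    """Hypothetical stand-in for the real CUDA test class."""

    @unittest.skipUnless(cuda_available(), "requires a CUDA device")
    def test_2d_reduction_multi_kernel_cuda(self):
        # Placeholder body; the real test lives in the PyTorch test suite.
        self.assertTrue(cuda_available())


def run_example() -> unittest.TestResult:
    """Run the example class and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTensorDescriptorTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    res = run_example()
    print("skipped:", len(res.skipped), "failures:", len(res.failures))
```

On a CPU-only machine the test is reported as skipped instead of running until the timeout. Whether the existing test class already has such a guard, and why it did not take effect, still needs to be checked (solution 2).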

      Additional Context:

      • Note: sGPU ticket AIPCC-8264 exists for the same test class
      • This ticket covers the CPU-specific failure
      • The test class name includes "CUDA", which suggests it should only run on GPU
      • Exit code 124 indicates timeout
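
For reference, exit code 124 is the status GNU timeout returns when a command exceeds its time limit. The sketch below shows how a wrapper could map a timeout to that code. It is an illustration only, not the implementation in test/run_test.py; the function run_with_timeout and the constant TIMEOUT_EXIT_CODE are hypothetical names.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # same value GNU `timeout` returns on expiry


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or 124 if it exceeds timeout_s seconds."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return TIMEOUT_EXIT_CODE


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(sleeper, 0.5))
```

The same wrapper with a much shorter limit could be used to reproduce the timeout locally without waiting the full 30 minutes.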

      Logs:
      Test execution logs: /home/ktanmay/Downloads/Run 1-20260120T060019Z-1-001/Run 1/20260114_024940_commit_6bdd8c9/cpu_tests.log

      Priority: P3

      Labels: pytorch, unittest, cpu, inductor, triton, timeout

              Unassigned
              PyTorch Engineering (pytorch-engineering)