AI Platform Core Components / AIPCC-7989

[QA][PyTorch UT][CPU] test_nn tests are failing because of numerical mismatch (AssertionError)

    • Bug
    • Resolution: Unresolved
    • PyTorch
    • PyTorch Sprint 21, PyTorch Sprint 22, PyTorch Sprint 23

      test_nn tests fail on the main branch with a numerical mismatch: tensor values are not close enough.

      Tests Failing:

      test_partial_flat_weights

      Env details:

      PyTorch version: 2.10.0

      Branch: main

      OS: RHEL 9.6

      CPU: Intel

      Python version: 3.12

      Commit ID: 6de6685797cabc6256df76803f3a5f772d5275a7 (tag: trunk/6de6685797cabc6256df76803f3a5f772d5275a7, origin/main, origin/HEAD)

      Steps to repro:

      1. Log in to H200.
      2. Log in to quay.io: podman login quay.io
      3. Pull the base image: podman pull quay.io/aipcc/pytorch:rhel_cuda_build_without_pins
      4. Run the image and specify the GPU to be used: podman run -it <IMAGE_NAME>
      5. Run the PyTorch UT: TEST_CONFIG=cpu python3 test/run_test.py -i test_nn
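
The last step can also be scripted so the log is captured the same way on every run. This is only a sketch: it assumes it is started from the root of the PyTorch checkout inside the container, and the helper names `build_repro` and `run_repro` are hypothetical. The environment variable and arguments are the ones from the command above.

```python
import os
import subprocess


def build_repro(test_module="test_nn"):
    """Return (env, argv) for the CPU unit-test run from the repro steps."""
    # Copy the current environment and add TEST_CONFIG=cpu, as in the steps.
    env = dict(os.environ, TEST_CONFIG="cpu")
    argv = ["python3", "test/run_test.py", "-i", test_module]
    return env, argv


def run_repro(log_path="test_nn.log"):
    """Run the test module and write combined stdout/stderr to log_path."""
    env, argv = build_repro()
    with open(log_path, "w") as log:
        result = subprocess.run(argv, env=env, stdout=log, stderr=subprocess.STDOUT)
    # A non-zero return code means the test run reported a failure.
    return result.returncode
```

Calling `run_repro()` produces a `test_nn.log` comparable to the attached one.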

      Expected result: all test_nn unit tests pass.

      Actual result: Numerical mismatch - tensor values are not close enough (9 out of 36 elements differ, greatest absolute difference: 3.0120834708213806e-05, greatest relative difference: 0.0030780492816120386, exceeding tolerance of 1e-05 absolute and 1.3e-06 relative)
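
For reference, a minimal sketch of the closeness check behind this message, assuming the comparison follows the rule documented for `torch.testing.assert_close`: |actual - expected| <= atol + rtol * |expected|. The helper name `within_tolerance` and the sample `expected` value are hypothetical. The value 0.0098 is chosen only so that the reported absolute difference gives a relative difference near 0.003; the two reported maxima may in fact come from different elements.

```python
ATOL = 1e-05    # absolute tolerance quoted in the failure message
RTOL = 1.3e-06  # relative tolerance quoted in the failure message


def within_tolerance(actual, expected, atol=ATOL, rtol=RTOL):
    """Element-wise closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


abs_diff = 3.0120834708213806e-05  # greatest absolute difference from the log
expected = 0.0098                  # hypothetical reference value, not from the log
actual = expected + abs_diff

print(within_tolerance(actual, expected))  # is the element inside the tolerance?
print(abs_diff / abs(expected))            # relative difference for this element
```

For values this small the allowed error is dominated by the absolute tolerance, so a difference of about 3e-05 is roughly three times the permitted bound.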

      Logs are attached below

        1. test_nn.log
          1.17 MB
          Nayan Bhushan Kanganahalli Nagabhushana

              rh-ee-nkangana Nayan Bhushan Kanganahalli Nagabhushana