test_nn fails on the main branch with a numerical mismatch: tensor values are not close enough
Failing tests:
test_partial_flat_weights
Env details:
PyTorch version: 2.10.0
Branch: main
OS: RHEL 9.6
CPU: Intel
Python version: 3.12
Commit ID: 6de6685797cabc6256df76803f3a5f772d5275a7 (tag: trunk/6de6685797cabc6256df76803f3a5f772d5275a7, origin/main, origin/HEAD)
Steps to repro:
- Log in to the H200 machine.
- Log in to quay.io: podman login quay.io
- Pull the base image: podman pull quay.io/aipcc/pytorch:rhel_cuda_build_without_pins
- Run the image, specifying the GPU to be used: podman run -it <IMAGE_NAME>
- Run the PyTorch unit tests: TEST_CONFIG=cpu python3 test/run_test.py -i test_nn
Expected result: The unit tests pass.
Actual result: Numerical mismatch. Tensor values are not close enough (9 of 36 elements differ; greatest absolute difference: 3.0120834708213806e-05, greatest relative difference: 0.0030780492816120386, exceeding the tolerances of 1e-05 absolute and 1.3e-06 relative).
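For context on the tolerances in the message, here is a minimal pure-Python sketch of the closeness criterion, assuming the comparison goes through torch.testing.assert_close with its float32 defaults (rtol=1.3e-06, atol=1e-05), which match the reported values. The helper names is_close and min_passing_magnitude are hypothetical and only illustrate the arithmetic; the greatest absolute and greatest relative differences may come from different elements.

```python
# Sketch of the documented criterion: |actual - expected| <= atol + rtol * |expected|
# Assumption: float32 default tolerances, as reported in the failure message.

def is_close(actual, expected, rtol=1.3e-6, atol=1e-5):
    """Return True if the pair satisfies the closeness criterion."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

def min_passing_magnitude(abs_diff, rtol=1.3e-6, atol=1e-5):
    """Smallest |expected| for which a given absolute difference would still pass."""
    return max(0.0, (abs_diff - atol) / rtol)

# Greatest absolute difference copied from the failure message.
reported_abs_diff = 3.0120834708213806e-05

threshold = min_passing_magnitude(reported_abs_diff)
print(f"An absolute difference of {reported_abs_diff:.3e} passes only if "
      f"|expected| >= {threshold:.2f}")
```

Under these assumptions, the element with the greatest absolute difference would pass only if its expected value had a magnitude of roughly 15.5 or more, so the reported mismatch is about three times the absolute tolerance for values near zero.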
Logs are attached below