  1. AI Platform Core Components
  2. AIPCC-4282

base images: remove TORCH_CUDA_ARCH_LIST and PYTORCH_ROCM_ARCH

    • Type: Story
    • Resolution: Done
    • Development Platform
    • AP Unfinished Issues, AP Sprint 12

      I added TORCH_CUDA_ARCH_LIST and PYTORCH_ROCM_ARCH as build args and env variables to the base images in the hope that the information would be useful. I wanted one central place to configure and record the supported GPU architectures.

      The idea turned out to cause more problems than it solved:

      • The wheel builder changes its CUDA arch list more often than base images are released, so the values recorded in the base image go stale.
      • The presence of TORCH_CUDA_ARCH_LIST can slow down vLLM startup, see AIPCC-4016. Some operations and dependencies compile kernels just-in-time. Without TORCH_CUDA_ARCH_LIST, cubins are compiled only for the current GPU arch. With TORCH_CUDA_ARCH_LIST set, code is also compiled for the additional listed archs, which adds compile time.
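The second point can be illustrated with a minimal sketch. It is a simplified model of the behavior described above, not the actual PyTorch or vLLM code. The helper name `jit_target_archs` and its parameters are hypothetical. The sketch assumes the variable holds arch entries separated by spaces or semicolons.

```python
import os
import re


def jit_target_archs(current_arch, env=None):
    """Return the archs a JIT build would target in this simplified model.

    Hypothetical helper for illustration only; real build logic is more involved.
    """
    env = os.environ if env is None else env
    raw = env.get("TORCH_CUDA_ARCH_LIST", "").strip()
    if not raw:
        # Variable unset or empty: compile only for the GPU that is present.
        return [current_arch]
    # Variable set: every listed arch becomes a compile target.
    return [arch for arch in re.split(r"[;\s]+", raw) if arch]


if __name__ == "__main__":
    # Example values only; they are not the arch list used by the wheel builder.
    print(jit_target_archs("9.0", {}))
    print(jit_target_archs("9.0", {"TORCH_CUDA_ARCH_LIST": "8.0 8.6;9.0+PTX"}))
```

In this model, each extra entry in the list is one more compile target at startup, which is why a stale or overly broad value in the base image costs time on every JIT build.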

      Let's remove these env vars.

              cheimes@redhat.com Christian Heimes
              Antonio's Team
