Red Hat Enterprise Linux AI / RHELAI-2284

RHELAI 1.3: DeepSpeed is missing fused_adam op


    • Approved

      To Reproduce

      Steps to reproduce the behavior:

      1. Use the ilab image quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia/instructlab-nvidia:cffec38107333632f9d2eabe640d99002d1659a8
      2. Update the `ilab` script to use the image above
      3. Run SDG (synthetic data generation)
      4. Run training: `ilab train --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/knowledge_train_msgs_2024-11-18T10_45_05.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/skills_15k.jsonl`

      Execution fails with error:

      [rank5]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
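The ImportError above says the JIT-built shared object is absent from the PyTorch extensions cache. A minimal sketch for checking that directly, assuming the cache layout shown in the error message (`<root>/<name>/<name>.so`); the helper name `jit_extension_built` is hypothetical and not part of PyTorch or DeepSpeed:

```python
from pathlib import Path


def jit_extension_built(extensions_root: Path, name: str = "fused_adam") -> bool:
    """Return True if <extensions_root>/<name>/<name>.so exists as a regular file."""
    return (Path(extensions_root) / name / f"{name}.so").is_file()


if __name__ == "__main__":
    # Root taken from the error message above; the py311_cu124 tag depends on
    # the Python and CUDA versions in use, so adjust it for other environments.
    root = Path.home() / ".cache" / "torch_extensions" / "py311_cu124"
    print("fused_adam built:", jit_extension_built(root))
```

If this prints `False` while the `fused_adam` directory itself exists, the build most likely started and failed, which matches the console output in this report.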

      DeepSpeed version:

      (app-root) /$ pip freeze show | grep deepspeed
      deepspeed==0.15.2
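The same version check can be done from Python without parsing `pip` output. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Run inside the same environment (app-root) that the training job uses.
    print("deepspeed:", installed_version("deepspeed"))
```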

      The console also showed the following:

      Creating extension directory /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam...
      Loading checkpoint shards:  33%|███▎      | 1/3 [00:00<00:00,  3.17it/s]Detected CUDA files, patching ldflags
      Emitting ninja build file /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
      /opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
      If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
        warnings.warn(
      Building extension module fused_adam...
      Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
      Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  4.02it/s]
      Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.92it/s]
      Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  4.05it/s]
      Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
      Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  4.02it/s]
      Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
      Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
      Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  4.23it/s]
      Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
      Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
      Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  4.07it/s]
      Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00,  3.99it/s]
      Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
      Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
      [1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
      FAILED: multi_tensor_adam.cuda.o 
      /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
      In file included from /opt/app-root/lib/python3.11/site-packages/torch/include/ATen/cuda/CUDAContext.h:3,
                       from /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:13:
      /opt/app-root/lib/python3.11/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:7:10: fatal error: cusparse.h: No such file or directory
          7 | #include <cusparse.h>
            |          ^~~~~~~~~~~~
      compilation terminated.
      [2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
      ninja: build stopped: subcommand failed.
      [rank6]: Traceback (most recent call last):
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
      [rank6]:     subprocess.run(
      [rank6]:   File "/usr/lib64/python3.11/subprocess.py", line 571, in run
      [rank6]:     raise CalledProcessError(retcode, process.args,
      [rank6]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
      [rank6]: The above exception was the direct cause of the following exception:
      [rank6]: Traceback (most recent call last):
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
      [rank6]:     main(args)
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
      [rank6]:     model, lr_scheduler, optimizer, accelerator = setup_model(
      [rank6]:                                                   ^^^^^^^^^^^^
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
      [rank6]:     optimizer = setup_optimizer(args, model)
      [rank6]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
      [rank6]:     optimizer = FusedAdam(
      [rank6]:                 ^^^^^^^^^^
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
      [rank6]:     fused_adam_cuda = FusedAdamBuilder().load()
      [rank6]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
      [rank6]:     return self.jit_load(verbose)
      [rank6]:            ^^^^^^^^^^^^^^^^^^^^^^
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
      [rank6]:     op_module = load(name=self.name,
      [rank6]:                 ^^^^^^^^^^^^^^^^^^^^
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
      [rank6]:     return _jit_compile(
      [rank6]:            ^^^^^^^^^^^^^
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
      [rank6]:     _write_ninja_file_and_build_library(
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
      [rank6]:     _run_ninja_build(
      [rank6]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
      [rank6]:     raise RuntimeError(message) from e
      [rank6]: RuntimeError: Error building extension 'fused_adam'
      Loading extension module fused_adam...
      Loading extension module fused_adam...
      Loading extension module fused_adam...
      [rank7]: Traceback (most recent call last):
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
      [rank7]:     main(args)
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
      [rank7]:     model, lr_scheduler, optimizer, accelerator = setup_model(
      [rank7]:                                                   ^^^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
      [rank7]:     optimizer = setup_optimizer(args, model)
      [rank7]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
      [rank7]:     optimizer = FusedAdam(
      [rank7]:                 ^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
      [rank7]:     fused_adam_cuda = FusedAdamBuilder().load()
      [rank7]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
      [rank7]:     return self.jit_load(verbose)
      [rank7]:            ^^^^^^^^^^^^^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
      [rank7]:     op_module = load(name=self.name,
      [rank7]:                 ^^^^^^^^^^^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
      [rank7]:     return _jit_compile(
      [rank7]:            ^^^^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1747, in _jit_compile
      [rank7]:     return _import_module_from_library(name, build_directory, is_python_module)
      [rank7]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank7]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
      [rank7]:     module = importlib.util.module_from_spec(spec)
      [rank7]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank7]:   File "<frozen importlib._bootstrap>", line 573, in module_from_spec
      [rank7]:   File "<frozen importlib._bootstrap_external>", line 1234, in create_module
      [rank7]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      [rank7]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
      [rank4]: Traceback (most recent call last):
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
      [rank4]:     main(args)
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
      [rank4]:     model, lr_scheduler, optimizer, accelerator = setup_model(
      [rank4]:                                                   ^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
      [rank4]:     optimizer = setup_optimizer(args, model)
      [rank4]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
      [rank4]:     optimizer = FusedAdam(
      [rank4]:                 ^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
      [rank4]:     fused_adam_cuda = FusedAdamBuilder().load()
      [rank4]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
      [rank4]:     return self.jit_load(verbose)
      [rank4]:            ^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
      [rank4]:     op_module = load(name=self.name,
      [rank4]:                 ^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
      [rank4]:     return _jit_compile(
      [rank4]:            ^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1747, in _jit_compile
      [rank4]:     return _import_module_from_library(name, build_directory, is_python_module)
      [rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
      [rank4]:     module = importlib.util.module_from_spec(spec)
      [rank4]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "<frozen importlib._bootstrap>", line 573, in module_from_spec
      [rank4]:   File "<frozen importlib._bootstrap_external>", line 1234, in create_module
      [rank4]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      [rank4]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
      [rank5]: Traceback (most recent call last):
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
      [rank5]:     main(args)
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
      [rank5]:     model, lr_scheduler, optimizer, accelerator = setup_model(
      [rank5]:                                                   ^^^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
      [rank5]:     optimizer = setup_optimizer(args, model)
      [rank5]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
      [rank5]:     optimizer = FusedAdam(
      [rank5]:                 ^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
      [rank5]:     fused_adam_cuda = FusedAdamBuilder().load()
      [rank5]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
      [rank5]:     return self.jit_load(verbose)
      [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
      [rank5]:     op_module = load(name=self.name,
      [rank5]:                 ^^^^^^^^^^^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
      [rank5]:     return _jit_compile(
      [rank5]:            ^^^^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1747, in _jit_compile
      [rank5]:     return _import_module_from_library(name, build_directory, is_python_module)
      [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
      [rank5]:     module = importlib.util.module_from_spec(spec)
      [rank5]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank5]:   File "<frozen importlib._bootstrap>", line 573, in module_from_spec
      [rank5]:   File "<frozen importlib._bootstrap_external>", line 1234, in create_module
      [rank5]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
      [rank5]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
      Loading extension module fused_adam...
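In the interleaved output above, the only compiler diagnostic is the `cusparse.h` fatal error raised while rank 6 was building the extension; the ImportErrors on ranks 4, 5 and 7 appear to be a downstream symptom of that failed build. The sketch below pulls missing-header diagnostics out of such a log and then looks for each header in a list of include directories. The function names are hypothetical, and the directories in the usage part are copied from the `-isystem` flags of the nvcc command in the log:

```python
import re
from pathlib import Path
from typing import Iterable, List, Optional

# Matches compiler diagnostics such as:
#   .../CUDAContextLight.h:7:10: fatal error: cusparse.h: No such file or directory
MISSING_HEADER = re.compile(r"fatal error: (\S+): No such file or directory")


def missing_headers(log_text: str) -> List[str]:
    """Return header names reported missing, in first-seen order, without duplicates."""
    seen: List[str] = []
    for match in MISSING_HEADER.finditer(log_text):
        header = match.group(1)
        if header not in seen:
            seen.append(header)
    return seen


def find_header(header: str, include_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the path of `header` in the first directory that contains it, else None."""
    for directory in include_dirs:
        candidate = Path(directory) / header
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Include directories as they appear in the nvcc invocation above.
    include_dirs = [
        Path("/opt/app-root/lib64/python3.11/site-packages/torch/include"),
        Path("/usr/local/cuda/include"),
        Path("/usr/include/python3.11"),
    ]
    sample = (
        "CUDAContextLight.h:7:10: fatal error: cusparse.h: "
        "No such file or directory"
    )
    for header in missing_headers(sample):
        print(header, "->", find_header(header, include_dirs))
```

A result of `None` for `cusparse.h` would be consistent with the cuSPARSE development headers not being present in the image, although this sketch only checks the directories it is given.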
      

       

      Expected behavior

      • Training runs to completion without the fused_adam build failure.

              cheimes@redhat.com Christian Heimes
              cvultur@redhat.com Constantin Daniel Vultur