- Bug
- Resolution: Done
- Major
- RHELAI 1.3 GA
- None
To Reproduce
Steps to reproduce the behavior:
- Use ilab image quay.io/redhat-user-workloads/rhel-ai-tenant/instructlab-nvidia/instructlab-nvidia:cffec38107333632f9d2eabe640d99002d1659a8
- Update `ilab` script to use the above image
- Run SDG
- Run training `ilab train --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/knowledge_train_msgs_2024-11-18T10_45_05.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/skills_15k.jsonl`
Execution fails with error:
[rank5]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
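The `ImportError` is probably a secondary symptom: in the console output included in this report, rank 6 fails to build `fused_adam`, and the other ranks then try to load a `fused_adam.so` that was never produced. Below is a minimal Python sketch that reduces a multi-rank log to the last exception reported by each rank, which makes the originating rank easier to spot. The function name and the regular expression are illustrative assumptions, not part of InstructLab or DeepSpeed; only the `[rankN]:` prefix is taken from the log in this report.

```python
import re
from typing import Dict

# Matches entries such as "[rank5]: ImportError: some message".
# The message runs until the next "[rankN]:" marker or the end of the line,
# so the pattern also copes with logs whose line breaks were lost.
_RANK_ERROR = re.compile(
    r"\[rank(\d+)\]: ([A-Za-z_][\w.]*(?:Error|Exception)): (.*?)(?=\s*\[rank\d+\]:|$)",
    re.MULTILINE,
)


def last_error_per_rank(log_text: str) -> Dict[int, str]:
    """Return {rank: "ErrorType: message"}, keeping the last error seen per rank."""
    errors: Dict[int, str] = {}
    for match in _RANK_ERROR.finditer(log_text):
        rank, kind, message = match.groups()
        errors[int(rank)] = f"{kind}: {message.strip()}"
    return errors


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python rank_errors.py train.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for rank, error in sorted(last_error_per_rank(handle.read()).items()):
            print(f"rank {rank}: {error}")
```

A rank whose last error differs from the others (here, a build error rather than an import error) is the likely place to start reading.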
DeepSpeed version:
(app-root) /$ pip freeze show | grep deepspeed
deepspeed==0.15.2
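The same version check can be done from inside the Python environment that training uses, which avoids any doubt about which `pip` is on the path. This is a minimal sketch using only the standard library; the helper name `installed_version` is a hypothetical choice for illustration.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "deepspeed" is the package named in this report.
    print("deepspeed", installed_version("deepspeed"))
```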
The console also showed this:
Creating extension directory /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam...
Loading checkpoint shards: 33%|███▎ | 1/3 [00:00<00:00, 3.17it/s]
Detected CUDA files, patching ldflags
Emitting ninja build file /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/build.ninja...
/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 4.02it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.92it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 4.05it/s]
Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 4.02it/s]
Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 4.23it/s]
Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 4.07it/s]
Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 3.99it/s]
Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
Using /var/home/cloud-user/.cache/torch_extensions/py311_cu124 as PyTorch extensions root...
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
FAILED: multi_tensor_adam.cuda.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_89,code=sm_89 -gencode=arch=compute_89,code=compute_89 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
In file included from /opt/app-root/lib/python3.11/site-packages/torch/include/ATen/cuda/CUDAContext.h:3,
from /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu:13:
/opt/app-root/lib/python3.11/site-packages/torch/include/ATen/cuda/CUDAContextLight.h:7:10: fatal error: cusparse.h: No such file or directory
7 | #include <cusparse.h>
| ^~~~~~~~~~~~
compilation terminated.
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/includes -I/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/torch/csrc/api/include -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/TH -isystem /opt/app-root/lib64/python3.11/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.11 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
ninja: build stopped: subcommand failed.
[rank6]: Traceback (most recent call last):
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
[rank6]: subprocess.run(
[rank6]: File "/usr/lib64/python3.11/subprocess.py", line 571, in run
[rank6]: raise CalledProcessError(retcode, process.args,
[rank6]: subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
[rank6]: The above exception was the direct cause of the following exception:
[rank6]: Traceback (most recent call last):
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
[rank6]: main(args)
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
[rank6]: model, lr_scheduler, optimizer, accelerator = setup_model(
[rank6]: ^^^^^^^^^^^^
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
[rank6]: optimizer = setup_optimizer(args, model)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
[rank6]: optimizer = FusedAdam(
[rank6]: ^^^^^^^^^^
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
[rank6]: fused_adam_cuda = FusedAdamBuilder().load()
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank6]: return self.jit_load(verbose)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank6]: op_module = load(name=self.name,
[rank6]: ^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
[rank6]: return _jit_compile(
[rank6]: ^^^^^^^^^^^^^
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
[rank6]: _write_ninja_file_and_build_library(
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
[rank6]: _run_ninja_build(
[rank6]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
[rank6]: raise RuntimeError(message) from e
[rank6]: RuntimeError: Error building extension 'fused_adam'
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
[rank7]: Traceback (most recent call last):
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
[rank7]: main(args)
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
[rank7]: model, lr_scheduler, optimizer, accelerator = setup_model(
[rank7]: ^^^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
[rank7]: optimizer = setup_optimizer(args, model)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
[rank7]: optimizer = FusedAdam(
[rank7]: ^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
[rank7]: fused_adam_cuda = FusedAdamBuilder().load()
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank7]: return self.jit_load(verbose)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank7]: op_module = load(name=self.name,
[rank7]: ^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
[rank7]: return _jit_compile(
[rank7]: ^^^^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1747, in _jit_compile
[rank7]: return _import_module_from_library(name, build_directory, is_python_module)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
[rank7]: module = importlib.util.module_from_spec(spec)
[rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank7]: File "<frozen importlib._bootstrap>", line 573, in module_from_spec
[rank7]: File "<frozen importlib._bootstrap_external>", line 1234, in create_module
[rank7]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank7]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[rank4]: Traceback (most recent call last):
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
[rank4]: main(args)
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
[rank4]: model, lr_scheduler, optimizer, accelerator = setup_model(
[rank4]: ^^^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
[rank4]: optimizer = setup_optimizer(args, model)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
[rank4]: optimizer = FusedAdam(
[rank4]: ^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
[rank4]: fused_adam_cuda = FusedAdamBuilder().load()
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank4]: return self.jit_load(verbose)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank4]: op_module = load(name=self.name,
[rank4]: ^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
[rank4]: return _jit_compile(
[rank4]: ^^^^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1747, in _jit_compile
[rank4]: return _import_module_from_library(name, build_directory, is_python_module)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
[rank4]: module = importlib.util.module_from_spec(spec)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "<frozen importlib._bootstrap>", line 573, in module_from_spec
[rank4]: File "<frozen importlib._bootstrap_external>", line 1234, in create_module
[rank4]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank4]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
[rank5]: Traceback (most recent call last):
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 917, in <module>
[rank5]: main(args)
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 598, in main
[rank5]: model, lr_scheduler, optimizer, accelerator = setup_model(
[rank5]: ^^^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 224, in setup_model
[rank5]: optimizer = setup_optimizer(args, model)
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 76, in setup_optimizer
[rank5]: optimizer = FusedAdam(
[rank5]: ^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
[rank5]: fused_adam_cuda = FusedAdamBuilder().load()
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
[rank5]: return self.jit_load(verbose)
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
[rank5]: op_module = load(name=self.name,
[rank5]: ^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1312, in load
[rank5]: return _jit_compile(
[rank5]: ^^^^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 1747, in _jit_compile
[rank5]: return _import_module_from_library(name, build_directory, is_python_module)
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "/opt/app-root/lib64/python3.11/site-packages/torch/utils/cpp_extension.py", line 2141, in _import_module_from_library
[rank5]: module = importlib.util.module_from_spec(spec)
[rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: File "<frozen importlib._bootstrap>", line 573, in module_from_spec
[rank5]: File "<frozen importlib._bootstrap_external>", line 1234, in create_module
[rank5]: File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank5]: ImportError: /var/home/cloud-user/.cache/torch_extensions/py311_cu124/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory
Loading extension module fused_adam...
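The `nvcc` step in the log above stops at `fatal error: cusparse.h: No such file or directory`, with `/usr/local/cuda/include` passed as the CUDA include path. A quick way to confirm this inside the container is to check whether the header is present before starting training. The following is a minimal sketch under that assumption; the helper name `missing_headers` is hypothetical, and only `cusparse.h` is named in the log, so any other header added to the list would be a guess.

```python
from pathlib import Path
from typing import Iterable, List


def missing_headers(include_dirs: Iterable[str], headers: Iterable[str]) -> List[str]:
    """Return the headers that are not found as files in any of the include dirs."""
    dirs = [Path(d) for d in include_dirs]
    return [h for h in headers if not any((d / h).is_file() for d in dirs)]


if __name__ == "__main__":
    # Include path taken from the "-isystem /usr/local/cuda/include" flag in the log.
    missing = missing_headers(["/usr/local/cuda/include"], ["cusparse.h"])
    if missing:
        print("Missing CUDA headers:", ", ".join(missing))
    else:
        print("All checked CUDA headers are present.")
```

If the header is reported missing, the JIT build of `fused_adam` would be expected to fail in the same way as shown in the log.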
Expected behavior
- Training runs to completion without the `fused_adam` build or import errors.
Screenshots
- Attached Image
Device Info (please complete the following information):
- Hardware Specs: [e.g. Apple M2 Pro Chip, 16 GB Memory, etc.]
- OS Version: [e.g. Mac OS 14.4.1, Fedora Linux 40]
- Python Version: [output of `python --version`]
- InstructLab Version: [output of `ilab --version`]