Type: Bug
Resolution: Done
Priority: Critical
Fix Version: rhoai-3.3
Description of problem:
The 3.3 jobs notebook-cuda13.0-ubi9-aarch64 and notebook-cuda13.0-ubi9-x86_64 fail because the tensorflow package does not compile: tensorflow 2.20.0 does not support CUDA 13.0. Its kernel-generation step (hlo_to_kernel) invokes fatbinary with the '-image' option, which the CUDA 13.0 fatbinary no longer accepts (see the log under Actual results).
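A minimal sketch of a pre-flight guard that could fail fast before launching the multi-hour bazel build. The helper name `cuda_supported` is hypothetical (not part of the pipeline), and it only encodes the single incompatibility reported here: tensorflow 2.20.x against CUDA 13.x.

```python
def cuda_supported(tf_version: str, cuda_version: str) -> bool:
    """Return False for the known-bad tensorflow/CUDA pairing from this bug.

    tensorflow 2.20.0 builds against CUDA 12.x; its hlo_to_kernel step
    passes fatbinary the '-image' option, which CUDA 13.0's fatbinary
    rejects ("Unknown option '-image'"). Other combinations are not
    checked here and default to True.
    """
    cuda_major = int(cuda_version.split(".")[0])
    if tf_version.startswith("2.20.") and cuda_major >= 13:
        return False
    return True
```

Running such a check at pipeline start would turn a ~33-minute build failure into an immediate, clearly labeled configuration error.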
Version numbers (base image, wheels, builder, etc):
Builder: v27.2.0
tensorflow: 2.20.0
Steps to Reproduce:
https://gitlab.com/redhat/rhel-ai/rhai/pipeline/-/jobs/13015359572
https://gitlab.com/redhat/rhel-ai/rhai/pipeline/-/jobs/13014742985
Actual results:
tensorflow-2.20.0: ERROR: /mnt/work-dir/tensorflow-2.20.0/tensorflow-2.20.0/tensorflow/core/kernels/mlir_generated/BUILD:948:19: Generating kernel '//tensorflow/core/kernels/mlir_generated:less_gpu_less_kernels_gpu_ui64_i1_kernel_generator' failed: (Exit 1): hlo_to_kernel failed: error executing compile command (from target //tensorflow/core/kernels/mlir_generated:less_gpu_less_kernels_gpu_ui64_i1_kernel_generator) bazel-out/k8-opt-exec-ST-0465588ec812/bin/tensorflow/compiler/mlir/tools/kernel_gen/hlo_to_kernel '--tile_sizes=1024' '--host-triple=x86_64-unknown-linux-gnu' ... (remaining 6 arguments skipped)
tensorflow-2.20.0: 2026-02-06 10:27:06.980291: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
tensorflow-2.20.0: 2026-02-06 10:27:06.986712: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
tensorflow-2.20.0: <unknown>:0: error: fatbinary exited with non-zero error code 256, output: fatbinary fatal : Unknown option '-image'
tensorflow-2.20.0:
tensorflow-2.20.0: <unknown>:0: note: see current operation:
tensorflow-2.20.0: "gpu.module"() <{sym_name = "Less_GPU_DT_UINT64_DT_BOOL_kernel"}> ({
tensorflow-2.20.0: "llvm.func"() <{CConv = #llvm.cconv<ccc>, arg_attrs = [{}, {}, {llvm.align = 16 : index}, {}, {}, {llvm.align = 16 : index}, {}, {}, {}, {}, {llvm.align = 16 : index, llvm.noalias}, {}, {}, {}], function_type = !llvm.func<void (i32, ptr, ptr, i32, ptr, ptr, i32, i32, i32, ptr, ptr, i32, i32, i32)>, linkage = #llvm.linkage<external>, sym_name = "Less_GPU_DT_UINT64_DT_BOOL_kernel", visibility_ = 0 : i64}> ({
tensorflow-2.20.0: ^bb0(%arg0: i32, %arg1: !llvm.ptr, %arg2: !llvm.ptr, %arg3: i32, %arg4: !llvm.ptr, %arg5: !llvm.ptr, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: !llvm.ptr, %arg10: !llvm.ptr, %arg11: i32, %arg12: i32, %arg13: i32):
tensorflow-2.20.0: %0 = "llvm.mlir.constant"() <{value = 0 : index}> : () -> i32
tensorflow-2.20.0: %1 = "llvm.mlir.constant"() <{value = 1 : index}> : () -> i32
tensorflow-2.20.0: %2 = "llvm.mlir.constant"() <{value = 1024 : index}> : () -> i32
tensorflow-2.20.0: %3 = "llvm.mlir.constant"() <{value = -1024 : index}> : () -> i32
tensorflow-2.20.0: %4 = "nvvm.read.ptx.sreg.ctaid.x"() : () -> i32
tensorflow-2.20.0: %5 = "nvvm.read.ptx.sreg.tid.x"() <{range = #llvm.constant_range<i32, 0, 1024>}> : () -> i32
tensorflow-2.20.0: %6 = "llvm.mul"(%4, %2) <{overflowFlags = 1 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0: %7 = "llvm.mul"(%4, %3) <{overflowFlags = 1 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0: %8 = "llvm.add"(%arg0, %7) <{overflowFlags = 0 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0: %9 = "llvm.intr.smin"(%8, %2) : (i32, i32) -> i32
tensorflow-2.20.0: %10 = "llvm.icmp"(%5, %9) <{predicate = 2 : i64}> : (i32, i32) -> i1
tensorflow-2.20.0: "llvm.cond_br"(%10)[^bb1, ^bb2] <{operandSegmentSizes = array<i32: 1, 0, 0>}> : (i1) -> ()
tensorflow-2.20.0: ^bb1: // pred: ^bb0
tensorflow-2.20.0: %11 = "llvm.add"(%5, %6) <{overflowFlags = 0 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0: %12 = "llvm.load"(%arg2) <{ordering = 0 : i64}> : (!llvm.ptr) -> i64
tensorflow-2.20.0: %13 = "llvm.getelementptr"(%arg5, %11) <{elem_type = i64, noWrapFlags = 7 : i32, rawConstantIndices = array<i32: -2147483648>}> : (!llvm.ptr, i32) -> !llvm.ptr
tensorflow-2.20.0: %14 = "llvm.load"(%13) <{ordering = 0 : i64}> : (!llvm.ptr) -> i64
tensorflow-2.20.0: %15 = "llvm.icmp"(%12, %14) <{predicate = 6 : i64}> : (i64, i64) -> i1
tensorflow-2.20.0: %16 = "llvm.getelementptr"(%arg10, %11) <{elem_type = i1, noWrapFlags = 7 : i32, rawConstantIndices = array<i32: -2147483648>}> : (!llvm.ptr, i32) -> !llvm.ptr
tensorflow-2.20.0: "llvm.store"(%15, %16) <{ordering = 0 : i64}> : (i1, !llvm.ptr) -> ()
tensorflow-2.20.0: "llvm.br"()[^bb2] : () -> ()
tensorflow-2.20.0: ^bb2: // 2 preds: ^bb0, ^bb1
tensorflow-2.20.0: "llvm.return"() : () -> ()
tensorflow-2.20.0: }) {gpu.kernel, gpu.known_block_size = array<i32: 1024, 1, 1>, nvvm.kernel, nvvm.maxntid = array<i32: 1024, 1, 1>} : () -> ()
tensorflow-2.20.0: }) {dlti.dl_spec = #dlti.dl_spec<index = 32 : i32>} : () -> ()
tensorflow-2.20.0: <unknown>:0: error: fatbinary exited with non-zero error code 256, output: fatbinary fatal : Unknown option '-image'
tensorflow-2.20.0:
tensorflow-2.20.0: <unknown>:0: note: see current operation:
...
tensorflow-2.20.0: 2026-02-06 10:27:07.367504: E tensorflow/compiler/mlir/tools/kernel_gen/hlo_to_kernel.cc:246] INTERNAL: Generating device code failed.
tensorflow-2.20.0: [29,613 / 35,356] [Prepa] Compiling xla/hlo/translate/hlo_to_mhlo/hlo_to_mlir_hlo.cc
tensorflow-2.20.0: Target //tensorflow/tools/pip_package:wheel failed to build
tensorflow-2.20.0: Use --verbose_failures to see the command lines of failed build steps.
tensorflow-2.20.0: INFO: Elapsed time: 1970.094s, Critical Path: 229.50s
tensorflow-2.20.0: INFO: 29613 processes: 6497 internal, 23116 local.
tensorflow-2.20.0: ERROR: Build did NOT complete successfully
52%|██████████████████▎ | 209/399 [38:05<6:46:40, 128.42s/pkg]
10:27:08 ERROR tensorflow-2.20.0: Failed to build tensorflow==2.20.0: Command '['/opt/app-root/lib64/python3.12/site-packages/fromager/run_network_isolation.sh', 'bazel', 'build', '--config', 'cuda', '--config', 'cuda_wheel', '--repo_env=USE_PYWRAP_RULES=1', '--repo_env=TF_VERSION=2.20.0', '--repo_env=ML_WHEEL_TYPE=custom', '--repo_env=ML_WHEEL_VERSION_SUFFIX=+redhat', '--repo_env=WHEEL_NAME=tensorflow', '//tensorflow/tools/pip_package:wheel']' returned non-zero exit status 1.
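For triage, the failing bazel invocation from the log can be reconstructed with `--verbose_failures` added, as the log itself suggests, to see the full fatbinary command line of the failed step. The helper name `verbose_build_cmd` is hypothetical; the flags are copied from the fromager command above.

```python
def verbose_build_cmd(tf_version: str = "2.20.0") -> list[str]:
    """Rebuild the failing bazel command from the log, plus --verbose_failures."""
    return [
        "bazel", "build",
        "--config", "cuda",
        "--config", "cuda_wheel",
        "--repo_env=USE_PYWRAP_RULES=1",
        f"--repo_env=TF_VERSION={tf_version}",
        "--repo_env=ML_WHEEL_TYPE=custom",
        "--repo_env=ML_WHEEL_VERSION_SUFFIX=+redhat",
        "--repo_env=WHEEL_NAME=tensorflow",
        "--verbose_failures",  # the one addition: print command lines of failed steps
        "//tensorflow/tools/pip_package:wheel",
    ]
```

Passing this list to the same fromager network-isolation wrapper would reproduce the failure with the exact fatbinary arguments visible.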
Expected results:
Both jobs build the tensorflow wheel and the pipeline completes successfully.
Additional info:
- is related to: AIPCC-10079 "notebook-cpu-ubi9-x86_64 build fails with missing tensorflow-cpu" (Closed)