
Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: rhoai-3.3
Component/s: Accelerator Enablement
Labels:
- blocker

Blocked: False
Blocked Reason: None
Ready: False
Epic Link: RHAI 3.3 bugs
Intelligence Requested:
Market:
Sprint: AIPCC Accelerators 25, AIPCC Accelerators 26
Release Blocker: Proposed
Target Version: rhoai-3.3
SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Description of problem:

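The 3.3 jobs notebook-cuda13.0-ubi9-aarch64 and notebook-cuda13.0-ubi9-x86_64 fail to complete because the tensorflow package cannot be compiled. tensorflow-2.20.0 does not support CUDA 13.0: during kernel generation the build calls fatbinary with the '-image' option, which fatbinary rejects as unknown (see the log under "Actual results").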

Version numbers (base image, wheels, builder, etc):

Builder v27.2.0
tensorflow-2.20.0

Steps to Reproduce:

Run either of the two jobs named above in the 3.3 pipeline. Failing runs:

https://gitlab.com/redhat/rhel-ai/rhai/pipeline/-/jobs/13015359572

https://gitlab.com/redhat/rhel-ai/rhai/pipeline/-/jobs/13014742985

Actual results:

tensorflow-2.20.0: ERROR: /mnt/work-dir/tensorflow-2.20.0/tensorflow-2.20.0/tensorflow/core/kernels/mlir_generated/BUILD:948:19: Generating kernel '//tensorflow/core/kernels/mlir_generated:less_gpu_less_kernels_gpu_ui64_i1_kernel_generator' failed: (Exit 1): hlo_to_kernel failed: error executing compile command (from target //tensorflow/core/kernels/mlir_generated:less_gpu_less_kernels_gpu_ui64_i1_kernel_generator) bazel-out/k8-opt-exec-ST-0465588ec812/bin/tensorflow/compiler/mlir/tools/kernel_gen/hlo_to_kernel '--tile_sizes=1024' '--host-triple=x86_64-unknown-linux-gnu' ... (remaining 6 arguments skipped)
tensorflow-2.20.0: 2026-02-06 10:27:06.980291: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
tensorflow-2.20.0: 2026-02-06 10:27:06.986712: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
tensorflow-2.20.0: <unknown>:0: error: fatbinary exited with non-zero error code 256, output: fatbinary fatal   : Unknown option '-image'
tensorflow-2.20.0: 
tensorflow-2.20.0: <unknown>:0: note: see current operation: 
tensorflow-2.20.0: "gpu.module"() <{sym_name = "Less_GPU_DT_UINT64_DT_BOOL_kernel"}> ({
tensorflow-2.20.0:   "llvm.func"() <{CConv = #llvm.cconv<ccc>, arg_attrs = [{}, {}, {llvm.align = 16 : index}, {}, {}, {llvm.align = 16 : index}, {}, {}, {}, {}, {llvm.align = 16 : index, llvm.noalias}, {}, {}, {}], function_type = !llvm.func<void (i32, ptr, ptr, i32, ptr, ptr, i32, i32, i32, ptr, ptr, i32, i32, i32)>, linkage = #llvm.linkage<external>, sym_name = "Less_GPU_DT_UINT64_DT_BOOL_kernel", visibility_ = 0 : i64}> ({
tensorflow-2.20.0:   ^bb0(%arg0: i32, %arg1: !llvm.ptr, %arg2: !llvm.ptr, %arg3: i32, %arg4: !llvm.ptr, %arg5: !llvm.ptr, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: !llvm.ptr, %arg10: !llvm.ptr, %arg11: i32, %arg12: i32, %arg13: i32):
tensorflow-2.20.0:     %0 = "llvm.mlir.constant"() <{value = 0 : index}> : () -> i32
tensorflow-2.20.0:     %1 = "llvm.mlir.constant"() <{value = 1 : index}> : () -> i32
tensorflow-2.20.0:     %2 = "llvm.mlir.constant"() <{value = 1024 : index}> : () -> i32
tensorflow-2.20.0:     %3 = "llvm.mlir.constant"() <{value = -1024 : index}> : () -> i32
tensorflow-2.20.0:     %4 = "nvvm.read.ptx.sreg.ctaid.x"() : () -> i32
tensorflow-2.20.0:     %5 = "nvvm.read.ptx.sreg.tid.x"() <{range = #llvm.constant_range<i32, 0, 1024>}> : () -> i32
tensorflow-2.20.0:     %6 = "llvm.mul"(%4, %2) <{overflowFlags = 1 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0:     %7 = "llvm.mul"(%4, %3) <{overflowFlags = 1 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0:     %8 = "llvm.add"(%arg0, %7) <{overflowFlags = 0 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0:     %9 = "llvm.intr.smin"(%8, %2) : (i32, i32) -> i32
tensorflow-2.20.0:     %10 = "llvm.icmp"(%5, %9) <{predicate = 2 : i64}> : (i32, i32) -> i1
tensorflow-2.20.0:     "llvm.cond_br"(%10)[^bb1, ^bb2] <{operandSegmentSizes = array<i32: 1, 0, 0>}> : (i1) -> ()
tensorflow-2.20.0:   ^bb1:  // pred: ^bb0
tensorflow-2.20.0:     %11 = "llvm.add"(%5, %6) <{overflowFlags = 0 : i32}> : (i32, i32) -> i32
tensorflow-2.20.0:     %12 = "llvm.load"(%arg2) <{ordering = 0 : i64}> : (!llvm.ptr) -> i64
tensorflow-2.20.0:     %13 = "llvm.getelementptr"(%arg5, %11) <{elem_type = i64, noWrapFlags = 7 : i32, rawConstantIndices = array<i32: -2147483648>}> : (!llvm.ptr, i32) -> !llvm.ptr
tensorflow-2.20.0:     %14 = "llvm.load"(%13) <{ordering = 0 : i64}> : (!llvm.ptr) -> i64
tensorflow-2.20.0:     %15 = "llvm.icmp"(%12, %14) <{predicate = 6 : i64}> : (i64, i64) -> i1
tensorflow-2.20.0:     %16 = "llvm.getelementptr"(%arg10, %11) <{elem_type = i1, noWrapFlags = 7 : i32, rawConstantIndices = array<i32: -2147483648>}> : (!llvm.ptr, i32) -> !llvm.ptr
tensorflow-2.20.0:     "llvm.store"(%15, %16) <{ordering = 0 : i64}> : (i1, !llvm.ptr) -> ()
tensorflow-2.20.0:     "llvm.br"()[^bb2] : () -> ()
tensorflow-2.20.0:   ^bb2:  // 2 preds: ^bb0, ^bb1
tensorflow-2.20.0:     "llvm.return"() : () -> ()
tensorflow-2.20.0:   }) {gpu.kernel, gpu.known_block_size = array<i32: 1024, 1, 1>, nvvm.kernel, nvvm.maxntid = array<i32: 1024, 1, 1>} : () -> ()
tensorflow-2.20.0: }) {dlti.dl_spec = #dlti.dl_spec<index = 32 : i32>} : () -> ()
tensorflow-2.20.0: <unknown>:0: error: fatbinary exited with non-zero error code 256, output: fatbinary fatal   : Unknown option '-image'
tensorflow-2.20.0: 
tensorflow-2.20.0: <unknown>:0: note: see current operation: 
...
tensorflow-2.20.0: 2026-02-06 10:27:07.367504: E tensorflow/compiler/mlir/tools/kernel_gen/hlo_to_kernel.cc:246] INTERNAL: Generating device code failed.
tensorflow-2.20.0: [29,613 / 35,356] [Prepa] Compiling xla/hlo/translate/hlo_to_mhlo/hlo_to_mlir_hlo.cc
tensorflow-2.20.0: Target //tensorflow/tools/pip_package:wheel failed to build
tensorflow-2.20.0: Use --verbose_failures to see the command lines of failed build steps.
tensorflow-2.20.0: INFO: Elapsed time: 1970.094s, Critical Path: 229.50s
tensorflow-2.20.0: INFO: 29613 processes: 6497 internal, 23116 local.
tensorflow-2.20.0: ERROR: Build did NOT complete successfully
 52%|██████████████████▎                | 209/399 [38:05<6:46:40, 128.42s/pkg]
10:27:08 ERROR tensorflow-2.20.0: Failed to build tensorflow==2.20.0: Command '['/opt/app-root/lib64/python3.12/site-packages/fromager/run_network_isolation.sh', 'bazel', 'build', '--config', 'cuda', '--config', 'cuda_wheel', '--repo_env=USE_PYWRAP_RULES=1', '--repo_env=TF_VERSION=2.20.0', '--repo_env=ML_WHEEL_TYPE=custom', '--repo_env=ML_WHEEL_VERSION_SUFFIX=+redhat', '--repo_env=WHEEL_NAME=tensorflow', '//tensorflow/tools/pip_package:wheel']' returned non-zero exit status 1.
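
For triage, the failure signature in the log above can be detected mechanically. The following is a minimal sketch, not part of the pipeline: the function name and the sample text are illustrative assumptions, and the pattern is based only on the "fatbinary fatal   : Unknown option" lines shown in this ticket.

```python
import re

# Pattern for the fatbinary failure seen in this ticket's log.
# Assumption: other fatbinary failures use the same "Unknown option '<opt>'" wording.
FATBINARY_UNKNOWN_OPTION = re.compile(
    r"fatbinary fatal\s*:\s*Unknown option '([^']+)'"
)


def find_unknown_fatbinary_options(log_text: str) -> list[str]:
    """Return every option fatbinary rejected, in order of appearance."""
    return FATBINARY_UNKNOWN_OPTION.findall(log_text)


# Two lines copied from the log above, used as sample input.
sample = (
    "tensorflow-2.20.0: <unknown>:0: error: fatbinary exited with non-zero "
    "error code 256, output: fatbinary fatal   : Unknown option '-image'\n"
    "tensorflow-2.20.0: ERROR: Build did NOT complete successfully\n"
)

print(find_unknown_fatbinary_options(sample))
```

A non-empty result on a job log suggests the same toolchain mismatch rather than an unrelated build failure.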

Expected results:

Both jobs complete successfully and the pipeline does not fail.

Additional info:

is related to

AIPCC-10079 notebook-cpu-ubi9-x86_64 build fails with missing tensorflow-cpu (Closed)

mentioned on

Merge request - AIPCC-10080: Disable TensorFlow build for CUDA 13

Merge request - AIPCC-10087, AIPCC-10080: Downgrade Builder to 26-maint branch, disable TensorFlow for CUDA 13 [3.3]

Solved by commit a1db45a4d819a98d2036261f17619e55eb5393fa.
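
The merge requests above disable the TensorFlow build for CUDA 13, but the actual change is not shown in this ticket. Purely as an illustration of that kind of rule, here is a hypothetical sketch. The table, the function name, and the variant strings are assumptions and do not reflect the real pipeline configuration.

```python
# Hypothetical exclusion table: package name -> variant prefixes to skip.
# Assumption: variants are identified by strings such as "cuda13.0-ubi9-x86_64".
SKIPPED_BUILDS: dict[str, tuple[str, ...]] = {
    "tensorflow": ("cuda13.",),
}


def should_build(package: str, variant: str) -> bool:
    """Return False when the package is disabled for the given variant."""
    prefixes = SKIPPED_BUILDS.get(package, ())
    # str.startswith accepts a tuple; an empty tuple never matches.
    return not variant.startswith(prefixes)


for variant in ("cuda13.0-ubi9-x86_64", "cuda13.0-ubi9-aarch64", "cpu-ubi9-x86_64"):
    print(variant, should_build("tensorflow", variant))
```

Keying the rule on a variant prefix rather than a full job name would let one entry cover both the aarch64 and x86_64 jobs.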
