AI Platform Core Components / AIPCC-10080

notebook-cuda13.0-ubi9 jobs fail to build tensorflow package

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • None
    • rhoai-3.3
    • Accelerator Enablement
    • False
    • False
    • AIPCC Accelerators 25, AIPCC Accelerators 26
    • Proposed

      Description of problem:

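      The 3.3 jobs notebook-cuda13.0-ubi9-aarch64 and notebook-cuda13.0-ubi9-x86_64 fail because they cannot build the tensorflow package. tensorflow-2.20.0 does not support CUDA 13.0. In the log below, kernel generation (hlo_to_kernel) fails because fatbinary rejects the '-image' option as unknown.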

      Version numbers (base image, wheels, builder, etc):

      Builder v27.2.0
      tensorflow-2.20.0

      Steps to Reproduce:

      https://gitlab.com/redhat/rhel-ai/rhai/pipeline/-/jobs/13015359572

      https://gitlab.com/redhat/rhel-ai/rhai/pipeline/-/jobs/13014742985

      Actual results:

      tensorflow-2.20.0: ERROR: /mnt/work-dir/tensorflow-2.20.0/tensorflow-2.20.0/tensorflow/core/kernels/mlir_generated/BUILD:948:19: Generating kernel '//tensorflow/core/kernels/mlir_generated:less_gpu_less_kernels_gpu_ui64_i1_kernel_generator' failed: (Exit 1): hlo_to_kernel failed: error executing compile command (from target //tensorflow/core/kernels/mlir_generated:less_gpu_less_kernels_gpu_ui64_i1_kernel_generator) bazel-out/k8-opt-exec-ST-0465588ec812/bin/tensorflow/compiler/mlir/tools/kernel_gen/hlo_to_kernel '--tile_sizes=1024' '--host-triple=x86_64-unknown-linux-gnu' ... (remaining 6 arguments skipped)
      tensorflow-2.20.0: 2026-02-06 10:27:06.980291: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
      tensorflow-2.20.0: 2026-02-06 10:27:06.986712: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
      tensorflow-2.20.0: <unknown>:0: error: fatbinary exited with non-zero error code 256, output: fatbinary fatal   : Unknown option '-image'
      tensorflow-2.20.0: 
      tensorflow-2.20.0: <unknown>:0: note: see current operation: 
      tensorflow-2.20.0: "gpu.module"() <{sym_name = "Less_GPU_DT_UINT64_DT_BOOL_kernel"}> ({
      tensorflow-2.20.0:   "llvm.func"() <{CConv = #llvm.cconv<ccc>, arg_attrs = [{}, {}, {llvm.align = 16 : index}, {}, {}, {llvm.align = 16 : index}, {}, {}, {}, {}, {llvm.align = 16 : index, llvm.noalias}, {}, {}, {}], function_type = !llvm.func<void (i32, ptr, ptr, i32, ptr, ptr, i32, i32, i32, ptr, ptr, i32, i32, i32)>, linkage = #llvm.linkage<external>, sym_name = "Less_GPU_DT_UINT64_DT_BOOL_kernel", visibility_ = 0 : i64}> ({
      tensorflow-2.20.0:   ^bb0(%arg0: i32, %arg1: !llvm.ptr, %arg2: !llvm.ptr, %arg3: i32, %arg4: !llvm.ptr, %arg5: !llvm.ptr, %arg6: i32, %arg7: i32, %arg8: i32, %arg9: !llvm.ptr, %arg10: !llvm.ptr, %arg11: i32, %arg12: i32, %arg13: i32):
      tensorflow-2.20.0:     %0 = "llvm.mlir.constant"() <{value = 0 : index}> : () -> i32
      tensorflow-2.20.0:     %1 = "llvm.mlir.constant"() <{value = 1 : index}> : () -> i32
      tensorflow-2.20.0:     %2 = "llvm.mlir.constant"() <{value = 1024 : index}> : () -> i32
      tensorflow-2.20.0:     %3 = "llvm.mlir.constant"() <{value = -1024 : index}> : () -> i32
      tensorflow-2.20.0:     %4 = "nvvm.read.ptx.sreg.ctaid.x"() : () -> i32
      tensorflow-2.20.0:     %5 = "nvvm.read.ptx.sreg.tid.x"() <{range = #llvm.constant_range<i32, 0, 1024>}> : () -> i32
      tensorflow-2.20.0:     %6 = "llvm.mul"(%4, %2) <{overflowFlags = 1 : i32}> : (i32, i32) -> i32
      tensorflow-2.20.0:     %7 = "llvm.mul"(%4, %3) <{overflowFlags = 1 : i32}> : (i32, i32) -> i32
      tensorflow-2.20.0:     %8 = "llvm.add"(%arg0, %7) <{overflowFlags = 0 : i32}> : (i32, i32) -> i32
      tensorflow-2.20.0:     %9 = "llvm.intr.smin"(%8, %2) : (i32, i32) -> i32
      tensorflow-2.20.0:     %10 = "llvm.icmp"(%5, %9) <{predicate = 2 : i64}> : (i32, i32) -> i1
      tensorflow-2.20.0:     "llvm.cond_br"(%10)[^bb1, ^bb2] <{operandSegmentSizes = array<i32: 1, 0, 0>}> : (i1) -> ()
      tensorflow-2.20.0:   ^bb1:  // pred: ^bb0
      tensorflow-2.20.0:     %11 = "llvm.add"(%5, %6) <{overflowFlags = 0 : i32}> : (i32, i32) -> i32
      tensorflow-2.20.0:     %12 = "llvm.load"(%arg2) <{ordering = 0 : i64}> : (!llvm.ptr) -> i64
      tensorflow-2.20.0:     %13 = "llvm.getelementptr"(%arg5, %11) <{elem_type = i64, noWrapFlags = 7 : i32, rawConstantIndices = array<i32: -2147483648>}> : (!llvm.ptr, i32) -> !llvm.ptr
      tensorflow-2.20.0:     %14 = "llvm.load"(%13) <{ordering = 0 : i64}> : (!llvm.ptr) -> i64
      tensorflow-2.20.0:     %15 = "llvm.icmp"(%12, %14) <{predicate = 6 : i64}> : (i64, i64) -> i1
      tensorflow-2.20.0:     %16 = "llvm.getelementptr"(%arg10, %11) <{elem_type = i1, noWrapFlags = 7 : i32, rawConstantIndices = array<i32: -2147483648>}> : (!llvm.ptr, i32) -> !llvm.ptr
      tensorflow-2.20.0:     "llvm.store"(%15, %16) <{ordering = 0 : i64}> : (i1, !llvm.ptr) -> ()
      tensorflow-2.20.0:     "llvm.br"()[^bb2] : () -> ()
      tensorflow-2.20.0:   ^bb2:  // 2 preds: ^bb0, ^bb1
      tensorflow-2.20.0:     "llvm.return"() : () -> ()
      tensorflow-2.20.0:   }) {gpu.kernel, gpu.known_block_size = array<i32: 1024, 1, 1>, nvvm.kernel, nvvm.maxntid = array<i32: 1024, 1, 1>} : () -> ()
      tensorflow-2.20.0: }) {dlti.dl_spec = #dlti.dl_spec<index = 32 : i32>} : () -> ()
      tensorflow-2.20.0: <unknown>:0: error: fatbinary exited with non-zero error code 256, output: fatbinary fatal   : Unknown option '-image'
      tensorflow-2.20.0: 
      tensorflow-2.20.0: <unknown>:0: note: see current operation: 
      ...
      tensorflow-2.20.0: 2026-02-06 10:27:07.367504: E tensorflow/compiler/mlir/tools/kernel_gen/hlo_to_kernel.cc:246] INTERNAL: Generating device code failed.
      tensorflow-2.20.0: [29,613 / 35,356] [Prepa] Compiling xla/hlo/translate/hlo_to_mhlo/hlo_to_mlir_hlo.cc
      tensorflow-2.20.0: Target //tensorflow/tools/pip_package:wheel failed to build
      tensorflow-2.20.0: Use --verbose_failures to see the command lines of failed build steps.
      tensorflow-2.20.0: INFO: Elapsed time: 1970.094s, Critical Path: 229.50s
      tensorflow-2.20.0: INFO: 29613 processes: 6497 internal, 23116 local.
      tensorflow-2.20.0: ERROR: Build did NOT complete successfully
       52%|██████████████████▎                | 209/399 [38:05<6:46:40, 128.42s/pkg]
      10:27:08 ERROR tensorflow-2.20.0: Failed to build tensorflow==2.20.0: Command '['/opt/app-root/lib64/python3.12/site-packages/fromager/run_network_isolation.sh', 'bazel', 'build', '--config', 'cuda', '--config', 'cuda_wheel', '--repo_env=USE_PYWRAP_RULES=1', '--repo_env=TF_VERSION=2.20.0', '--repo_env=ML_WHEEL_TYPE=custom', '--repo_env=ML_WHEEL_VERSION_SUFFIX=+redhat', '--repo_env=WHEEL_NAME=tensorflow', '//tensorflow/tools/pip_package:wheel']' returned non-zero exit status 1.
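For triage, the root-cause lines can be pulled out of a long build log mechanically. The following is a minimal sketch, not part of the pipeline: the function name `triage` and the signature names are hypothetical, and the patterns are copied from the log output in this report. It assumes every relevant line carries the `<package>-<version>: ` prefix seen above, and it does not handle package names that contain hyphens.

```python
import re

# Per-package prefix as it appears in the log above, e.g. "tensorflow-2.20.0: ".
# Assumption: the package name itself contains no hyphen.
PREFIX = re.compile(r"^\s*(?P<pkg>[A-Za-z0-9_.]+-\d[\w.]*): (?P<msg>.*)$")

# Error signatures copied from this report's log; the dictionary keys are
# hypothetical labels chosen for this sketch.
SIGNATURES = {
    "fatbinary_unknown_option": re.compile(
        r"fatbinary fatal\s*: Unknown option '(?P<opt>[^']+)'"
    ),
    "device_codegen_failed": re.compile(r"Generating device code failed"),
}


def triage(lines):
    """Count known error signatures per package in an iterable of log lines."""
    findings = {}
    for line in lines:
        prefixed = PREFIX.match(line)
        if prefixed is None:
            continue  # progress bars and unprefixed lines are ignored
        for name, pattern in SIGNATURES.items():
            hit = pattern.search(prefixed["msg"])
            if hit is None:
                continue
            key = (prefixed["pkg"], name)
            entry = findings.setdefault(key, {"count": 0, "detail": hit.groupdict()})
            entry["count"] += 1
    return findings
```

Run against the excerpt above, this would report the rejected `-image` option once per failing kernel, plus the final "Generating device code failed" line, all attributed to tensorflow-2.20.0.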

      Expected results:

      Both notebook-cuda13.0-ubi9 jobs should complete without failing on the tensorflow build.
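One way a job could avoid failing on a known-incompatible combination is a pre-build check. The sketch below is illustrative only and is not necessarily how this issue was resolved. The table `FIRST_FAILING_CUDA` and the function `should_skip` are hypothetical names, and the single entry records only what this report observed: tensorflow-2.20.0 fails to build against CUDA 13.0. Treating newer CUDA versions as also failing is an assumption of the sketch.

```python
# Hypothetical table: first CUDA version at which a package release was
# observed to fail. The only entry comes from this report.
FIRST_FAILING_CUDA = {
    ("tensorflow", "2.20.0"): "13.0",
}


def parse_version(text):
    """Turn a dotted numeric version such as '13.0' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def should_skip(package, version, cuda_version, table=FIRST_FAILING_CUDA):
    """Return True if the package release has a recorded failure at or below
    the given CUDA version. False means only that no failure is recorded."""
    first_bad = table.get((package, version))
    if first_bad is None:
        return False
    # Assumption: CUDA versions newer than the first failing one also fail.
    return parse_version(cuda_version) >= parse_version(first_bad)
```

With this table, `should_skip("tensorflow", "2.20.0", "13.0")` returns `True`, so a caller could skip or defer the package instead of spending build time on a compile that is known to fail.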

      Additional info:

              spryor@redhat.com Sean Pryor
              cheimes@redhat.com Christian Heimes
              Frank's Team