AI Platform Core Components / AIPCC-10853

Add array_record into the RHAI pipeline onboarding collection

    • Type: Story
    • Resolution: Unresolved
      Add package 'array_record' into the RHAI pipeline onboarding collection.

      The package requires builder repository onboarding before it can be added to the RHAI pipeline. This ticket is blocked by the builder onboarding ticket.

      Summary

      Executive Summary: array_record Build Analysis

      array_record is a Google-developed, high-performance file format library for ML I/O workloads, built on top of Google's Riegeli library. It is licensed under Apache-2.0, which is fully compatible with Red Hat redistribution requirements. The package is a hard dependency of tensorflow-datasets and has no substitutes – it is the only implementation of the ArrayRecord format. The recommended target version is 0.8.3, the latest stable release on PyPI.

      The primary challenge for source builds is that array_record uses Bazel exclusively as its build system – there is no setuptools, CMake, or Meson fallback for compiling the C++ native extension. The entire C++ dependency chain (Riegeli, Abseil, Protobuf 28.3, Eigen, pybind11) is resolved hermetically through Bazel's module system. Building from source requires cloning the Git repository, running Bazel 7.2.1 inside a manylinux2014 Docker container, and using auditwheel repair to produce self-contained wheels. Build complexity is rated 9/10. No sdist is published to PyPI. For x86_64 Linux, the build path is well-established and has no blockers; however, ppc64le is not supported due to unresolved HighwayHash linker failures and Riegeli API incompatibilities (issues #149, #151).
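      The Dockerized build flow described above can be sketched as a dry-run helper that only assembles the container invocation (the quay.io/pypa image naming is an assumption based on the standard manylinux images, and the helper name is illustrative – verify both against the upstream CI configuration):

```shell
# Dry-run sketch: print the docker command for a manylinux2014 source build
# instead of executing it, so the pieces are easy to inspect.
# Assumptions: standard pypa manylinux image naming; oss/build_whl.sh is the
# upstream build driver (it runs Bazel 7.2.1 internally, per the notes above).
manylinux_build_cmd() {
  # $1: target architecture, e.g. x86_64 or aarch64
  echo "docker run --rm -v $PWD:/workspace -w /workspace" \
       "quay.io/pypa/manylinux2014_$1 bash oss/build_whl.sh"
}

manylinux_build_cmd x86_64
```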

      A critical runtime constraint exists: Protobuf must be pinned to version 28.3. Using Protobuf 29.x causes segfaults when co-loaded with TensorFlow – the primary use case for this package. This pinning is already enforced in the upstream MODULE.bazel via single_version_override. Version 0.8.0 was yanked from PyPI due to this exact issue; current v0.8.3 is stable. The [beam] extra should not be included in default installs to avoid pulling in the heavy apache-beam[gcp] dependency.
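      A minimal guard for the pin can be expressed as a shell check. Note that 5.28.3 is assumed here as the PyPI protobuf release line corresponding to protoc 28.3 – verify the exact version mapping before enforcing it in the pipeline:

```shell
# Guard against the protobuf 29.x segfault described above by rejecting any
# installed version other than the pinned one.
# Assumption: PyPI protobuf 5.28.3 is the wheel matching protoc 28.3.
REQUIRED_PROTOBUF="5.28.3"

check_protobuf_pin() {
  # $1: installed version, e.g. from `pip show protobuf`
  if [ "$1" = "$REQUIRED_PROTOBUF" ]; then
    echo "ok: protobuf $1 matches pin"
  else
    echo "error: protobuf $1 installed, expected $REQUIRED_PROTOBUF" >&2
    return 1
  fi
}
```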

      Recommended approach: replicate the upstream CI pipeline using a Dockerized manylinux2014 environment with Bazel 7.2.1. As a pragmatic fallback, upstream PyPI wheels are self-contained (all C++ dependencies statically linked, only standard system libraries required at runtime) and available for x86_64, aarch64, and macOS ARM64 across Python 3.11–3.14.
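      For the wheel fallback, the install command might look like the helper below, which only assembles the pip invocation so the flag choices stay visible. The flags are standard pip options; forcing binary installs is safe here since no sdist exists anyway, and the bare package name deliberately skips the [beam] extra to avoid pulling in apache-beam[gcp]:

```shell
# Build the pip command for the prebuilt-wheel fallback path.
# --only-binary=:all: refuses source builds (no sdist is published anyway);
# plain "array-record" (no [beam] extra) avoids the heavy apache-beam[gcp] pull.
wheel_install_cmd() {
  # $1: target version, e.g. 0.8.3
  echo "pip install --only-binary=:all: array-record==$1"
}

wheel_install_cmd 0.8.3
```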

      Key Build Commands (inside manylinux2014 container)

      # Install Bazel 7.2.1
      curl -sSL -o /usr/local/bin/bazel \
        "https://github.com/bazelbuild/bazel/releases/download/7.2.1/bazel-7.2.1-linux-x86_64"
      chmod +x /usr/local/bin/bazel
      
      # Set environment
      export PYTHON_VERSION=3.12 PYTHON_MAJOR_VERSION=3 PYTHON_MINOR_VERSION=12
      export BAZEL_VERSION=7.2.1 AUDITWHEEL_PLATFORM=manylinux2014_x86_64
      export PYTHON_BIN=$(which python3)
      
      # Build wheel
      bash oss/build_whl.sh
      
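      A small post-build sanity check (a sketch, not part of the upstream script) can confirm that the repaired wheel actually carries the expected manylinux platform tag before it is published:

```shell
# Verify that a wheel filename contains the expected platform tag, e.g. the
# manylinux2014_x86_64 tag that auditwheel repair applies. Filenames here are
# illustrative examples, not actual build output.
verify_wheel_tag() {
  # $1: wheel filename, $2: expected platform tag
  case "$1" in
    *"$2"*.whl) echo "ok: $1" ;;
    *) echo "error: $1 is not tagged $2" >&2; return 1 ;;
  esac
}

verify_wheel_tag "array_record-0.8.3-cp312-cp312-manylinux2014_x86_64.whl" \
                 "manylinux2014_x86_64"
```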

      Critical Findings

      • Build system: Bazel 7.2.1 only – no alternative compilation path exists
      • Protobuf pinning: Must use 28.3; 29.x causes segfaults with TensorFlow
      • Platform support: x86_64 and aarch64 Linux supported; ppc64le blocked upstream
      • License: Apache-2.0 – fully compliant for redistribution
      • No substitutes: Only implementation of ArrayRecord format; required by tensorflow-datasets

              Einat Pacifici (epacific@redhat.com)
              AIPCC JIRABOT (aipcc-jira-bot@redhat.com)