AIPCC-2060

Redesign RH-Benchmark Report Generation for Scalable Collaboration


      Objective

      Currently, when machine learning engineers (MLEs) work on the same report in J-Benchmark, they must be on the same Git branch. This constraint causes inefficiencies, limits parallel work, and complicates collaboration. The goal of this epic is to define and implement a new approach for generating reports that:

      • Eliminates this Git-based limitation
      • Enables simple and flexible scheduling of benchmarks
      • Improves collaboration and parallel development
      • Maintains scalability and ease of maintenance

      Background

      • Reports in J-Benchmark are currently defined using a class-based system in code (see the sketch after this list).
      • This ties report development directly to the Git branch, making collaboration slow and error-prone.
      • There's no easy way for multiple MLEs to work on the same report independently or schedule report runs flexibly.
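      For context, a minimal sketch of what a class-based report definition can look like; the names below are illustrative, not the actual J-Benchmark API:

      ```python
      # Hypothetical example of the class-per-report pattern. Because the
      # report lives in code, editing it means committing to a Git branch,
      # and collaborators must share that branch.
      class BenchmarkReport:
          """Base class: every report is a Python class checked into Git."""
          name: str = ""
          models: list[str] = []
          metrics: list[str] = []

      class LlamaLatencyReport(BenchmarkReport):
          # Two MLEs iterating on this report at the same time must work
          # on the same branch, which is the constraint this epic removes.
          name = "llama-latency"
          models = ["meta-llama/Llama-3.1-8B-Instruct"]
          metrics = ["ttft", "throughput"]
      ```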

      Key Phases

      Phase 1: Discover How MLEs Collaborate Today

      • Interview or shadow MLEs working on shared reports
      • Identify:
        • Specific collaboration pain points
        • Git-related limitations
        • Workarounds currently used
      • Output: Summary of current workflows and challenges

      Phase 2: Document Pros and Cons of the Class-Based Approach

      • Describe the current class-based system used for report definitions
      • List its strengths and weaknesses across:
        • Maintainability
        • Performance
        • Flexibility
        • Collaboration
      • Emphasize the branching constraint as a core blocker
      • Output: Structured pros/cons document

      Phase 3: Propose Alternative Solutions

      • Provide a list of viable alternatives to the current method (Option 3 is sketched after this list):
        • Option 1: Improve the class-based system
        • Option 2: Move to a database-driven approach
        • Option 3: Use a hybrid approach (code + config)
        • Option 4: Add your own!
      • No need to decide yet; these are examples for discussion.
      • Output: Short write-up listing the pros, cons, and trade-offs of each direction
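      To make Option 3 concrete, a minimal sketch; the schema and field names are assumptions for discussion, not a decided format:

      ```python
      # Sketch of Option 3 (code + config), illustrative only: the report
      # *definition* becomes data that can live outside any Git branch,
      # while the runner stays in code.
      import yaml  # pip install pyyaml

      REPORT_SPEC = """
      name: llama-latency
      models:
        - meta-llama/Llama-3.1-8B-Instruct
      metrics: [ttft, throughput]
      schedule: "0 6 * * *"   # cron-style field enables flexible scheduling
      """

      def load_report(spec_text: str) -> dict:
          """Parse and minimally check a report spec."""
          spec = yaml.safe_load(spec_text)
          for key in ("name", "models", "metrics"):
              if key not in spec:
                  raise ValueError(f"report spec is missing required field: {key}")
          return spec

      if __name__ == "__main__":
          print(load_report(REPORT_SPEC))
      ```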

      Phase 4: Present to R&D for Discussion and Decision

      • Present the current state, issues, and options to the R&D team
      • Invite suggestions for additional solutions
      • Emphasize the goal: make benchmark scheduling and report iteration simple and efficient
      • Facilitate an open discussion and alignment on next steps
      • Output: Preferred direction selected and documented

      Phase 5: Implement the Approved Approach

      • Output: Implementation plan + working solution

      Expected Deliverables

      • Collaboration workflow summary
      • Class-based approach pros/cons
      • List of possible alternative approaches
      • R&D presentation and selected path
      • Implementation plan + working solution

      Notes:

      1. Please make sure to design the separation so that it supports triggering benchmarks in both workflows (see the sketch below):

      • The standard flow, where the model spins up at execution time
      • The alternate flow, where the model is already running and the user simply provides a URL + API key
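      A minimal Python sketch of one way to express this separation; every name here (ModelEndpoint, ManagedModel, ExternalModel, run_benchmark) is hypothetical, not an existing API:

      ```python
      # Illustrative only: the runner depends on "how do I reach a serving
      # model", so both triggering workflows plug in without changing it.
      from typing import Protocol

      class ModelEndpoint(Protocol):
          def resolve(self) -> tuple[str, str]:
              """Return (base_url, api_key) for a model ready to serve."""
              ...

      class ManagedModel:
          """Standard flow: the model spins up at execution time."""
          def __init__(self, model_id: str) -> None:
              self.model_id = model_id

          def resolve(self) -> tuple[str, str]:
              # Placeholder: deploy the model, block until it is ready,
              # then return its serving URL and credentials.
              raise NotImplementedError("deployment is out of scope here")

      class ExternalModel:
          """Alternate flow: the model is already running; the user
          supplies a URL + API key and deployment is skipped."""
          def __init__(self, base_url: str, api_key: str) -> None:
              self.base_url, self.api_key = base_url, api_key

          def resolve(self) -> tuple[str, str]:
              return self.base_url, self.api_key

      def run_benchmark(endpoint: ModelEndpoint) -> None:
          base_url, api_key = endpoint.resolve()
          print(f"benchmarking against {base_url}")  # real runner issues requests here

      # Alternate flow: no deployment, just point at a running server.
      run_benchmark(ExternalModel("https://models.example.com/v1", api_key="demo-key"))
      ```

      One possible benefit of this split is that scheduling and model lifecycle stay independent concerns.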

      2. Please review how llm-d-benchmark implemented this capability. Beyond the basic decomposition, there are a few additional components that need to be addressed:

      • Ensure that models can be benchmarked without having to be “pushed” into the codebase, just as llm-d-benchmark does.

      • To extend on the point above: we need to support running benchmarks without adding models to the code. However, if we take this approach, keep in mind that some validations we run behind the scenes won’t trigger, and benchmark integrity might be compromised.

        Similarly, even when a model is defined in code, we should still provide the option to run it with or without validation (see the sketch below).
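      A minimal sketch of how optional validation and ad-hoc (not-in-code) models could fit together; all names are hypothetical:

      ```python
      # Sketch for note 2: ad-hoc model specs never enter the codebase,
      # and the behind-the-scenes validations become an explicit, visible
      # opt-out even for models that are defined in code.
      from dataclasses import dataclass

      @dataclass
      class ModelSpec:
          model_id: str
          registered: bool = False  # True when the model is defined in code

      def validate(spec: ModelSpec) -> None:
          # Stand-in for the validations the note mentions; today the
          # real checks only exist for registered models.
          print(f"validating {spec.model_id}")

      def run_benchmark(spec: ModelSpec, validate_model: bool = True) -> None:
          if validate_model:
              validate(spec)
          elif not spec.registered:
              # The integrity caveat from the note: nothing has vouched
              # for this model, so flag the run rather than failing it.
              print(f"WARNING: unvalidated ad-hoc run for {spec.model_id}")
          print(f"benchmarking {spec.model_id}")

      # Ad-hoc model, no code change, validation explicitly skipped:
      run_benchmark(ModelSpec("my-org/experimental-model"), validate_model=False)

      # Registered model, run without validation on request:
      run_benchmark(ModelSpec("llama-3-8b", registered=True), validate_model=False)
      ```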

      The goal here is to make it easy for MLEs to run benchmarks without “overthinking” the process.
