  1. Red Hat Enterprise Linux AI
  2. RHELAI-2443

GPU Resource Reservation for Serving, Training or SDG



      Feature Overview

      Enhance InstructLab by enabling users to reserve a subset of GPUs for serving, training, or synthetic data generation (SDG) tasks. The aim is better resource management, improved task performance, and cost optimization.

      Goals

      • Extend InstructLab to support an option for reserving or partitioning GPUs for training, serving, or SDG. This new option will allow users to select a subset of the available GPUs for a task, so that the remaining GPUs stay free for other work.

      Requirements:

        - Users should be able to reserve a specific number of GPUs for their task.
        - The system should validate if the requested GPUs are available.
        - The system should ensure that the reserved GPUs are not used for other tasks until the reservation period ends.
        - The system should provide a user-friendly interface to manage reservations.
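
      The first three requirements can be sketched as a small in-memory registry. This is only an illustration under stated assumptions: the names GpuPool, Reservation, and reserve are hypothetical and are not part of InstructLab, GPUs are identified by plain integer indices, and persistence across processes is ignored.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Reservation:
    task: str               # e.g. "serve", "train", or "sdg"
    gpu_ids: list[int]      # GPU indices held by this reservation
    expires_at: datetime    # end of the reservation period


@dataclass
class GpuPool:
    total: int              # number of GPUs on the host (assumed known)
    reservations: list[Reservation] = field(default_factory=list)

    def free_gpus(self, now: datetime) -> list[int]:
        # Drop expired reservations, then list the GPUs nobody holds.
        self.reservations = [r for r in self.reservations if r.expires_at > now]
        held = {g for r in self.reservations for g in r.gpu_ids}
        return [g for g in range(self.total) if g not in held]

    def reserve(self, task: str, count: int, duration: timedelta,
                now: datetime) -> Reservation:
        # Validate availability before granting the reservation.
        free = self.free_gpus(now)
        if count < 1 or count > len(free):
            raise ValueError(f"requested {count} GPUs, {len(free)} available")
        res = Reservation(task, free[:count], now + duration)
        self.reservations.append(res)
        return res


pool = GpuPool(total=8)
t0 = datetime(2025, 1, 1, 12, 0)
train = pool.reserve("train", 4, timedelta(hours=2), t0)
serve = pool.reserve("serve", 2, timedelta(hours=1), t0)
print(train.gpu_ids, serve.gpu_ids)  # [0, 1, 2, 3] [4, 5]
```

      Passing the current time explicitly keeps the expiry logic easy to test; a real implementation would also need to share this state between concurrently running tasks.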

      Background

      Currently, all available GPUs are allocated for any task, which may lead to resource wastage and potential performance issues. By implementing GPU resource reservation or allocation, we can address these concerns and improve the overall user experience.

      Done

      • [ ] A user-facing option for reserving GPUs has been designed and implemented.
      • [ ] System validation for GPU availability has been implemented.
      • [ ] Reserved GPUs are not used for other tasks until the reservation period ends.
      • [ ] Users can manage their reservations through a CLI flag or configuration.
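
      One possible shape for the CLI-flag criterion, as a hedged sketch: the --gpus flag and the helper names below are hypothetical and do not describe an existing InstructLab option. The sketch relies on CUDA_VISIBLE_DEVICES, the environment variable that restricts which NVIDIA GPUs a CUDA process can see; other accelerator vendors use different variables.

```python
import argparse
import os


def parse_gpu_list(text: str) -> list[int]:
    # Accept a comma-separated list such as "0,1,3"; reject empty input,
    # duplicates, and negative indices.
    ids = [int(part) for part in text.split(",") if part.strip()]
    if not ids or len(set(ids)) != len(ids) or min(ids) < 0:
        raise argparse.ArgumentTypeError(f"invalid GPU list: {text!r}")
    return ids


def build_env(gpu_ids: list[int], base_env: dict[str, str]) -> dict[str, str]:
    # Return a copy of the environment restricted to the chosen GPUs.
    env = dict(base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    return env


parser = argparse.ArgumentParser(prog="example-task")
parser.add_argument("--gpus", type=parse_gpu_list, default=None,
                    help="hypothetical flag: comma-separated GPU indices")

args = parser.parse_args(["--gpus", "0,1"])
env = build_env(args.gpus, dict(os.environ))
print(env["CUDA_VISIBLE_DEVICES"])  # 0,1
```

      The resulting env mapping would be passed to whatever launches the serving, training, or SDG process, for example through the env argument of subprocess.run.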

      Questions to Answer

      • How should the system handle cases where the requested number of GPUs is not available?
      • Should there be a limit to the duration of a GPU reservation?
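
      To make the two questions concrete, the sketch below shows two candidate behaviors when too few GPUs are free (fail, or shrink the request) together with a duration cap. It illustrates the options only and does not propose an answer; ShortfallPolicy, MAX_DURATION, and resolve_request are hypothetical names, and the 24-hour cap is an arbitrary placeholder.

```python
from datetime import timedelta
from enum import Enum


class ShortfallPolicy(Enum):
    FAIL = "fail"        # reject the request outright
    SHRINK = "shrink"    # proceed with however many GPUs are free


MAX_DURATION = timedelta(hours=24)  # illustrative cap, not a proposed value


def resolve_request(requested: int, free: int, duration: timedelta,
                    policy: ShortfallPolicy) -> tuple[int, timedelta]:
    # Clamp the duration first, then decide how many GPUs to grant.
    granted_duration = min(duration, MAX_DURATION)
    if requested <= free:
        return requested, granted_duration
    if policy is ShortfallPolicy.SHRINK and free > 0:
        return free, granted_duration
    raise RuntimeError(f"requested {requested} GPUs but only {free} are free")


count, length = resolve_request(4, 2, timedelta(hours=48), ShortfallPolicy.SHRINK)
print(count, length)  # 2 1 day, 0:00:00
```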

      Out of Scope

      • Resource optimization algorithms for dynamic GPU allocation.
      • Integration with external GPU management systems.

      Customer Considerations

      • The system should provide clear notifications when a GPU reservation is about to expire.
      • Users should have the option to renew their reservations if needed.
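
      The two considerations above could look roughly like this. It is a minimal sketch with assumed names (Reservation, WARN_WINDOW, expiring_soon, renew) and an arbitrary 15-minute warning window; how the notification is actually delivered is left open.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WARN_WINDOW = timedelta(minutes=15)  # assumed lead time for the warning


@dataclass
class Reservation:
    gpu_ids: list[int]
    expires_at: datetime

    def expiring_soon(self, now: datetime) -> bool:
        # True while the reservation is still active but inside the window.
        return now < self.expires_at <= now + WARN_WINDOW

    def renew(self, extra: timedelta, now: datetime) -> None:
        # Only an active reservation can be extended.
        if now >= self.expires_at:
            raise ValueError("reservation already expired")
        self.expires_at += extra


res = Reservation([0, 1], datetime(2025, 1, 1, 13, 0))
now = datetime(2025, 1, 1, 12, 50)
if res.expiring_soon(now):
    print(f"GPUs {res.gpu_ids} are released at {res.expires_at:%H:%M}")
    res.renew(timedelta(hours=1), now)
print(res.expires_at)  # 2025-01-01 14:00:00
```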

              jepandit@redhat.com Jehlum Vitasta Pandit
              wcabanba@redhat.com William Caban