Performance and Scale for AI Platforms / PSAP-1335

Investigate LLM model load times


    • Type: Spike
    • Resolution: Obsolete
    • Priority: Major
    • Jan 13
    • Test report with the analysis
      Potential blog post if anything interesting is found
    • Inference, RHOAI

      Investigate the CPU spikes during LLM model load times on single and multiple GPUs.
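
      As a starting point, a minimal measurement sketch, assuming the load can be reproduced in a standalone Python process: time the model load while sampling the process's CPU utilization in the background. The psutil sampler and the use of transformers are illustrative assumptions, not tooling named in this issue.

      {code:python}
      # Minimal sketch: time the model load and record peak process CPU
      # utilization. psutil and transformers are assumed tooling.
      import threading
      import time

      import psutil
      from transformers import AutoModelForSeq2SeqLM

      def sample_cpu(samples, stop, interval=0.5):
          # Append one CPU-percent reading per interval until asked to stop.
          proc = psutil.Process()
          proc.cpu_percent(None)  # prime the counter; the first call returns 0.0
          while not stop.is_set():
              time.sleep(interval)
              samples.append(proc.cpu_percent(None))

      samples, stop = [], threading.Event()
      sampler = threading.Thread(target=sample_cpu, args=(samples, stop), daemon=True)
      sampler.start()

      start = time.monotonic()
      model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
      load_seconds = time.monotonic() - start

      stop.set()
      sampler.join()
      print(f"load time: {load_seconds:.1f}s  peak CPU: {max(samples, default=0.0):.0f}%")
      {code}

      Note that per-process CPU percent can exceed 100% when many threads run at once, which is exactly the spike this issue is about.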

      Assessment by dagray@redhat.com 

      His experiment set a 2-CPU limit on flan-t5-xxl; he saw CPU throttling but no significant increase in the actual model load time. The high CPU utilization for the multi-GPU models may simply be due to thread initialization, similar to the earlier issue where we ran out of threads, which was addressed by setting RAYON_NUM_THREADS=32. Next week, Nikhil/Kevin can try setting this to around 4 threads per shard and see whether it lowers the peak CPU utilization and how it impacts performance; a sketch of that sweep follows below.
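
      A sketch of that RAYON_NUM_THREADS sweep, under the assumption that one shard's load can be driven from a standalone script. "load_model.py" is a hypothetical stand-in for whatever actually loads a shard; it is not a script from this ticket.

      {code:python}
      # Hedged sketch of the suggested experiment: sweep RAYON_NUM_THREADS and
      # compare load times. rayon reads RAYON_NUM_THREADS when its thread pool
      # is first initialized, so the variable must be in the environment before
      # the loading process starts (hence the child process).
      import os
      import subprocess
      import time

      for nthreads in (32, 16, 8, 4):
          env = dict(os.environ, RAYON_NUM_THREADS=str(nthreads))
          start = time.monotonic()
          # "load_model.py" is a placeholder for the actual shard-loading entry point.
          subprocess.run(["python", "load_model.py"], env=env, check=True)
          elapsed = time.monotonic() - start
          print(f"RAYON_NUM_THREADS={nthreads}: loaded in {elapsed:.1f}s")
      {code}

      Comparing peak CPU utilization across these runs (e.g., with the sampler sketched above) would show whether thread initialization is the cause.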

              Assignee: dagray@redhat.com (David Whyte-Gray)
              Reporter: akamra8979 (Ashish Kamra)
              Votes: 0
              Watchers: 2
