-
Spike
-
Resolution: Obsolete
-
Major
Investigate CPU spikes during LLM model loading on single and multiple GPUs
Assessment by dagray@redhat.com
–
In his experiment, he set a 2-CPU limit on flan-t5-xxl and saw CPU throttling but no significant increase in the actual model load time. The high CPU utilization for the multi-GPU models may simply be due to thread initialization, similar to the issue we saw with running out of threads, which was addressed by setting RAYON_NUM_THREADS=32. Later next week, Nikhil/Kevin can try setting this to around 4 threads per shard and check whether it lowers the peak CPU utilization and how it affects performance.
–
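A minimal sketch of how the proposed comparison could be scripted, assuming the model server picks up RAYON_NUM_THREADS from the process environment. The helper names (shard_env, measure_load) and the placeholder command are hypothetical and not part of any existing tooling; the real shard launch command would replace the placeholder. It reports wall time and average CPU cores used by the child process, not the instantaneous peak, so peak utilization would still need to come from cluster monitoring. It relies on the Unix-only resource module.

```python
import os
import resource
import subprocess
import sys
import time


def shard_env(threads_per_shard, base_env=None):
    """Return a copy of the environment with RAYON_NUM_THREADS set."""
    if threads_per_shard < 1:
        raise ValueError("threads_per_shard must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["RAYON_NUM_THREADS"] = str(threads_per_shard)
    return env


def measure_load(cmd, threads_per_shard):
    """Run cmd to completion and report wall time and child CPU time."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    result = subprocess.run(
        cmd,
        env=shard_env(threads_per_shard),
        capture_output=True,
        text=True,
        check=True,
    )
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # CPU seconds (user + system) consumed by the child we just waited for.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return {
        "threads": threads_per_shard,
        "wall_s": wall,
        "cpu_s": cpu,
        "avg_cores": cpu / wall if wall > 0 else 0.0,
        "stdout": result.stdout,
    }


if __name__ == "__main__":
    # Placeholder command (assumption): it only echoes the variable back.
    # Swap in the actual model load command for a real comparison.
    cmd = [sys.executable, "-c",
           "import os; print(os.environ['RAYON_NUM_THREADS'])"]
    for n in (32, 4):
        stats = measure_load(cmd, n)
        print(f"threads={stats['threads']} wall={stats['wall_s']:.2f}s "
              f"cpu={stats['cpu_s']:.2f}s avg_cores={stats['avg_cores']:.2f}")
```

Running the same load command once with 32 and once with 4 threads gives a like-for-like comparison of load time and CPU consumed; if load time is roughly unchanged while CPU seconds drop, that would be consistent with the thread-initialization explanation above.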