Performance and Scale for AI Platforms / PSAP-1335

Investigate LLM model load times


    • Type: Spike
    • Resolution: Obsolete
    • Priority: Major
    • Jan 13
    • Test report with the analysis
      Potential blog post if anything interesting is found
    • Inference, RHOAI

      Investigate the CPU spikes during LLM model load times on single and multiple GPUs.
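
      As a starting point, a minimal measurement sketch, assuming the load can be reproduced in a standalone Python process: time the model load while sampling the process's CPU utilization in the background. The psutil sampler and the use of transformers are illustrative assumptions, not tooling named in this issue.

      {code:python}
      # Minimal sketch: time the model load and record peak process CPU
      # utilization. psutil and transformers are assumed tooling.
      import threading
      import time

      import psutil
      from transformers import AutoModelForSeq2SeqLM

      def sample_cpu(samples, stop, interval=0.5):
          # Append one CPU-percent reading per interval until asked to stop.
          proc = psutil.Process()
          proc.cpu_percent(None)  # prime the counter; the first call returns 0.0
          while not stop.is_set():
              time.sleep(interval)
              samples.append(proc.cpu_percent(None))

      samples, stop = [], threading.Event()
      sampler = threading.Thread(target=sample_cpu, args=(samples, stop), daemon=True)
      sampler.start()

      start = time.monotonic()
      model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
      load_seconds = time.monotonic() - start

      stop.set()
      sampler.join()
      print(f"load time: {load_seconds:.1f}s  peak CPU: {max(samples, default=0.0):.0f}%")
      {code}

      Note that per-process CPU percent can exceed 100% when many threads run at once, which is exactly the spike this issue is about.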

      Assessment by dagray@redhat.com 

      His experiment set a 2-CPU limit on flan-t5-xxl; he saw CPU throttling but no significant increase in the actual model load time. The high CPU utilization for the multi-GPU models may simply be due to thread initialization, similar to the earlier issue where we ran out of threads, which was addressed by setting RAYON_NUM_THREADS=32. Next week, Nikhil/Kevin can try setting this to around 4 threads per shard and see whether it lowers the peak CPU utilization and how it impacts performance; a sketch of that sweep follows below.
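
      A sketch of that RAYON_NUM_THREADS sweep, under the assumption that one shard's load can be driven from a standalone script. "load_model.py" is a hypothetical stand-in for whatever actually loads a shard; it is not a script from this ticket.

      {code:python}
      # Hedged sketch of the suggested experiment: sweep RAYON_NUM_THREADS and
      # compare load times. rayon reads RAYON_NUM_THREADS when its thread pool
      # is first initialized, so the variable must be in the environment before
      # the loading process starts (hence the child process).
      import os
      import subprocess
      import time

      for nthreads in (32, 16, 8, 4):
          env = dict(os.environ, RAYON_NUM_THREADS=str(nthreads))
          start = time.monotonic()
          # "load_model.py" is a placeholder for the actual shard-loading entry point.
          subprocess.run(["python", "load_model.py"], env=env, check=True)
          elapsed = time.monotonic() - start
          print(f"RAYON_NUM_THREADS={nthreads}: loaded in {elapsed:.1f}s")
      {code}

      Comparing peak CPU utilization across these runs (e.g., with the sampler sketched above) would show whether thread initialization is the cause.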

              Assignee: dagray@redhat.com (David Whyte-Gray)
              Reporter: akamra8979 (Ashish Kamra)
              Votes: 0
              Watchers: 2
