-
Story
-
Resolution: Done
-
Major
-
None
-
None
-
None
-
RHOAI, Training
-
False
-
False
-
None
-
1
-
PSAP - General-10, PSAP - General-11, PSAP - General-12, PSAP - General-13
User Story
The goal is to get multi-node training running over both Ethernet and RDMA.
The models I target are:
Meta-Llama-3-8B-Instruct
Meta-Llama-3-70B-Instruct
The hardware for this task is 3 x A30 GPUs, 1 per host, for a total of 3 hosts.
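
As a rough illustration of the setup described above, here is a minimal sketch of how the per-host launch settings might be generated for the two transports. It assumes PyTorch's torchrun launcher and the NCCL backend, neither of which is named in this story. The master address, port, interface name, and train.py script are hypothetical placeholders. Only the node and GPU counts come from the ticket.

```python
# Sketch only. Assumptions not stated in the ticket: torchrun as the launcher,
# NCCL as the communication backend. Address, port, interface name, and
# script name below are hypothetical placeholders.

NNODES = 3          # from the ticket: 3 hosts
GPUS_PER_NODE = 1   # from the ticket: 1 A30 per host


def nccl_env(transport, socket_ifname="eth0"):
    """Return NCCL environment variables for the chosen transport."""
    if transport == "ethernet":
        # NCCL_IB_DISABLE=1 keeps NCCL off InfiniBand/RoCE, so it uses TCP sockets.
        return {"NCCL_IB_DISABLE": "1", "NCCL_SOCKET_IFNAME": socket_ifname}
    if transport == "rdma":
        # Leave the IB/RoCE transport enabled; the socket interface is still
        # used for the initial bootstrap connection.
        return {"NCCL_IB_DISABLE": "0", "NCCL_SOCKET_IFNAME": socket_ifname}
    raise ValueError(f"unknown transport: {transport!r}")


def torchrun_cmd(node_rank, master_addr, master_port=29500, script="train.py"):
    """Build the torchrun argument list for one host."""
    if not 0 <= node_rank < NNODES:
        raise ValueError(f"node_rank must be in [0, {NNODES})")
    return [
        "torchrun",
        f"--nnodes={NNODES}",
        f"--nproc_per_node={GPUS_PER_NODE}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    # Print the environment and command each host would run, per transport.
    for transport in ("ethernet", "rdma"):
        env = " ".join(f"{k}={v}" for k, v in nccl_env(transport).items())
        for rank in range(NNODES):
            print(transport, rank, env, " ".join(torchrun_cmd(rank, "10.0.0.1")))
```

Running the same job once with each environment gives a like-for-like comparison between the Ethernet and RDMA paths, since only the NCCL transport setting changes between runs.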