Jira issue fields (as exported): Bug | Resolution: Not a Bug | rhelai-1.4 | Important | Approved
To Reproduce
Steps to reproduce the behavior:
1. podman pull registry.stage.redhat.io/rhelai1/instructlab-intel-rhel9:1.4-1738240991
2. podman run -it --hooks-dir /tmp --device /dev/accel --device /dev/infiniband instructlab-intel-rhel9:1.4-1738240991 /bin/bash
3. (app-root)$ PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true vllm serve instructlab/granite-7b-lab --dtype bfloat16 --distributed-executor-backend mp -tp 4
4. (app-root)$ vllm chat  # in a separate terminal (e.g., a tmux pane) attached to the same container
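Once the serve command in step 3 reports it is ready, a quick readiness check can help confirm the server actually came up before attempting chat. A minimal sketch, assuming the OpenAI-compatible server is listening on vLLM's default port 8000 inside the container:
(app-root)$ curl http://localhost:8000/health      # returns HTTP 200 once the server is ready
(app-root)$ curl http://localhost:8000/v1/models   # should list instructlab/granite-7b-lab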
Expected behavior
- The vLLM server should begin its boot process, go through warmup, and start serving the model.
- The user should be able to chat with the model without it failing.
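For reference, the "chat with the model" expectation can also be exercised without vllm chat by calling the OpenAI-compatible endpoint directly. A minimal sketch, again assuming the default port 8000:
(app-root)$ curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "instructlab/granite-7b-lab", "messages": [{"role": "user", "content": "Hello"}]}'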
Device Info (please complete the following information):
- Hardware Specs:
- 8x Gaudi 3
- OS Version: Ubuntu 22.04
- InstructLab Version: 0.23.1
- The machine was returned to SMC before a hardware dump could be recorded for this bug report.
Bug impact
- Inference via vLLM across more than one Gaudi 3 device is not functioning in this image. Users will not be able to serve models across multiple Gaudi 3 devices via vLLM.
Known workaround
- None
Additional context
- The version of vLLM present in this image should be the released v0.6.4 plus a commit that enables the MP backend for distributed inference across more than one Gaudi 3 device (see the version-check snippet after this list).
- In the TP=1 case, vLLM seems to behave correctly. The attached logs reflect that.
- In the TP=2 case, vLLM boots, but cannot handle an incoming request.
- In the TP=4 case, vLLM cannot boot.
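To confirm which vLLM build actually shipped in the image, the installed version can be checked from inside the container. A minimal sketch, assuming vLLM was installed into the container's default Python environment via pip:
(app-root)$ pip show vllm   # reports the installed version and install location
(app-root)$ python -c "import vllm; print(vllm.__version__)"   # expected to correspond to v0.6.4 plus the MP-enablement commit noted above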