Type: Bug
Resolution: Done
Priority: Critical
Fix Version: rhelai-1.5
Status: Approved
To Reproduce
Steps to reproduce the behavior:
- Go to DIIP and run a new SDG_ONLY pipeline with latest RHELAI 1.5 bits
- Watch the run
DIIP run logs: https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9942717767
IBM Cloud failure logs: https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9936008492
Expected behavior
- The SDG phase should run successfully and generate data as expected.
- Instead, it fails with the following error:
failed to generate data with exception: PipelineBlockError(<class 'instructlab.sdg.blocks.llmblock.LLMBlock'>/router): Error code: 400 - {'object': 'error', 'message': 'allowed_token_ids contains out-of-vocab token id!', 'type': 'BadRequestError', 'param': None, 'code': 400}
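The 400 error above means the vLLM server rejected the request because a token id passed in `allowed_token_ids` (used by the SDG router block for constrained choice) falls outside the teacher model's vocabulary. A minimal sketch of that validity check, assuming only a vocab size; the function name and the example numbers are illustrative, not taken from the SDG or vLLM code:

```python
def find_out_of_vocab_ids(allowed_token_ids, vocab_size):
    """Return the token ids that would trigger vLLM's
    'allowed_token_ids contains out-of-vocab token id!' 400 error:
    any id outside the valid range [0, vocab_size)."""
    return [t for t in allowed_token_ids if not 0 <= t < vocab_size]


# Example with an illustrative 32000-token vocab: ids produced by a
# tokenizer that is out of sync with the served model's config would
# show up here as out-of-range.
print(find_out_of_vocab_ids([0, 31999, 32001], 32000))  # -> [32001]
```

Comparing the router's allowed ids against the served model's actual vocab size is one way to confirm a tokenizer/model-config mismatch on the teacher model.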
Failed Accelerator: NVIDIA
Failed Cloud providers: IBM Cloud, AWS
Device Info (please complete the following information):
CHAT_MODEL=
TEACHER_MODEL=
TEACHER_MODEL_FAMILY=mixtral
STARTER_MODEL=
MODEL_SOURCE=s3
REGISTRY=registry.stage.redhat.io
REGISTRY_NAMESPACE=rhelai1
BOOTC_IMAGE=bootc-nvidia-rhel9
INSTRUCTLAB_IMAGE=instructlab-nvidia-rhel9 registry.redhat.io/rhelai1/instructlab-nvidia-rhel9 1.5.0 306b4cceb073
BOOTC_IMAGE_VERSION=1.5
INSTANCE_NAME=rhelai-ci-runner
MODEL_VERSION=1.5
INSTANCE_TYPE=g6e.12xlarge
ALLOW_SIMILAR_INSTANCE_TYPES=true
SYSTEM_PROFILE=
INSTALL_TYPE=bootc_install
CLOUD_TYPE=aws
TERMINATE_INSTANCE=true
USER_NAME=ec2-user
TAXONOMY_REPO=https://github.com/RedHatOfficial/rhelai-sample-taxonomy
GRANITE_MODEL_PARAMS=8b
GRANITE_VERSION=3.1
GRANITE_VERSION_SUFFIX=v2
NUM_EPOCHS_PHASE_1=1
NUM_EPOCHS_PHASE_2=1
INSTRUCTLAB_VERSION=
PUBLIC_DNS=
EVAL_ONLY=false
SDG_ONLY=true
CHAT_ONLY=false
Bug impact
- It blocks the SDG phase on NVIDIA and potentially on other accelerators as well.
Known workaround
- TBD
Additional context
- osilkin@redhat.com looked briefly at it and provided the following context:
"It seems like the tokens aren't in the model's config, which is really strange."
"Looking at the logs, it seems like it's failing to import from triton when loading vLLM. Have you tried installing an older version?"
ERROR 05-01 22:01:45 [registry.py:347] <frozen runpy>:128: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
ERROR 05-01 22:01:45 [registry.py:347] Traceback (most recent call last):
ERROR 05-01 22:01:45 [registry.py:347]   File "/opt/app-root/lib64/python3.11/site-packages/torch/_inductor/runtime/hints.py", line 46, in <module>
ERROR 05-01 22:01:45 [registry.py:347]     from triton.backends.compiler import AttrsDescriptor
ERROR 05-01 22:01:45 [registry.py:347] ImportError: cannot import name 'AttrsDescriptor' from 'triton.backends.compiler' (/opt/app-root/lib64/python3.11/site-packages/triton/backends/compiler.py)
Links
- blocks: RHELAI-3668 vllm Inference is broken in SDG Downstream Agentic Pipeline (Verified)