
RHELAI-4084: SDG generation phase fails with "allowed_token_ids contains out-of-vocab token id" error


      To Reproduce

      Steps to reproduce the behavior:

      1. Go to DIIP and run a new SDG_ONLY pipeline with the latest RHEL AI 1.5 bits
      2. Watch the run

      DIIP run logs: https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9942717767

      IBM Cloud failure logs: https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9936008492 

      Expected behavior

      • The SDG phase should run successfully and generate the data as expected.
      • Instead, it fails with the error below:

       

      failed to generate data with exception: PipelineBlockError(<class 'instructlab.sdg.blocks.llmblock.LLMBlock'>/router): Error code: 400 - {'object': 'error', 'message': 'allowed_token_ids contains out-of-vocab token id!', 'type': 'BadRequestError', 'param': None, 'code': 400} 
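      If it helps to isolate where this check fires: the 400 comes back from the model-serving endpoint, and the same rejection can be triggered directly against a vLLM OpenAI-compatible server by passing an allowed_token_ids value outside the model's vocabulary. A minimal sketch, assuming a locally served teacher model; the base_url, model name, and token id below are placeholders, not values from the DIIP run:

      from openai import OpenAI

      # Placeholder endpoint and model name; adjust to wherever the teacher model is served.
      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      try:
          client.completions.create(
              model="mixtral-8x7b-instruct",        # placeholder teacher model name
              prompt="Answer yes or no:",
              max_tokens=1,
              # Deliberately out-of-range id; the server should reject it with the same
              # "allowed_token_ids contains out-of-vocab token id!" 400 error.
              extra_body={"allowed_token_ids": [10_000_000]},
          )
      except Exception as exc:
          print(exc)

      If a request like this succeeds with small, clearly valid ids but the SDG router block still fails, the bad ids are presumably being produced by the pipeline's token lookup rather than by the request path itself.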

       

      Failed Accelerator: NVIDIA

      Failed Cloud providers: IBM Cloud, AWS

      Device Info:

       

      CHAT_MODEL=
      TEACHER_MODEL=
      TEACHER_MODEL_FAMILY=mixtral
      STARTER_MODEL=
      MODEL_SOURCE=s3
      REGISTRY=registry.stage.redhat.io
      REGISTRY_NAMESPACE=rhelai1
      BOOTC_IMAGE=bootc-nvidia-rhel9
      INSTRUCTLAB_IMAGE=instructlab-nvidia-rhel9
      registry.redhat.io/rhelai1/instructlab-nvidia-rhel9  1.5.0 306b4cceb073
      BOOTC_IMAGE_VERSION=1.5
      INSTANCE_NAME=rhelai-ci-runner
      MODEL_VERSION=1.5
      INSTANCE_TYPE=g6e.12xlarge
      ALLOW_SIMILAR_INSTANCE_TYPES=true
      SYSTEM_PROFILE=
      INSTALL_TYPE=bootc_install
      CLOUD_TYPE=aws
      TERMINATE_INSTANCE=true
      USER_NAME=ec2-user
      TAXONOMY_REPO=https://github.com/RedHatOfficial/rhelai-sample-taxonomy
      GRANITE_MODEL_PARAMS=8b
      GRANITE_VERSION=3.1
      GRANITE_VERSION_SUFFIX=v2
      NUM_EPOCHS_PHASE_1=1
      NUM_EPOCHS_PHASE_2=1
      INSTRUCTLAB_VERSION=
      PUBLIC_DNS=
      EVAL_ONLY=false
      SDG_ONLY=true
      CHAT_ONLY=false

       

      Bug impact

      • It blocks the SDG phase on NVIDIA and potentially on other accelerators as well.

      Known workaround

      • TBD

      Additional context

      It seems like the tokens aren't in the model's config, which is really strange.
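      One way to check that hypothesis is to tokenize the router block's expected answers with the teacher model's tokenizer and compare the resulting ids against the vocab_size in the model's config. A rough sketch, assuming the teacher checkpoint is available locally; the path and the candidate strings are placeholders:

      from transformers import AutoConfig, AutoTokenizer

      # Placeholder path; point it at the teacher model directory the run actually used.
      model_dir = "/path/to/teacher-model"
      tok = AutoTokenizer.from_pretrained(model_dir)
      cfg = AutoConfig.from_pretrained(model_dir)

      print("tokenizer size:", len(tok), "| config vocab_size:", cfg.vocab_size)

      # Placeholder candidates; the router block restricts generation to ids like these.
      for text in ("yes", "no"):
          ids = tok.encode(text, add_special_tokens=False)
          bad = [i for i in ids if i >= cfg.vocab_size]
          print(f"{text!r}: ids={ids}", "OUT OF VOCAB" if bad else "ok")

      If the tokenizer reports added tokens with ids at or above the config's vocab_size, the server-side check would reject them exactly as in the 400 error above.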
      Looking at the logs, it seems like it's failing to import from Triton when loading vLLM. Have you tried installing an older version?

      ERROR 05-01 22:01:45 [registry.py:347] <frozen runpy>:128: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
      ERROR 05-01 22:01:45 [registry.py:347] Traceback (most recent call last):
      ERROR 05-01 22:01:45 [registry.py:347]   File "/opt/app-root/lib64/python3.11/site-packages/torch/_inductor/runtime/hints.py", line 46, in <module>
      ERROR 05-01 22:01:45 [registry.py:347]     from triton.backends.compiler import AttrsDescriptor
      ERROR 05-01 22:01:45 [registry.py:347] ImportError: cannot import name 'AttrsDescriptor' from 'triton.backends.compiler' (/opt/app-root/lib64/python3.11/site-packages/triton/backends/compiler.py) 
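      A quick way to confirm the Triton side of this (and whether an older version would help, as asked above) is to print the installed versions and retry the failing import inside the same environment. A sketch, assuming a Python shell in the instructlab container:

      import importlib.metadata as md

      for pkg in ("vllm", "torch", "triton"):
          try:
              print(pkg, md.version(pkg))
          except md.PackageNotFoundError:
              print(pkg, "not installed")

      try:
          # The exact import torch._inductor attempts in the traceback above.
          from triton.backends.compiler import AttrsDescriptor  # noqa: F401
          print("AttrsDescriptor import: OK")
      except ImportError as exc:
          print("AttrsDescriptor import failed:", exc)

      If the import fails, the installed Triton is likely newer than what this torch build expects (AttrsDescriptor appears to have been removed in recent Triton releases), so pinning Triton back or moving to a matching torch build would be the directions to try.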
