
RHELAI-4084: SDG generation phase fails with "allowed_token_ids contains out-of-vocab token id" error


      To Reproduce

      Steps to reproduce the behavior:

      1. Go to DIIP and run a new SDG_ONLY pipeline with the latest RHEL AI 1.5 bits
      2. Watch the run

      DIIP run logs: https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9942717767

      IBM Cloud failure logs: https://gitlab.com/redhat/rhel-ai/diip/-/jobs/9936008492 

      Expected behavior

      • The SDG phase should run successfully and generate the data as expected.
      • Instead, it fails with the error below:

       

      failed to generate data with exception: PipelineBlockError(<class 'instructlab.sdg.blocks.llmblock.LLMBlock'>/router): Error code: 400 - {'object': 'error', 'message': 'allowed_token_ids contains out-of-vocab token id!', 'type': 'BadRequestError', 'param': None, 'code': 400} 
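      If it helps to isolate where this check fires: the 400 comes back from the model-serving endpoint, and the same rejection can be triggered directly against a vLLM OpenAI-compatible server by passing an allowed_token_ids value outside the model's vocabulary. A minimal sketch, assuming a locally served teacher model; the base_url, model name, and token id below are placeholders, not values from the DIIP run:

      from openai import OpenAI

      # Placeholder endpoint and model name; adjust to wherever the teacher model is served.
      client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

      try:
          client.completions.create(
              model="mixtral-8x7b-instruct",        # placeholder teacher model name
              prompt="Answer yes or no:",
              max_tokens=1,
              # Deliberately out-of-range id; the server should reject it with the same
              # "allowed_token_ids contains out-of-vocab token id!" 400 error.
              extra_body={"allowed_token_ids": [10_000_000]},
          )
      except Exception as exc:
          print(exc)

      If a request like this succeeds with small, clearly valid ids but the SDG router block still fails, the bad ids are presumably being produced by the pipeline's token lookup rather than by the request path itself.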

       

      Failed Accelerator: NVIDIA

      Failed Cloud providers: IBM Cloud, AWS

      Device Info:

       

      CHAT_MODEL=
      TEACHER_MODEL=
      TEACHER_MODEL_FAMILY=mixtral
      STARTER_MODEL=
      MODEL_SOURCE=s3
      REGISTRY=registry.stage.redhat.io
      REGISTRY_NAMESPACE=rhelai1
      BOOTC_IMAGE=bootc-nvidia-rhel9
      INSTRUCTLAB_IMAGE=instructlab-nvidia-rhel9
      registry.redhat.io/rhelai1/instructlab-nvidia-rhel9  1.5.0 306b4cceb073
      BOOTC_IMAGE_VERSION=1.5
      INSTANCE_NAME=rhelai-ci-runner
      MODEL_VERSION=1.5
      INSTANCE_TYPE=g6e.12xlarge
      ALLOW_SIMILAR_INSTANCE_TYPES=true
      SYSTEM_PROFILE=
      INSTALL_TYPE=bootc_install
      CLOUD_TYPE=aws
      TERMINATE_INSTANCE=true
      USER_NAME=ec2-user
      TAXONOMY_REPO=https://github.com/RedHatOfficial/rhelai-sample-taxonomy
      GRANITE_MODEL_PARAMS=8b
      GRANITE_VERSION=3.1
      GRANITE_VERSION_SUFFIX=v2
      NUM_EPOCHS_PHASE_1=1
      NUM_EPOCHS_PHASE_2=1
      INSTRUCTLAB_VERSION=
      PUBLIC_DNS=
      EVAL_ONLY=false
      SDG_ONLY=true
      CHAT_ONLY=false

       

      Bug impact

      • It blocks the SDG phase on NVIDIA and potentially on other accelerators as well.

      Known workaround

      • TBD

      Additional context

      It seems like the tokens aren't in the model's config, which is really strange.
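      One way to check that hypothesis is to tokenize the router block's expected answers with the teacher model's tokenizer and compare the resulting ids against the vocab_size in the model's config. A rough sketch, assuming the teacher checkpoint is available locally; the path and the candidate strings are placeholders:

      from transformers import AutoConfig, AutoTokenizer

      # Placeholder path; point it at the teacher model directory the run actually used.
      model_dir = "/path/to/teacher-model"
      tok = AutoTokenizer.from_pretrained(model_dir)
      cfg = AutoConfig.from_pretrained(model_dir)

      print("tokenizer size:", len(tok), "| config vocab_size:", cfg.vocab_size)

      # Placeholder candidates; the router block restricts generation to ids like these.
      for text in ("yes", "no"):
          ids = tok.encode(text, add_special_tokens=False)
          bad = [i for i in ids if i >= cfg.vocab_size]
          print(f"{text!r}: ids={ids}", "OUT OF VOCAB" if bad else "ok")

      If the tokenizer reports added tokens with ids at or above the config's vocab_size, the server-side check would reject them exactly as in the 400 error above.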
      Looking at the logs, it seems like it's failing to import from Triton when loading vLLM. Have you tried installing an older version?

      ERROR 05-01 22:01:45 [registry.py:347] <frozen runpy>:128: RuntimeWarning: 'vllm.model_executor.models.registry' found in sys.modules after import of package 'vllm.model_executor.models', but prior to execution of 'vllm.model_executor.models.registry'; this may result in unpredictable behaviour
      ERROR 05-01 22:01:45 [registry.py:347] Traceback (most recent call last):
      ERROR 05-01 22:01:45 [registry.py:347]   File "/opt/app-root/lib64/python3.11/site-packages/torch/_inductor/runtime/hints.py", line 46, in <module>
      ERROR 05-01 22:01:45 [registry.py:347]     from triton.backends.compiler import AttrsDescriptor
      ERROR 05-01 22:01:45 [registry.py:347] ImportError: cannot import name 'AttrsDescriptor' from 'triton.backends.compiler' (/opt/app-root/lib64/python3.11/site-packages/triton/backends/compiler.py) 
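      A quick way to confirm the Triton side of this (and whether an older version would help, as asked above) is to print the installed versions and retry the failing import inside the same environment. A sketch, assuming a Python shell in the instructlab container:

      import importlib.metadata as md

      for pkg in ("vllm", "torch", "triton"):
          try:
              print(pkg, md.version(pkg))
          except md.PackageNotFoundError:
              print(pkg, "not installed")

      try:
          # The exact import torch._inductor attempts in the traceback above.
          from triton.backends.compiler import AttrsDescriptor  # noqa: F401
          print("AttrsDescriptor import: OK")
      except ImportError as exc:
          print("AttrsDescriptor import failed:", exc)

      If the import fails, the installed Triton is likely newer than what this torch build expects (AttrsDescriptor appears to have been removed in recent Triton releases), so pinning Triton back or moving to a matching torch build would be the directions to try.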
