Red Hat Enterprise Linux AI / RHELAI-3255

SDG fails with: 'struct fields don't match or are in the wrong order'


    • Status: Approved

      To Reproduce

      Steps to reproduce the behavior:

      1. On the 1.4-rc0 image in GCP (8xH100) (registry.stage.redhat.io/rhelai1/bootc-gcp-nvidia-rhel9:1.4-1738349195), download the models:
         ilab model download --repository docker://registry.stage.redhat.io/rhelai1/granite-8b-lab-v1 --release 1.4
         ilab model download --repository docker://registry.stage.redhat.io/rhelai1/skills-adapter-v3 --release 1.4
         ilab model download --repository docker://registry.stage.redhat.io/rhelai1/knowledge-adapter-v3 --release 1.4
         ilab model download --repository docker://registry.stage.redhat.io/rhelai1/mixtral-8x7b-instruct-v0-1 --release 1.4
         ilab model download --repository docker://registry.stage.redhat.io/rhelai1/prometheus-8x7b-v2-0 --release 1.4
         ilab model download --repository docker://registry.stage.redhat.io/rhelai1/granite-8b-starter-v1 --release 1.4
      2. Run SDG (a command like `time ilab data generate`).
      3. After about 75 minutes, SDG stops with this error:
         INFO 2025-02-03 11:07:04,715 instructlab.sdg.datamixing:43: Rebalancing dataset to have 10395 samples ...
         Map (num_proc=8): 100%|##########| 10395/10395 [00:08<00:00, 1294.75 examples/s]
         INFO 2025-02-03 11:07:24,745 instructlab.model.backends.vllm:494: Waiting for GPU VRAM reclamation...
         failed to generate data with exception: struct fields don't match or are in the wrong order: Input fields: struct<content: string, role: string> output fields: struct<role: string, content: string>
      4. The dataset contents at the time of failure:
         [cloud-user@ecosystem-qe-2 2025-02-03_095244]$ pwd
         /var/home/cloud-user/.local/share/instructlab/datasets/2025-02-03_095244
         [cloud-user@ecosystem-qe-2 2025-02-03_095244]$ ls -lsa
         total 9048
            4 drwxr-xr-x. 5 cloud-user cloud-user    4096 Feb  3 11:06 .
            0 drwxr-xr-x. 4 cloud-user cloud-user      50 Feb  3 10:00 ..
            4 drwxr-xr-x. 2 cloud-user cloud-user    4096 Feb  3 11:05 generated_2025-02-03T09_55_05
            4 -rw-r--r--. 1 cloud-user cloud-user     471 Feb  3 11:06 knowledge_recipe_2025-02-03T09_55_05.yaml
         4324 -rw-r--r--. 1 cloud-user cloud-user 4425768 Feb  3 11:06 messages_2025-02-03T09_55_05.jsonl
            4 drwxr-xr-x. 2 cloud-user cloud-user    4096 Feb  3 11:06 node_datasets_2025-02-03T09_55_05
            4 drwxr-xr-x. 3 cloud-user cloud-user    4096 Feb  3 09:55 preprocessed_2025-02-03T09_55_05
            4 -rw-r--r--. 1 cloud-user cloud-user     911 Feb  3 11:06 skills_recipe_2025-02-03T09_55_05.yaml
          932 -rw-r--r--. 1 cloud-user cloud-user  950538 Feb  3 09:55 test_2025-02-03T09_55_05.jsonl
         3768 -rw-r--r--. 1 cloud-user cloud-user 3856014 Feb  3 11:06 train_2025-02-03T09_55_05.jsonl
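The error message suggests the dataset mixes two serializations of the same message struct: some rows were written with fields in {content, role} order and others with {role, content}, and Arrow struct casts are sensitive to field order. As a quick stdlib-only diagnostic sketch (the function name and the assumption that each JSONL row has a "messages" list are mine, not confirmed from the SDG code), one could count which key orders a file such as the messages JSONL above actually contains:

```python
import json
from collections import Counter

def message_key_orders(path):
    """Count the distinct key orderings used by message objects in a JSONL file."""
    orders = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            row = json.loads(line)
            for msg in row.get("messages", []):
                # Python dicts preserve insertion order, so tuple(msg) reflects
                # the field order the message was serialized with.
                orders[tuple(msg)] += 1
    return orders
```

If this returns more than one distinct key tuple (e.g. both ("content", "role") and ("role", "content")), the file mixes struct layouts, which would trigger the Arrow cast failure seen above.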

      Expected behavior

      • SDG completes successfully.

      Screenshots

      • Attached Image

      Device Info (please complete the following information):

      • Hardware Specs: GCP a3-highgpu-8g
      • OS Version: RHEL AI 1.4 rc0
      • InstructLab Version: ilab, version 0.23.1
      • Bootc image (from sudo bootc status --format json | jq .status.booted.image.image.image; matches the image used in step 1): registry.stage.redhat.io/rhelai1/bootc-gcp-nvidia-rhel9:1.4-1738349195
        [cloud-user@ecosystem-qe-2 2025-02-03_095244]$ ilab system info                                                                 
        ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no                                                                                   
        ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no                                                                                   
        ggml_cuda_init: found 8 CUDA devices:                                                                                         
          Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
          Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
          Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
          Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
          Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
          Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
          Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
          Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes                                                           
        Platform:                                                                                                                     
          sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]                                 
          sys.platform: linux                                                                                                         
          os.name: posix                                                                                                             
          platform.release: 5.14.0-427.50.1.el9_4.x86_64                                                                             
          platform.machine: x86_64                                                                                                   
          platform.node: ecosystem-qe-2                                                  
          platform.python_version: 3.11.7                                                                                             
          os-release.ID: rhel                                                                                                          
          os-release.VERSION_ID: 9.4                                                                                                 
          os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)                                                                 
          memory.total: 1842.58 GB                                                                                                   
          memory.available: 1831.53 GB                                                                                               
          memory.used: 2.89 GB                                                                                                       
        InstructLab:
          instructlab.version: 0.23.1
          instructlab-dolomite.version: 0.2.0
          instructlab-eval.version: 0.5.1
          instructlab-quantize.version: 0.1.0
          instructlab-schema.version: 0.4.2
          instructlab-sdg.version: 0.7.0
          instructlab-training.version: 0.7.0
        Torch:
          torch.version: 2.5.1
          torch.backends.cpu.capability: AVX512
          torch.version.cuda: 12.4
          torch.version.hip: None
          torch.cuda.available: True
          torch.backends.cuda.is_built: True
          torch.backends.mps.is_built: False
          torch.backends.mps.is_available: False
          torch.cuda.bf16: True
          torch.cuda.current.device: 0
          torch.cuda.0.name: NVIDIA H100 80GB HBM3
          torch.cuda.0.free: 78.6 GB
          torch.cuda.0.total: 79.1 GB
          torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          torch.cuda.1.name: NVIDIA H100 80GB HBM3
          torch.cuda.1.free: 78.6 GB
          torch.cuda.1.total: 79.1 GB
          torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          torch.cuda.2.name: NVIDIA H100 80GB HBM3
          torch.cuda.2.free: 78.6 GB
          torch.cuda.2.total: 79.1 GB
          torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          torch.cuda.3.name: NVIDIA H100 80GB HBM3
          torch.cuda.3.free: 78.6 GB
          torch.cuda.3.total: 79.1 GB
          torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          torch.cuda.4.name: NVIDIA H100 80GB HBM3
          torch.cuda.4.free: 78.6 GB
          torch.cuda.4.total: 79.1 GB
          torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          torch.cuda.5.name: NVIDIA H100 80GB HBM3
          torch.cuda.5.free: 78.6 GB
          torch.cuda.5.total: 79.1 GB
          torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          torch.cuda.6.name: NVIDIA H100 80GB HBM3
          torch.cuda.6.free: 78.6 GB
          torch.cuda.6.total: 79.1 GB
          torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          torch.cuda.7.name: NVIDIA H100 80GB HBM3
          torch.cuda.7.free: 78.6 GB
          torch.cuda.7.total: 79.1 GB
          torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        llama_cpp_python:
          llama_cpp_python.version: 0.3.2
          llama_cpp_python.supports_gpu_offload: True
        

      Bug impact

      • Synthetic data generation cannot complete on RHEL AI 1.4 rc0: `ilab data generate` fails after roughly 75 minutes, so no mixed training data is produced.

      Known workaround

      • None known at the time of reporting.
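As an untested mitigation sketch only (the canonical order, file paths, and the "messages" key layout are illustrative assumptions, not a verified fix for this bug), one could rewrite an affected JSONL file so every message object uses a single consistent field order before datasets are mixed:

```python
import json

# Assumed canonical field order; either order should work as long as
# every message in the datasets being mixed uses the same one.
CANONICAL = ("role", "content")

def normalize_messages(src_path, dst_path):
    """Rewrite a messages JSONL file so every message dict lists its keys
    in CANONICAL order, with any extra keys appended afterwards."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            row = json.loads(line)
            row["messages"] = [
                # canonical keys first, then any remaining keys in original order
                {**{k: m[k] for k in CANONICAL if k in m},
                 **{k: v for k, v in m.items() if k not in CANONICAL}}
                for m in row.get("messages", [])
            ]
            dst.write(json.dumps(row) + "\n")
```

This only changes serialization order, not content, so the resulting structs should unify under a single Arrow schema; whether SDG can be pointed at the rewritten file mid-run is not established here.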


              osilkin@redhat.com Oleg Silkin
              cvultur@redhat.com Constantin Daniel Vultur
              Votes: 0
              Watchers: 13
