Red Hat Enterprise Linux AI / RHELAI-2712

Traceback prints while exiting the VllmWorkerProcess process

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Undefined
    • Affects Version: rhelai-1.3
    • Component: InstructLab - Core

      Got the following Python tracebacks while the VllmWorkerProcess processes were exiting during training:

              "100% 80/80 [00:48<00:00,  1.64it/s]",
              "INFO 12-12 23:15:24 launcher.py:57] Shutting down FastAPI HTTP server.",
              "INFO 12-12 23:15:24 multiproc_worker_utils.py:137] Terminating local vLLM worker processes",
              " INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting",
              " INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting",
              " INFO 12-12 23:15:24 multiproc_worker_utils.py:244] Worker exiting",
              " Process VllmWorkerProcess:",
              " Traceback (most recent call last):",
              "   File \"/usr/lib64/python3.11/multiprocessing/process.py\", line 314, in _bootstrap",
              "     self.run()",
              "   File \"/usr/lib64/python3.11/multiprocessing/process.py\", line 108, in run",
              "     self._target(*self._args, **self._kwargs)",
              "   File \"/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py\", line 384, in signal_handler",
              "     raise KeyboardInterrupt(\"MQLLMEngine terminated\")",
              " KeyboardInterrupt: MQLLMEngine terminated",
              " Process VllmWorkerProcess:",
              " Traceback (most recent call last):",
              "   File \"/usr/lib64/python3.11/multiprocessing/process.py\", line 314, in _bootstrap",
              "     self.run()",
              "   File \"/usr/lib64/python3.11/multiprocessing/process.py\", line 108, in run",
              "     self._target(*self._args, **self._kwargs)",
              "   File \"/opt/app-root/lib64/python3.11/site-packages/vllm/engine/multiprocessing/engine.py\", line 384, in signal_handler",
              "     raise KeyboardInterrupt(\"MQLLMEngine terminated\")",
              " KeyboardInterrupt: MQLLMEngine terminated",
              ":     Shutting down",
              ":     Waiting for application shutdown.",
              ":     Application shutdown complete.",
              "/usr/lib64/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown", 

      The training process itself completed successfully!
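
      The trailing resource_tracker UserWarning in the log above is a separate piece of shutdown noise: it fires whenever a multiprocessing.shared_memory segment is still registered with the resource tracker when the interpreter exits. A minimal sketch of how such a warning arises (illustrative only; this report does not identify which vLLM object leaks the segment):

      from multiprocessing import shared_memory

      # Creating a named segment registers it with the resource tracker.
      shm = shared_memory.SharedMemory(create=True, size=1024)
      shm.close()
      # shm.unlink() is never called, so the segment is still registered at
      # interpreter shutdown and resource_tracker warns:
      #   UserWarning: resource_tracker: There appear to be 1 leaked
      #   shared_memory objects to clean up at shutdown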

      Env info:

      [cloud-user@nvd-srv-28 ~]$ sudo bootc status
      apiVersion: org.containers.bootc/v1alpha1
      kind: BootcHost
      metadata:
        name: host
      spec:
        image:
          image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952
          transport: registry
        bootOrder: default
      status:
        staged: null
        booted:
          image:
            image:
              image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3.1-1734019952
              transport: registry
            version: 9.20241104.0
            timestamp: null
            imageDigest: sha256:407e27fd481d4e329b2feefba4f56470fd043931f205a313f6427c3f80c3a890
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: aa4198c1e3fec5f70dbdad2cc7c35557e54a931397687fdd0fdeab9224797047
            deploySerial: 0
        rollback:
          image:
            image:
              image: registry.stage.redhat.io/rhelai1/bootc-nvidia-rhel9:1.3
              transport: registry
            version: 9.20241104.0
            timestamp: null
            imageDigest: sha256:1300ac8b47dd05579b84254dcc369acd8e4f02d06a1372eb801aa71150250d91
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: 05cc1a174bc8745ecf84aa8d5ca23969890a3bfb0c4bb33549e263c1c273af62
            deploySerial: 0
        rollbackQueued: false
        type: bootcHost 
      [cloud-user@nvd-srv-28 ~]$ sudo podman images --format json
      [
          {
              "Id": "5a5e2ed36334199b762eeffd293099d96b51df0f6ef65bb503be0ddbb0d45326",
              "ParentId": "",
              "RepoTags": null,
              "RepoDigests": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9@sha256:525ab53de3829cac1a9aabb73194f49e22da8fdcf12a01c56ece961300cdab0d"
              ],
              "Size": 18138403565,
              "SharedSize": 0,
              "VirtualSize": 18138403565,
              "Labels": {
                  "WHEEL_RELEASE": "v1.3.1183+rhelai-cuda-ubi9",
                  "architecture": "x86_64",
                  "build-date": "2024-12-11T21:09:57",
                  "com.redhat.component": "ubi9-container",
                  "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
                  "description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
                  "distribution-scope": "public",
                  "io.buildah.version": "1.38.0-dev",
                  "io.k8s.description": "The Universal Base Image is designed and engineered to be the base layer for all of your containerized applications, middleware and utilities. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
                  "io.k8s.display-name": "Red Hat Universal Base Image 9",
                  "io.openshift.expose-services": "",
                  "io.openshift.tags": "base rhel9",
                  "maintainer": "Red Hat, Inc.",
                  "name": "ubi9",
                  "org.opencontainers.image.vendor": "Red Hat, Inc.",
                  "release": "1214.1729773476",
                  "summary": "Provides the latest release of Red Hat Universal Base Image 9.",
                  "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9/images/9.4-1214.1729773476",
                  "vcs-ref": "e212c7252addad2e5342212f9ed3ac41bc462c87",
                  "vcs-type": "git",
                  "vendor": "Red Hat, Inc.",
                  "version": "9.4"
              },
              "Containers": 0,
              "ReadOnly": true,
              "Names": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3.1-1733951397"
              ],
              "Digest": "sha256:525ab53de3829cac1a9aabb73194f49e22da8fdcf12a01c56ece961300cdab0d",
              "History": [
                  "registry.stage.redhat.io/rhelai1/instructlab-nvidia-rhel9:1.3.1-1733951397"
              ],
              "Created": 1733953958,
              "CreatedAt": "2024-12-11T21:52:38Z"
          }
      ] 
      [cloud-user@nvd-srv-28 ~]$ ilab system info
      Platform:
        sys.version: 3.11.7 (main, Oct  9 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.42.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 250.89 GB
        memory.available: 239.24 GB
        memory.used: 9.88 GB
      InstructLab:
        instructlab.version: 0.21.2
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.4.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.1
        instructlab-sdg.version: 0.6.1
        instructlab-training.version: 0.6.1
      Torch:
        torch.version: 2.4.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: NVIDIA L40S
        torch.cuda.0.free: 44.1 GB
        torch.cuda.0.total: 44.5 GB
        torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: NVIDIA L40S
        torch.cuda.1.free: 44.1 GB
        torch.cuda.1.total: 44.5 GB
        torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: NVIDIA L40S
        torch.cuda.2.free: 44.1 GB
        torch.cuda.2.total: 44.5 GB
        torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: NVIDIA L40S
        torch.cuda.3.free: 44.1 GB
        torch.cuda.3.total: 44.5 GB
        torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
      llama_cpp_python:
        llama_cpp_python.version: 0.2.79
        llama_cpp_python.supports_gpu_offload: True 

       

      Assignee: Unassigned
      Reporter: Ariel Opincaru (aopincar)