Red Hat Enterprise Linux AI / RHELAI-2395

[instructlab/sdg] resolve_ocr_options() causes RHEL AI sdg with PDF to hang indefinitely

      [2693022404] Upstream Reporter: Tyler Lisowski
      Upstream issue status: Closed
      Upstream description:

      I have not been able to get a lower-level debug log, but when running SDG on a sample PDF document on RHEL AI, resolve_ocr_options() hangs indefinitely.

      Steps to reproduce:

      1. On a RHEL AI system, run ilab data generate against a taxonomy that references a PDF. The example I used is here: https://github.com/relyt0925/taxonomy-doclingpoc/tree/main
      2. Look at the logs: when resolve_ocr_options is run, the process hangs indefinitely after the last log line shown below:
      INFO 2024-11-26 03:22:51,619 instructlab.sdg.utils.taxonomy:147: Processing files...
      INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:153: Pattern 'phoenix.pdf' matched 1 files.
      INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:157: Processing file: /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
      INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:172: Loading PDF document from /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
      INFO 2024-11-26 03:22:51,622 instructlab.sdg.utils.taxonomy:182: PDF '/root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf' has 6 pages.
      INFO 2024-11-26 03:22:56,486 instructlab.sdg.utils.taxonomy:218: Unloaded PDF document: /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
      INFO 2024-11-26 03:22:59,815 instructlab.sdg.generate_data:408: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
      INFO 2024-11-26 03:23:01,883 instructlab.sdg.utils.chunkers:393: Successfully loaded tokenizer from: /instructlab/models/mixtral-8x7b-instruct-v0-1
      INFO 2024-11-26 03:23:05,050 instructlab.sdg.utils.chunkers:255: Found the docling models
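
      Since no lower-level debug log was available, one possible way to see where the process is stuck is Python's standard faulthandler module, which can dump the stack of every thread if a call has not returned after a timeout. The following is only a sketch: run_with_hang_dump is a hypothetical helper, not part of instructlab/sdg, and wrapping the suspect call this way assumes you can patch the code in the image.

```python
import faulthandler
import sys


def run_with_hang_dump(func, timeout_s, out=sys.stderr):
    """Call func(); if it has not returned after timeout_s seconds,
    write the tracebacks of all threads to `out` and keep waiting.

    Hypothetical debugging helper, not part of instructlab/sdg.
    `out` must be a real file object (it needs a file descriptor).
    """
    # Arm a watchdog that dumps every thread's stack after the timeout.
    # exit=False leaves the process running so it can still be inspected.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=out)
    try:
        return func()
    finally:
        # Disarm the watchdog once the call returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

      For example, wrapping the suspect call as run_with_hang_dump(lambda: suspect_call(), 60) would print the stack to stderr after 60 seconds, where suspect_call stands in for whatever invokes resolve_ocr_options in your build. The dump shows Python frames only, so a hang inside native code would appear as the last Python frame that entered it.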

      I built a custom image with an SDG patch that comments out that section: https://github.com/relyt0925/sdg/commit/08343204e6fda0ae5473f9e99a8b77271ca77bde. After rerunning with that image, generation gets to the point of processing documents:

      time="2024-11-26T04:09:27Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
      INFO 2024-11-26 04:09:29,412 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
      INFO 2024-11-26 04:09:29,413 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
      INFO 2024-11-26 04:09:29,413 numexpr.utils:161: NumExpr defaulting to 16 threads.
      INFO 2024-11-26 04:09:30,512 datasets:59: PyTorch version 2.4.1 available.
      INFO 2024-11-26 04:09:32,013 instructlab.data.generate_data:87: Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/root/taxonomy-doclingpoc/' taxonomy, against https://781d2e7c-us-east.lb.appdomain.cloud/v1 server
      INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:147: Processing files...
      INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:153: Pattern 'phoenix.pdf' matched 1 files.
      INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:157: Processing file: /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
      INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:172: Loading PDF document from /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
      INFO 2024-11-26 04:09:32,404 instructlab.sdg.utils.taxonomy:182: PDF '/root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf' has 6 pages.
      INFO 2024-11-26 04:09:37,265 instructlab.sdg.utils.taxonomy:218: Unloaded PDF document: /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
      INFO 2024-11-26 04:09:40,545 instructlab.sdg.generate_data:408: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
      INFO 2024-11-26 04:09:42,690 instructlab.sdg.utils.chunkers:393: Successfully loaded tokenizer from: /instructlab/models/mixtral-8x7b-instruct-v0-1
      INFO 2024-11-26 04:09:45,790 instructlab.sdg.utils.chunkers:255: Found the docling models
      INFO 2024-11-26 04:09:46,050 docling.document_converter:202: Going to convert document batch...

      My custom test image is quay.io/relyt09250/testinstructlabbuilds:121withsdgpatch


      Upstream URL: https://github.com/instructlab/sdg/issues/410
