- Bug
- Resolution: Not a Bug
Proposed
[2693022404] Upstream Reporter: Tyler Lisowski
Upstream issue status: Closed
Upstream description:
I have not been able to get a lower-level debug log, but when running SDG on a sample PDF document on RHEL AI, resolve_ocr_options hangs indefinitely.
Steps to reproduce:
- On RHEL AI, run ilab data generate on a PDF taxonomy. The example I used is here: https://github.com/relyt0925/taxonomy-doclingpoc/tree/main
- Look at the logs: when resolve_ocr_options is run, the process hangs indefinitely at:
INFO 2024-11-26 03:22:51,619 instructlab.sdg.utils.taxonomy:147: Processing files...
INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:153: Pattern 'phoenix.pdf' matched 1 files.
INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:157: Processing file: /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
INFO 2024-11-26 03:22:51,620 instructlab.sdg.utils.taxonomy:172: Loading PDF document from /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
INFO 2024-11-26 03:22:51,622 instructlab.sdg.utils.taxonomy:182: PDF '/root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf' has 6 pages.
INFO 2024-11-26 03:22:56,486 instructlab.sdg.utils.taxonomy:218: Unloaded PDF document: /root/outputdir/documents-2024-11-26T03_22_51/phoenix.pdf
INFO 2024-11-26 03:22:59,815 instructlab.sdg.generate_data:408: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-11-26 03:23:01,883 instructlab.sdg.utils.chunkers:393: Successfully loaded tokenizer from: /instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2024-11-26 03:23:05,050 instructlab.sdg.utils.chunkers:255: Found the docling models
I built a custom image with an SDG patch that comments out that section (https://github.com/relyt0925/sdg/commit/08343204e6fda0ae5473f9e99a8b77271ca77bde), then reran it. With the patch, the run reaches the point of processing documents:
time="2024-11-26T04:09:27Z" level=warning msg="The input device is not a TTY. The --tty and --interactive flags might not work properly"
INFO 2024-11-26 04:09:29,412 numexpr.utils:145: Note: detected 80 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
INFO 2024-11-26 04:09:29,413 numexpr.utils:148: Note: NumExpr detected 80 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO 2024-11-26 04:09:29,413 numexpr.utils:161: NumExpr defaulting to 16 threads.
INFO 2024-11-26 04:09:30,512 datasets:59: PyTorch version 2.4.1 available.
INFO 2024-11-26 04:09:32,013 instructlab.data.generate_data:87: Generating synthetic data using '/usr/share/instructlab/sdg/pipelines/agentic' pipeline, '/instructlab/models/mixtral-8x7b-instruct-v0-1' model, '/root/taxonomy-doclingpoc/' taxonomy, against https://781d2e7c-us-east.lb.appdomain.cloud/v1 server
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:147: Processing files...
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:153: Pattern 'phoenix.pdf' matched 1 files.
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:157: Processing file: /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
INFO 2024-11-26 04:09:32,401 instructlab.sdg.utils.taxonomy:172: Loading PDF document from /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
INFO 2024-11-26 04:09:32,404 instructlab.sdg.utils.taxonomy:182: PDF '/root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf' has 6 pages.
INFO 2024-11-26 04:09:37,265 instructlab.sdg.utils.taxonomy:218: Unloaded PDF document: /root/outputdir/documents-2024-11-26T04_09_32/phoenix.pdf
INFO 2024-11-26 04:09:40,545 instructlab.sdg.generate_data:408: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-11-26 04:09:42,690 instructlab.sdg.utils.chunkers:393: Successfully loaded tokenizer from: /instructlab/models/mixtral-8x7b-instruct-v0-1
INFO 2024-11-26 04:09:45,790 instructlab.sdg.utils.chunkers:255: Found the docling models
INFO 2024-11-26 04:09:46,050 docling.document_converter:202: Going to convert document batch...
My custom test image is quay.io/relyt09250/testinstructlabbuilds:121withsdgpatch
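Since no lower-level debug log was available for the hang, one generic way to see where a Python process is stuck is the standard-library faulthandler module. The sketch below is not part of instructlab or docling; run_with_hang_watchdog and slow_step are hypothetical names, and slow_step only stands in for whatever call appears to hang. It assumes the hanging call can be wrapped from Python.

```python
import faulthandler
import tempfile
import time


def run_with_hang_watchdog(func, timeout_s, log_file):
    """Call func(); if it runs longer than timeout_s, dump all thread stacks to log_file."""
    # dump_traceback_later starts a watchdog thread that writes the tracebacks
    # of all threads after timeout_s seconds. With exit=False the call itself
    # is not interrupted, so a real hang would still hang after the dump.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return func()
    finally:
        # Cancel the pending dump if the call returned before the timeout.
        faulthandler.cancel_dump_traceback_later()


def slow_step():
    # Hypothetical stand-in for the call that hangs; here it only sleeps.
    time.sleep(1.0)
    return "done"


# faulthandler writes straight to the file descriptor, so a real file is needed.
with tempfile.TemporaryFile(mode="w+") as log_file:
    result = run_with_hang_watchdog(slow_step, timeout_s=0.2, log_file=log_file)
    log_file.seek(0)
    trace_text = log_file.read()

print(trace_text)
```

The dump names the Python frame that was executing when the timeout fired, which narrows down the stuck call even when the actual wait happens inside native code.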
Upstream URL: https://github.com/instructlab/sdg/issues/410
- relates to
  - RHELAI-2396 RHEL AI 1.3 Docling fails to load OCR support with missing system library (Closed)
- links to