Epic
Resolution: Done
Critical
None
Additional tesseract-langpack
False
False
In Progress
AIPCC-1837 - multi-language support for vllm and instructlab
0% To Do, 0% In Progress, 100% Done
InstructLab uses Docling to process and chunk documents. Docling depends on an OCR engine to convert images to text, e.g. in PDFs with embedded images. RHELAI uses the Tesseract OCR engine. The Tesseract RPM package is in RHEL 9.
InstructLab Multilingual Model Support adds support for other languages such as French, German, Italian, and Spanish. The Tesseract package in RHEL 9 ships only tesseract-langpack-eng. The additional langpack RPMs are built but then excluded from the Errata product listing. See the tesseract-tessdata erratum: https://errata.devel.redhat.com/advisory/91911/builds
Investigate how we can provide the required langpacks in our layered product:
- Can we ship the langpack RPMs of build tesseract-tessdata-4.1.0-3.el9 in our layered product?
- Do we need a new build of tesseract-tessdata?
- Should we build the latest version of Tesseract? RHEL 9 has tesseract-4.1.1 with leptonica-1.80. The latest versions in Fedora are tesseract-5.5.0 with leptonica-1.85. tessdata is at 4.1.0 everywhere. (Not required at the moment.)
Goals:
- Primary: Deliver language packs for at least French, German, Italian, and Spanish in RHELAI 1.5 application images. Packages must be available for installation before 2025-04-08 (RHELAI 1.5 RPM freeze date). RHELAI 1.5 will be based on RHEL 9.4 EUS.
- Secondary: Agree on long-term plans for additional language packs for RHEL 9.6 and RHEL 10 (delivery, maintenance, QE work).
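A possible way to verify the primary goal inside an application image is sketched below. It parses the output of `tesseract --list-langs` and reports which langpacks are still missing. This is only a sketch under assumptions: the language codes (fra, deu, ita, spa) are the usual Tesseract traineddata names, the RPM names are assumed to follow the same pattern as tesseract-langpack-eng, and the helper names (`parse_list_langs`, `missing_langpacks`, `check_image`) are hypothetical.

```python
import shutil
import subprocess

# Tesseract traineddata codes for the four languages named in the primary goal.
# Assumption: these are the standard codes used by the tessdata files.
REQUIRED_LANGS = {"French": "fra", "German": "deu", "Italian": "ita", "Spanish": "spa"}


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ...:" header.
        if not line or line.endswith(":"):
            continue
        langs.add(line)
    return langs


def missing_langpacks(output: str, required=REQUIRED_LANGS) -> list[str]:
    """Return assumed RPM names for required languages that are not installed."""
    installed = parse_list_langs(output)
    # Assumption: RPM names follow the tesseract-langpack-<code> pattern.
    return sorted(
        f"tesseract-langpack-{code}"
        for code in required.values()
        if code not in installed
    )


def check_image() -> list[str]:
    """Run the check against the local tesseract binary, if one is present."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some Tesseract versions print the list to stderr, so read both streams.
    return missing_langpacks(proc.stdout + "\n" + proc.stderr)
```

On an image with only the English langpack, `check_image()` would be expected to return all four assumed package names; an empty list would indicate that the four goal languages are available to Tesseract.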
is related to: RHEL-50647 No tesseract langpacks other than English are available (In Progress)
relates to: AIPCC-669 Investigate if AIPCC should upgrade tesseract in el9 (Closed)