  1. Red Hat Enterprise Linux AI
  2. RHELAI-3653

SDG fails against markdown files due to older docling versions

      Apr 10:

      • Expecting confirmation from Courtney by EOD April 10 that we have everything we need upstream (u/s) to make the downstream (d/s) build.
      • Testing is expected to start by April 14 and to be completed by April 21.

      To Reproduce

      Steps to reproduce the behavior:

      Deploy RHEL AI 1.4.x on a server with enough resources to complete an SDG run, and initialize ilab correctly. Then run SDG against a markdown file of the kind described under Bug impact.

      Ben reproduced the error with the document shared by rhn-support-jharmiso:

        File "/home/bbrownin/tmp/docling-index-out-of-range/venv/lib/python3.11/site-packages/docling/pipeline/simple_pipeline.py", line 41, in _build_document
          conv_res.document = conv_res.input._backend.convert()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/bbrownin/tmp/docling-index-out-of-range/venv/lib/python3.11/site-packages/docling/backend/md_backend.py", line 340, in convert
          self.iterate_elements(parsed_ast, 0, doc, None)
        File "/home/bbrownin/tmp/docling-index-out-of-range/venv/lib/python3.11/site-packages/docling/backend/md_backend.py", line 306, in iterate_elements
          self.iterate_elements(child, depth + 1, doc, parent_element)
        File "/home/bbrownin/tmp/docling-index-out-of-range/venv/lib/python3.11/site-packages/docling/backend/md_backend.py", line 166, in iterate_elements
          f" - Heading level {element.level}, content: {element.children[0].children}"
                                                        ~~~~~~~~~~~~~~~~^^^
      IndexError: list index out of range

      The error is caused by an older docling version.
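The failing expression in the traceback indexes children[0] on a heading element that has no children. The sketch below reproduces that failure mode in isolation and shows one way to guard the access. The Node class and both function names are hypothetical stand-ins for illustration; this is not docling's code, nor the fix that docling actually applied.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed markdown AST node (not docling's class)."""
    level: int = 0
    children: list = field(default_factory=list)


def describe_heading_unsafe(element: Node) -> str:
    # Same shape as the expression in the traceback: indexing children[0]
    # raises IndexError when the heading has no children.
    return f" - Heading level {element.level}, content: {element.children[0].children}"


def describe_heading_safe(element: Node) -> str:
    # Guard the access so an empty heading yields an empty content list.
    content = element.children[0].children if element.children else []
    return f" - Heading level {element.level}, content: {content}"


# Assumption: a bare "#" line parses to a heading with no children.
empty_heading = Node(level=1)
```

Calling describe_heading_unsafe(empty_heading) raises IndexError, matching the last frame of the traceback, while describe_heading_safe(empty_heading) returns a string.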

      Expected behavior

      • SDG pipeline should run successfully

      Device Info (please complete the following information):

        • All/Any

      Bug impact

      • SDG fails whenever a user supplies a markdown file in which an unescaped special character that triggers markdown handling, such as * (unordered list item) or # (heading), appears with no content after it.

      Known workaround

      • Avoid unescaped special characters in markdown files.
      • Ben proposed avoiding markdown headings that stand by themselves with no content, as found in the document shared by the field teams. See https://github.com/DS4SD/docling/pull/843
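To apply the workaround before starting an SDG run, a markdown file can be scanned for the constructs described above. The following is a minimal sketch, assuming that an "empty" construct is a line containing only a heading marker or only a list marker. The function name find_empty_markers and both regular expressions are illustrative and not part of docling or ilab, and the check is a heuristic rather than a full markdown parse.

```python
import re

# Assumption: an "empty" construct is a line holding only a heading marker
# (1-6 '#' characters) or only a list marker ('*', '-', '+'), indented by at
# most three spaces and optionally followed by trailing whitespace.
EMPTY_HEADING = re.compile(r"^ {0,3}#{1,6}[ \t]*$")
EMPTY_LIST_ITEM = re.compile(r"^ {0,3}[*+-][ \t]*$")


def find_empty_markers(markdown_text: str) -> list[tuple[int, str]]:
    """Return (line_number, kind) pairs for empty headings and list items."""
    findings = []
    in_fence = False
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        # Skip fenced code blocks, where '#' and '*' carry no markdown meaning.
        if line.lstrip().startswith(("```", "~~~")):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        if EMPTY_HEADING.match(line):
            findings.append((number, "empty heading"))
        elif EMPTY_LIST_ITEM.match(line):
            findings.append((number, "empty list item"))
    return findings
```

Each reported line can then be given content, escaped, or deleted before the file is passed to SDG. A lone "-" directly under a paragraph is also flagged, although markdown may treat it as a heading underline, so the findings should be reviewed rather than fixed blindly.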

       

              rh-ee-esivaram Eshwar Prasad Sivaramakrishnan
              rh-ee-asaluja Aditi Saluja
              Kamesh Akella Kamesh Akella