Red Hat Enterprise Linux AI / RHELAI-2603

[ SDK API: Preprocessing Phase 1 ] Enable RAG POCs : Decouple docling output from SDG pipeline


      Feature Overview (mandatory - Complete while in New status)
       
      Splitting document ingestion/pre-processing from SDG yields the flow [pre-processing] > (artifact in instructlab schema) > [sdg]. Today that "artifact" is an ephemeral intermediate step, tightly coupled to the monolithic SDG pipeline.

      This feature makes sure we can generate that artifact by running only the pre-processing part, without running the full SDG pipeline. (Splitting out this functionality was already part of the SDG 2.0 work, but outputting the artifact on its own was skipped.)
       
      Goals (mandatory - Complete while in New status)

      • Docling output artifact can be consumed by other work streams like RAG
      • Docling output artifact can be generated independent of core sdg functionality

       

      Requirements (mandatory - Complete while in Refinement status):

      1. Docling output generation logic is moved to the preprocessing API.
      2. Docling output can be obtained by invoking the preprocessing API instead of a full SDG run.
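
The two requirements above can be sketched as follows. This is a minimal illustration, not the actual preprocessing API: the names `preprocess`, `load_artifacts`, `Converter`, and `fake_convert` are all hypothetical, and the converter is a stand-in for whatever call wraps Docling. The point it shows is that the artifact is written to disk by the preprocessing stage alone, and any consumer (SDG, RAG) reads it from there without triggering generation.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical converter type: takes a source document path and returns a
# Docling-style JSON document as a dict. A real one would wrap Docling.
Converter = Callable[[Path], dict]


def preprocess(sources: list[Path], out_dir: Path, convert: Converter) -> list[Path]:
    """Run only the preprocessing stage and persist one JSON artifact per source."""
    out_dir.mkdir(parents=True, exist_ok=True)
    artifacts = []
    for src in sources:
        artifact = out_dir / f"{src.stem}.json"
        artifact.write_text(json.dumps(convert(src), indent=2), encoding="utf-8")
        artifacts.append(artifact)
    return artifacts


def load_artifacts(artifact_dir: Path) -> list[dict]:
    """Consumers read the persisted artifacts; they never invoke the SDG pipeline."""
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(artifact_dir.glob("*.json"))
    ]


def fake_convert(src: Path) -> dict:
    # Stand-in for a real Docling conversion, only here to make the sketch runnable.
    return {"name": src.stem, "texts": [{"text": f"contents of {src.name}"}]}
```

Under these assumptions, a RAG proof of concept would call `preprocess(...)` once and then index the result of `load_artifacts(...)`, while SDG would consume the same directory as its input.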

      Done - Acceptance Criteria (mandatory - Complete while in Refinement status):

       

      Use Cases - i.e. User Experience & Workflow: (Initial completion while in Refinement status):

      • The RAG workstream wants to use the Docling output for its MVP

      Out of Scope (Initial completion while in Refinement status):

      • Exposing preprocessing API knobs to end user via CLI

      Documentation Considerations (Initial completion while in Refinement status):

      • For the end user, there should be no significant documentation considerations.

       

      Questions to Answer (Initial completion while in Refinement status):

       

      Does it make sense to scope this work in phase 1?

      What will be left for phase 2?

       

      Background and Strategic Fit (Initial completion while in Refinement status):

       [Ben]

      SDG outputs the legacy Docling JSON today, so at a minimum you need the real Docling JSON. Also, we don't do preprocessing separately from generation, so the only way to get the preprocessing outputs you need today is to run the entire generation pipeline and then consume the preprocessing artifacts rather than the generated samples.

       

      [Bill]

      I would like to have the available metadata with that real docling json too if possible, e.g., the URL that it was fetched from.  I guess the metadata can wait until 1.5 if necessary but we do need the real Docling JSON ASAP.
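
One way to carry the metadata mentioned above without altering the real Docling JSON is a sidecar file. The sketch below is an assumption for illustration only: the `SourceMetadata` class, the `write_with_metadata` function, and the `metadata/` subdirectory layout are all hypothetical, and the only field shown is the fetch URL named in the comment.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SourceMetadata:
    # Hypothetical schema; the ticket only names the fetch URL as an example.
    source_url: Optional[str] = None


def write_with_metadata(
    docling_json: dict, meta: SourceMetadata, out_dir: Path, name: str
) -> tuple[Path, Path]:
    """Write the Docling JSON unchanged, plus a sidecar metadata file."""
    meta_dir = out_dir / "metadata"  # assumed layout, not an InstructLab convention
    meta_dir.mkdir(parents=True, exist_ok=True)  # also creates out_dir
    doc_path = out_dir / f"{name}.json"
    meta_path = meta_dir / f"{name}.json"
    doc_path.write_text(json.dumps(docling_json), encoding="utf-8")
    meta_path.write_text(json.dumps(asdict(meta)), encoding="utf-8")
    return doc_path, meta_path
```

Keeping the metadata separate means the Docling JSON can ship first and the sidecar can follow later without changing the artifact format.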

       

      Customer Considerations (Initial completion while in Refinement status):

      RAG workstream 

      UI workstream

       

              rh-ee-asaluja Aditi Saluja
              rh-ee-asaluja Aditi Saluja
              Ben Browning, Bill Murdock