Red Hat Enterprise Linux AI / RHELAI-2603

Decouple docling output/data ingestion from SDG pipeline

    • 0% To Do, 100% In Progress, 0% Done

      Feature Overview (mandatory - Complete while in New status)

      • The Docling "artifact" is an ephemeral step today and is tightly coupled to the monolithic SDG pipeline.
      • The goal of this feature is to decouple data ingestion from SDG and move it to core.

      Goals (mandatory - Complete while in New status)

      • [Enhanced usage/Reusability] The Docling output artifact can be consumed by other workstreams, such as RAG.
      • [Modularity] The Docling output artifact can be generated independently of core SDG functionality.
      • [Latency] The Docling output artifact is generated just once and consumed by both SDG preprocessing and RAG. Both then use the same Docling version (preferably with hybrid chunking), and a user interested in both SDG and RAG avoids the inference cost of running Docling twice.
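As an illustration of the [Latency] goal, here is a minimal sketch of "generate once, consume from both SDG and RAG". It assumes the artifact is cached on disk under a hash of the source file. The names `ingest_once`, `convert`, and the cache layout are hypothetical and not part of any existing library; in practice `convert` would wrap the actual Docling conversion call.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def ingest_once(source: Path, cache_dir: Path,
                convert: Callable[[Path], dict]) -> Path:
    """Convert `source` at most once; return the path of the cached JSON artifact.

    `convert` is a hypothetical stand-in for the Docling conversion step.
    """
    # Key the artifact by file content so unchanged documents are never reconverted.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    artifact = cache_dir / f"{digest}.json"
    if not artifact.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        artifact.write_text(json.dumps(convert(source)), encoding="utf-8")
    return artifact
```

Both the SDG preprocessing step and the RAG step could call `ingest_once` with the same `cache_dir`; only the first call pays the conversion cost.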

      Requirements (mandatory - Complete while in Refinement status):

      1. Docling output generation logic is decoupled from the SDG library and moved to the CLI.
      2. Docling output generation can be invoked from the CLI.
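The two requirements above could take roughly the following shape. This is only a sketch under stated assumptions: the command name `ingest`, the `--output-dir` flag, and the injected `convert` callable are all hypothetical, and the real CLI surface is still to be defined by this feature.

```python
import argparse
import json
from pathlib import Path
from typing import Callable


def build_parser() -> argparse.ArgumentParser:
    # "ingest" and "--output-dir" are placeholder names, not an existing CLI.
    parser = argparse.ArgumentParser(
        prog="ingest",
        description="Convert documents to Docling JSON without running SDG.")
    parser.add_argument("sources", nargs="+", type=Path,
                        help="documents to convert")
    parser.add_argument("--output-dir", type=Path, default=Path("ingested"),
                        help="directory that receives one JSON file per source")
    return parser


def main(argv: list[str], convert: Callable[[Path], dict]) -> list[Path]:
    """Parse arguments, convert each source, and return the written paths."""
    args = build_parser().parse_args(argv)
    args.output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in args.sources:
        # Sources that share a stem would overwrite each other in this sketch.
        out = args.output_dir / f"{src.stem}.json"
        out.write_text(json.dumps(convert(src)), encoding="utf-8")
        written.append(out)
    return written
```

Keeping `convert` as a parameter means the SDG library never needs to be imported to produce the artifact, which is the decoupling the requirements ask for.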

      Done - Acceptance Criteria (mandatory - Complete while in Refinement status):

      • Data ingestion exists as standalone functionality within core that handles Docling conversion, and its output can be used for both SDG and RAG.

      Documentation Considerations (Initial completion while in Refinement status):

      • Document CLI usage for data ingestion 

      Background and Strategic Fit (Initial completion while in Refinement status):

       [Ben]

      SDG outputs the legacy docling JSON today, so at a minimum you need the real docling JSON. We also don't run preprocessing separately from generation, so the only way to get the preprocessing outputs you need today is to run the entire generation pipeline and then consume the preprocessing artifacts rather than the generated samples.

       

      [Bill]

      I would like to have the available metadata with that real Docling JSON too if possible, e.g., the URL it was fetched from. I guess the metadata can wait until 1.5 if necessary, but we do need the real Docling JSON ASAP.
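One possible way to carry that metadata without altering the Docling JSON itself is a sidecar file written next to the artifact. The sketch below assumes this layout; the function name `write_with_metadata`, the `.meta.json` suffix, and the `source_url` key are hypothetical choices for illustration only.

```python
import json
from pathlib import Path


def write_with_metadata(document: dict, metadata: dict, out_path: Path) -> Path:
    """Write the document JSON unchanged and put metadata in a sidecar file.

    Returns the sidecar path, e.g. doc.json -> doc.meta.json.
    """
    out_path.write_text(json.dumps(document), encoding="utf-8")
    meta_path = out_path.with_suffix(".meta.json")
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return meta_path
```

For example, `write_with_metadata(doc, {"source_url": url}, Path("out/doc.json"))` would leave `out/doc.json` byte-for-byte what the converter produced and record the fetch URL in `out/doc.meta.json`.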

       

      Customer Considerations (Initial completion while in Refinement status):

      In the future, once ingestion moves to core, the output of data ingestion can be used by internal workstreams: RAG, UI, and SDG. Each workstream uses its own chunking strategy.

      • RAG workstream 
      • UI workstream
      • SDG workstream also becomes a customer and consumes output of data ingestion

       

      Relevant Documents/Discussions:

      https://docs.google.com/document/d/1K0XmwOhRRqFsFZdVApX_iqANBhtkmViDx10BHrmjTIU/edit?tab=t.0

      Recording: https://drive.google.com/file/d/1_C06BfCryaDTf0cDYqfOeJy-L0wMtVUg/view?usp=sharing

              rh-ee-asaluja Aditi Saluja
              Ben Browning, Jehlum Vitasta Pandit