ModeShape / MODE-1560

Tika Text extractor cannot extract content from MSOffice files if used together with sequencer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 2.8.1.GA, 2.8.2.Final
    • Fix Version/s: 2.8.3.Final
    • Component/s: None
    • Labels: None

      Description

      When a repository is configured with both a MsOfficeSequencer and a TikaTextExtractor, the Tika extractor cannot extract content from MS Office files.

      The cause is that the MsOffice sequencer enforces an Apache POI dependency version of 3.7, while the tika-parsers 1.0 library needs at least a beta version of 3.8 to extract content from Office documents. Tika uses the NPOIFS* classes from POI, which are not present in 3.7.
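
      One way to confirm which side of this conflict is on the classpath is to probe for an NPOIFS class at runtime. The following is a minimal sketch, not part of ModeShape: the class PoiClasspathCheck and its helper isClassPresent are hypothetical names, and it assumes org.apache.poi.poifs.filesystem.NPOIFSFileSystem is a representative member of the NPOIFS* classes mentioned above.

```java
public class PoiClasspathCheck {

    // One of the NPOIFS* classes; per the description above, these are absent from POI 3.7.
    static final String NPOIFS_CLASS = "org.apache.poi.poifs.filesystem.NPOIFSFileSystem";

    /** Returns true if the named class can be loaded from this class's class loader. */
    static boolean isClassPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, PoiClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassPresent(NPOIFS_CLASS)) {
            System.out.println("NPOIFS classes found: POI looks new enough for tika-parsers 1.0");
        } else {
            System.out.println("NPOIFS classes missing: POI on the classpath is probably 3.7 or older");
        }
    }
}
```

      Running this with the same classpath as the repository should show whether the POI 3.7 jar pulled in by the sequencer has won over the newer one that Tika expects.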

      Worse, the error is well hidden, because any problems during text extraction (see org.modeshape.search.lucene.LuceneSearchSession) are silently ignored.
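
      To illustrate the difference between swallowing and surfacing such a failure, here is a hedged sketch of an extraction wrapper that logs before giving up. It is not the code in LuceneSearchSession: LoggedExtraction and its extract method are hypothetical, the extractor is modelled as a plain Callable, and Java 8 or later is assumed. Note that a missing POI class typically surfaces as a NoClassDefFoundError, which is a LinkageError rather than an Exception, so both are caught.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggedExtraction {

    private static final Logger LOG = Logger.getLogger(LoggedExtraction.class.getName());

    /**
     * Runs the given extractor and returns its text, or an empty Optional on failure.
     * Failures are logged with the document name instead of being dropped silently.
     */
    static Optional<String> extract(String documentName, Callable<String> extractor) {
        try {
            return Optional.ofNullable(extractor.call());
        } catch (Exception | LinkageError e) {
            // LinkageError covers NoClassDefFoundError from a too-old library on the classpath
            LOG.log(Level.WARNING, "Text extraction failed for " + documentName, e);
            return Optional.empty();
        }
    }
}
```

      With a wrapper like this, indexing can still continue when one document fails, but the log shows the underlying cause.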

              People

              • Assignee: Horia Chiorean (hchiorean)
              • Reporter: Horia Chiorean (hchiorean)
