ModeShape / MODE-2701

Use dedicated objects for extracted text in S3 binary storage


Details

    • Enhancement
    • Resolution: Unresolved
    • Major
    • 5.5
    • None
    • Storage

    Description

      The S3BinaryStore currently stores extracted text as user-defined metadata on the S3 objects.
      S3 limits user-defined metadata to 2 KB per object, which caps the size of the extracted text that can be stored.
      IMHO that limit is far too low: I have lots of documents whose extracted text is around 100 KB.

      I suggest storing extracted text as dedicated S3 objects instead, for example under a key with an /extracted-text suffix:

      ...
      751f58c75aba8627e4d5b591aa7ceec5413c6a6a
      751f58c75aba8627e4d5b591aa7ceec5413c6a6a/extracted-text
      77355f3b1916329e7abd7e17987d543a09c36471
      77355f3b1916329e7abd7e17987d543a09c36471/extracted-text
      77705003b5ea10bc6664af107242af37bfef7115
      77705003b5ea10bc6664af107242af37bfef7115/extracted-text
      ...
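
The key layout in the listing above could be derived with a small helper like the one below. This is only a sketch: the class name ExtractedTextKeys and its methods are hypothetical and not part of ModeShape or the S3 SDK, and the suffix is simply the one proposed in this issue.

```java
// Hypothetical helper, not part of ModeShape: maps between the key of a
// binary object and the key of its dedicated extracted-text object.
final class ExtractedTextKeys {
    // Suffix proposed in this issue; assumed here, not an existing constant.
    static final String SUFFIX = "/extracted-text";

    private ExtractedTextKeys() {
    }

    // Key of the S3 object that would hold the extracted text of a binary.
    static String textKeyFor(String binaryKey) {
        return binaryKey + SUFFIX;
    }

    // True if the given S3 object key denotes an extracted-text object.
    static boolean isTextKey(String objectKey) {
        return objectKey.endsWith(SUFFIX);
    }

    // Recovers the binary key from an extracted-text key.
    static String binaryKeyFor(String textKey) {
        if (!isTextKey(textKey)) {
            throw new IllegalArgumentException("Not an extracted-text key: " + textKey);
        }
        return textKey.substring(0, textKey.length() - SUFFIX.length());
    }
}
```

Since binary keys are hex-encoded hashes and never contain a slash, the suffix cannot collide with a real binary key.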
      

      I don't foresee any particular issue with the implementation.
      The only open point is the getAllBinaryKeys() method, which I don't know how to implement with the current API of the S3 SDK: S3Objects doesn't support setting a "delimiter", which is needed to filter the "xxx/extracted-text" entries out of the listing.
      I've opened a pull request to address this: https://github.com/aws/aws-sdk-java/pull/1132
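
Until a delimiter can be set, one possible fallback is to filter the listing on the client side. The sketch below assumes the object keys have already been read from the bucket listing (for example from the object summaries that S3Objects iterates over). The class BinaryKeyFilter is hypothetical and works on plain strings, so it makes no SDK calls itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side filter, not part of ModeShape or the S3 SDK.
final class BinaryKeyFilter {
    private BinaryKeyFilter() {
    }

    // Keeps only keys without a '/' separator, i.e. the binary objects,
    // and drops entries such as "xxx/extracted-text".
    static List<String> binaryKeysOnly(Iterable<String> objectKeys) {
        List<String> result = new ArrayList<>();
        for (String key : objectKeys) {
            if (key.indexOf('/') < 0) {
                result.add(key);
            }
        }
        return result;
    }
}
```

The drawback is that every key is still transferred, so the listing is roughly twice as long as necessary. With a "/" delimiter, S3 would group the "xxx/extracted-text" entries into common prefixes on the server side instead.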

          People

            Assignee: Unassigned
            Reporter: Damiano Albani (dalbani, inactive)