ModeShape / MODE-2701

Use dedicated objects for extracted text in S3 binary storage

    • Enhancement
    • Resolution: Unresolved
    • Major
    • 5.5
    • None
    • Storage

      The S3BinaryStore currently stores extracted text as "user metadata" of S3 objects.
      S3 limits user metadata to 2 KB per object, so that is all the space available for extracted text.
      This is IMHO a very low limit, given that I have lots of documents whose extracted text is around 100 KB.

      I suggest storing extracted text as dedicated objects in S3 instead, for example under a key with an /extracted-text suffix:

      ...
      751f58c75aba8627e4d5b591aa7ceec5413c6a6a
      751f58c75aba8627e4d5b591aa7ceec5413c6a6a/extracted-text
      77355f3b1916329e7abd7e17987d543a09c36471
      77355f3b1916329e7abd7e17987d543a09c36471/extracted-text
      77705003b5ea10bc6664af107242af37bfef7115
      77705003b5ea10bc6664af107242af37bfef7115/extracted-text
      ...
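
To make the proposed layout concrete, here is a minimal sketch of the key scheme. It is not ModeShape code: `ExtractedTextLayout` and its methods are hypothetical names, and an in-memory sorted map stands in for the S3 bucket so that no SDK calls have to be assumed. The only detail taken from the proposal is the /extracted-text suffix.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of ModeShape: models the proposed key layout.
// A sorted in-memory map stands in for the S3 bucket (assumption for illustration).
class ExtractedTextLayout {
    // Suffix taken from the proposal.
    static final String SUFFIX = "/extracted-text";

    private final Map<String, byte[]> bucket = new TreeMap<>();

    // Derives the key of the companion object that holds the extracted text.
    static String extractedTextKey(String binaryKey) {
        return binaryKey + SUFFIX;
    }

    // Stores the binary content under its own key.
    void storeBinary(String binaryKey, byte[] content) {
        bucket.put(binaryKey, content);
    }

    // Stores the extracted text as a dedicated object rather than as metadata,
    // so its size is not bound by the 2 KB metadata limit.
    void storeExtractedText(String binaryKey, String text) {
        bucket.put(extractedTextKey(binaryKey), text.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the extracted text, or null if none has been stored for this binary.
    String getExtractedText(String binaryKey) {
        byte[] bytes = bucket.get(extractedTextKey(binaryKey));
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    // All keys in sorted order: each binary key, then its companion key if present.
    Iterable<String> keys() {
        return bucket.keySet();
    }
}
```

With this layout, a 100 KB extracted text is just another object body, and looking it up only requires the binary's own key.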
      

      I don't foresee any particular issue with the implementation.
      The only exception is the getAllBinaryKeys() method, which I don't know how to implement with the current API of the S3 SDK: S3Objects doesn't support setting a "delimiter", which is needed to filter the "xxx/extracted-text" entries out of the listing.
      I've opened a pull request to address that: https://github.com/aws/aws-sdk-java/pull/1132
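
Until a delimiter can be set on the listing, one possible fallback is to filter on the client side: list every key and drop the ones that end with the suffix. The sketch below only shows that filtering step. It assumes the listing has already been collected into a list of key strings, makes no AWS SDK calls, and uses hypothetical names (`BinaryKeyFilter`, `binaryKeysOnly`). The trade-off is that the companion keys are still transferred in the listing before being discarded.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of ModeShape or the AWS SDK.
class BinaryKeyFilter {
    // Suffix of the companion objects, as proposed in this issue.
    static final String SUFFIX = "/extracted-text";

    // Keeps only the keys of binary objects, dropping "xxx/extracted-text" entries.
    static List<String> binaryKeysOnly(List<String> listedKeys) {
        return listedKeys.stream()
                .filter(key -> !key.endsWith(SUFFIX))
                .collect(Collectors.toList());
    }
}
```

A getAllBinaryKeys() implementation could apply such a filter to whatever the listing returns, at the cost of reading about twice as many keys as there are binaries when every binary has extracted text.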

              Assignee: Unassigned
              Reporter: Damiano Albani (dalbani, Inactive)