ModeShape / MODE-2701

Use dedicated objects for extracted text in S3 binary storage

    Details

      Description

      The S3BinaryStore currently stores extracted text as "user metadata" of S3 objects.
      S3 limits user metadata to 2 KB per object, so the extracted text cannot exceed that size.
      IMHO this is a very low limit, given that I have lots of documents whose extracted text is around 100 KB.

      I suggest switching to storing extracted text as dedicated S3 objects, for example under keys with an /extracted-text suffix:

      ...
      751f58c75aba8627e4d5b591aa7ceec5413c6a6a
      751f58c75aba8627e4d5b591aa7ceec5413c6a6a/extracted-text
      77355f3b1916329e7abd7e17987d543a09c36471
      77355f3b1916329e7abd7e17987d543a09c36471/extracted-text
      77705003b5ea10bc6664af107242af37bfef7115
      77705003b5ea10bc6664af107242af37bfef7115/extracted-text
      ...
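
The naming scheme in the listing above could be captured in a small helper. This is only a sketch: the class ExtractedTextKeys and its methods are hypothetical names and are not part of ModeShape or the S3 SDK. Only the /extracted-text suffix comes from the proposal.

```java
// Hypothetical helper (not part of ModeShape): maps between a binary's key
// and the key of the dedicated object that would hold its extracted text.
final class ExtractedTextKeys {
    // Suffix taken from the proposal in this issue.
    static final String SUFFIX = "/extracted-text";

    private ExtractedTextKeys() {
    }

    // Key of the object that would store the extracted text for binaryKey.
    static String forBinary(String binaryKey) {
        return binaryKey + SUFFIX;
    }

    // True if the key names an extracted-text object rather than a binary.
    static boolean isExtractedText(String key) {
        return key.endsWith(SUFFIX);
    }

    // Recovers the binary key from an extracted-text key.
    static String binaryKeyOf(String extractedTextKey) {
        if (!isExtractedText(extractedTextKey)) {
            throw new IllegalArgumentException("Not an extracted-text key: " + extractedTextKey);
        }
        return extractedTextKey.substring(0, extractedTextKey.length() - SUFFIX.length());
    }
}
```

Because the binary keys are hex-encoded hashes, they never contain a slash themselves, so the suffix cannot collide with a real binary key.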
      

      I don't foresee any particular issue with the implementation.
      The one exception is the getAllBinaryKeys() method, which I don't know how to implement with the current API of the S3 SDK: S3Objects doesn't support setting a "delimiter", which is needed to filter the "xxx/extracted-text" entries out of the listing.
      I've opened a pull request to address this: https://github.com/aws/aws-sdk-java/pull/1132
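
Until a delimiter can be set on the listing, one possible fallback is to filter the keys on the client side. The sketch below is not the delimiter-based approach described above: it is a plain suffix filter over an already fetched sequence of keys. The class BinaryKeyFilter and its method are hypothetical, and the S3 listing call itself is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: drops "xxx/extracted-text" entries from a flat key
// listing on the client side, as a stand-in for a server-side delimiter.
final class BinaryKeyFilter {
    // Suffix taken from the proposal in this issue.
    static final String EXTRACTED_TEXT_SUFFIX = "/extracted-text";

    private BinaryKeyFilter() {
    }

    // Returns only the keys that name binaries, preserving listing order.
    static List<String> binaryKeysOnly(Iterable<String> listedKeys) {
        List<String> result = new ArrayList<>();
        for (String key : listedKeys) {
            if (!key.endsWith(EXTRACTED_TEXT_SUFFIX)) {
                result.add(key);
            }
        }
        return result;
    }
}
```

The drawback is that every extracted-text entry is still returned by the listing and only discarded afterwards, which is why a delimiter on the request would be preferable.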

              People

              • Assignee: Unassigned
              • Reporter: Damiano Albani (dalbani)