Enhancement
Resolution: Unresolved
Major
None
The S3BinaryStore currently stores extracted text as "user metadata" of S3 objects.
S3 limits user-defined metadata to 2 KB per object, so the extracted text cannot exceed that size.
IMHO this is a very low limit, given that I have many documents whose extracted text is around 100 KB.
I suggest switching to storing extracted text as dedicated objects in S3, for example with an /extracted-text key suffix:
...
751f58c75aba8627e4d5b591aa7ceec5413c6a6a
751f58c75aba8627e4d5b591aa7ceec5413c6a6a/extracted-text
77355f3b1916329e7abd7e17987d543a09c36471
77355f3b1916329e7abd7e17987d543a09c36471/extracted-text
77705003b5ea10bc6664af107242af37bfef7115
77705003b5ea10bc6664af107242af37bfef7115
...
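To make the proposed key layout concrete, here is a minimal sketch of how the mapping between a binary key and its extracted-text key might look. The class and method names (ExtractedTextKeys, forBinary, isExtractedText, binaryKeyOf) are hypothetical and do not exist in S3BinaryStore; only the /extracted-text suffix comes from the proposal above.

```java
import java.util.Optional;

// Hypothetical helper illustrating the proposed key layout.
// None of these names are part of the existing S3BinaryStore.
final class ExtractedTextKeys {
    // Suffix proposed in this issue for the dedicated extracted-text object.
    static final String SUFFIX = "/extracted-text";

    private ExtractedTextKeys() {
    }

    // Key of the S3 object that would hold the extracted text of a binary.
    static String forBinary(String binaryKey) {
        return binaryKey + SUFFIX;
    }

    // True if the object key denotes an extracted-text object, not a binary.
    static boolean isExtractedText(String objectKey) {
        return objectKey.endsWith(SUFFIX);
    }

    // Recovers the binary key from an extracted-text key, if it is one.
    static Optional<String> binaryKeyOf(String objectKey) {
        if (!isExtractedText(objectKey)) {
            return Optional.empty();
        }
        int end = objectKey.length() - SUFFIX.length();
        return Optional.of(objectKey.substring(0, end));
    }
}
```

Since the binary keys are hex digests and never contain a slash, a key ending in /extracted-text cannot collide with a binary key.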
I don't foresee any particular issue with the implementation.
The only exception is the getAllBinaryKeys() method, which I don't know how to implement with the current API of the S3 SDK: S3Objects doesn't support setting a "delimiter", which is needed to filter the "xxx/extracted-text" entries out of the listing.
I've opened a pull request in that regard: https://github.com/aws/aws-sdk-java/pull/1132
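As a possible fallback while S3Objects has no delimiter support, the listing could be filtered on the client side. The sketch below assumes the object keys have already been collected from the listing as plain strings; BinaryKeyFilter and binaryKeysOnly are hypothetical names, and this is not how getAllBinaryKeys() is currently implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side filter; an illustration, not the store's actual code.
final class BinaryKeyFilter {
    // Suffix proposed in this issue for extracted-text objects.
    static final String EXTRACTED_TEXT_SUFFIX = "/extracted-text";

    private BinaryKeyFilter() {
    }

    // Returns only the keys that denote binaries, preserving listing order.
    static List<String> binaryKeysOnly(Iterable<String> listedKeys) {
        List<String> binaryKeys = new ArrayList<>();
        for (String key : listedKeys) {
            if (!key.endsWith(EXTRACTED_TEXT_SUFFIX)) {
                binaryKeys.add(key);
            }
        }
        return binaryKeys;
    }
}
```

This still fetches the extracted-text entries from S3 and discards them locally, so it is a workaround rather than a substitute for setting a delimiter on the listing request.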