The S3BinaryStore currently stores extracted text as "user metadata" of S3 objects.
This caps the available space at 2 KB, the S3 limit for user-defined metadata.
This is IMHO a very low limit, given that I have lots of documents whose extracted text is in the range of 100 KB.
I suggest switching to storing extracted text as dedicated objects in S3, for example with a /extracted-text key suffix (e.g. "xxx/extracted-text").
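To make the naming scheme concrete, here is a minimal sketch of how the companion key could be derived. The class and method names (ExtractedTextKeys, textKeyFor, isTextKey) are hypothetical and not part of S3BinaryStore; only the /extracted-text suffix comes from the proposal.

```java
// Hypothetical helper, not existing S3BinaryStore code.
final class ExtractedTextKeys {
    // Suffix proposed in this issue for the dedicated extracted-text object.
    static final String SUFFIX = "/extracted-text";

    private ExtractedTextKeys() {
    }

    // Key of the S3 object that would hold the extracted text of a binary.
    static String textKeyFor(String binaryKey) {
        return binaryKey + SUFFIX;
    }

    // True if the given S3 object key names an extracted-text object.
    static boolean isTextKey(String objectKey) {
        return objectKey.endsWith(SUFFIX);
    }
}
```

With this layout the binary itself keeps its current key, so existing objects would not need to be moved.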
I don't foresee any particular issue with the implementation.
The only open point is the getAllBinaryKeys() method, which I don't know how to implement with the current API of the S3 SDK: S3Objects doesn't support setting a "delimiter", which is needed to filter out the "xxx/extracted-text" entries from the listing.
I've opened a pull request in that regard: https://github.com/aws/aws-sdk-java/pull/1132
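Until a delimiter can be set, one possible fallback (not what the pull request does) would be to filter the listing on the client side. The sketch below assumes the object keys have already been collected as plain strings; BinaryKeyFilter and binaryKeysOnly are hypothetical names. Note that the extracted-text entries would still be transferred in the listing, which is exactly what a delimiter would avoid.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side filter, not existing S3BinaryStore code.
final class BinaryKeyFilter {
    // Suffix proposed in this issue for extracted-text objects.
    static final String SUFFIX = "/extracted-text";

    private BinaryKeyFilter() {
    }

    // Returns the listed keys minus the extracted-text entries, in listing order.
    static List<String> binaryKeysOnly(Iterable<String> listedKeys) {
        List<String> result = new ArrayList<>();
        for (String key : listedKeys) {
            if (!key.endsWith(SUFFIX)) {
                result.add(key);
            }
        }
        return result;
    }
}
```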