Type: Task
Resolution: Unresolved
Priority: Major
This came up in a Twitter discussion: it could be useful to have an SMT which externalizes large BLOB/CLOB column values and propagates the external reference in data change events. The motivation is to avoid large messages in Apache Kafka.
One particular implementation could be based on Amazon S3: when creating a change event, the values of any configured large columns would be written to S3 object storage, and the corresponding field value in the change event would describe the bucket name and object id. The object id should be derived from the offset of the change event (plus the column name and, optionally, a before/after flag), so that the same id is used when the same offset is processed a second time after a connector restart.
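A minimal sketch of the core externalization logic such an SMT might use, assuming the AWS SDK v2 S3 client; the class and method names (BlobExternalizer, externalize) are purely illustrative and not part of Debezium or Kafka Connect:

{code:java}
import java.util.Map;
import java.util.stream.Collectors;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Hypothetical helper an SMT could delegate to for externalizing a large column value.
public class BlobExternalizer {

    private final S3Client s3 = S3Client.create();
    private final String bucket;

    public BlobExternalizer(String bucket) {
        this.bucket = bucket;
    }

    /**
     * Writes a large column value to S3 and returns the reference to embed in
     * the change event instead of the value itself. The object key is derived
     * from the event's source offset, the column name and a before/after flag,
     * so re-processing the same offset after a connector restart produces the
     * same key and the upload stays idempotent.
     */
    public String externalize(Map<String, ?> sourceOffset, String columnName,
            boolean isAfter, byte[] columnValue) {

        // Deterministic object key, e.g. "lsn=12345/after/attachment";
        // keys are sorted so the offset part is stable across restarts
        String offsetPart = sourceOffset.entrySet().stream()
                .sorted(Map.Entry.comparingByKey())
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        String objectKey = offsetPart + "/" + (isAfter ? "after" : "before") + "/" + columnName;

        s3.putObject(
                PutObjectRequest.builder().bucket(bucket).key(objectKey).build(),
                RequestBody.fromBytes(columnValue));

        // The change event field would carry this reference instead of the BLOB/CLOB value
        return "s3://" + bucket + "/" + objectKey;
    }
}
{code}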
Consumers might resolve the reference, retrieve the referenced object, and persist it in a sink datastore. More commonly, though, consumers would probably just persist the object reference itself, pushing object retrieval to the readers of that sink datastore.
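For the first pattern, a consumer-side sketch of resolving such a reference could look like the following; it assumes the "s3://<bucket>/<key>" reference format used in the sketch above, which is an assumption rather than a defined contract:

{code:java}
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

// Hypothetical consumer-side helper that fetches the externalized value.
public class ReferenceResolver {

    private final S3Client s3 = S3Client.create();

    /** Retrieves the object behind an "s3://bucket/key" reference. */
    public byte[] resolve(String reference) {
        String withoutScheme = reference.substring("s3://".length());
        int slash = withoutScheme.indexOf('/');
        String bucket = withoutScheme.substring(0, slash);
        String key = withoutScheme.substring(slash + 1);

        return s3.getObjectAsBytes(
                GetObjectRequest.builder().bucket(bucket).key(key).build())
            .asByteArray();
    }
}
{code}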
Eventually, multiple object stores may be supported, but S3 will be a good candidate for an initial PoC.