- Enhancement
- Resolution: Done
- Major
Cassandra CDC enqueues an EOF (end-of-file) event to ChangeEventQueue after it finishes reading all mutations in a CommitLogFile. Since Cassandra CDC has only one instance of ChangeEventQueue, the EOF event is guaranteed to come after all change events of that CommitLogFile. When QueueProcessor polls the EOF event of a CommitLogFile, all change events of that file have been published successfully, and Cassandra CDC moves the file from cdc_raw into the success relocation folder. However, if Cassandra CDC fails to publish the change event of a mutation, it stops, and the CommitLogFile that contains this mutation is never moved out of cdc_raw. As unprocessed files accumulate in cdc_raw, this can eventually suspend writes into the Cassandra DB.
To solve the potential P0 issue described above, we want to make the following changes in Cassandra CDC:
1) When Cassandra CDC fails to publish a change event, it should catch the exception and keep processing other change events.
2) However, 1) introduces a new problem: when QueueProcessor polls the EOF event of a CommitLogFile, some change events of this file may not have been published successfully, yet the file would still be moved to the success relocation folder and would not be re-processed.
3) To solve the problem described in 2), we might maintain a set/map in CassandraConnectorContext, QueueProcessor, or CommitLogPostProcessor that records the names of CommitLogFiles with failed change events. When QueueProcessor polls an EOF event, it should first check whether the name of the CommitLogFile is in the set/map. If it is, the file should be moved to the error relocation folder for re-processing; otherwise, it should be moved to the success relocation folder.
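
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the connector's actual code: EofRelocationSketch, Event, Publisher, and drain are hypothetical names, the queue is a plain java.util.Queue processed on one thread, and "moving" a file is reduced to recording a target path string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: none of these types are the connector's real classes.
class EofRelocationSketch {

    /** A queued item: a change event, or the EOF marker of a commit log file. */
    static final class Event {
        final String commitLogFile;
        final String payload; // null for EOF markers
        final boolean eof;

        private Event(String commitLogFile, String payload, boolean eof) {
            this.commitLogFile = commitLogFile;
            this.payload = payload;
            this.eof = eof;
        }

        static Event change(String commitLogFile, String payload) {
            return new Event(commitLogFile, payload, false);
        }

        static Event eof(String commitLogFile) {
            return new Event(commitLogFile, null, true);
        }
    }

    /** Stand-in for whatever publishes change events; may throw on failure. */
    interface Publisher {
        void publish(Event event) throws Exception;
    }

    // Step 3: names of commit log files with at least one failed change event.
    private final Set<String> erroneousFiles = new HashSet<>();

    /** Drains the queue and returns the relocation target chosen for each file. */
    List<String> drain(Queue<Event> queue, Publisher publisher) {
        List<String> relocations = new ArrayList<>();
        Event event;
        while ((event = queue.poll()) != null) {
            if (event.eof) {
                // The EOF marker arrives after all change events of its file,
                // so the failure set is complete for this file at this point.
                boolean failed = erroneousFiles.remove(event.commitLogFile);
                relocations.add((failed ? "error/" : "success/") + event.commitLogFile);
            } else {
                try {
                    publisher.publish(event);
                } catch (Exception e) {
                    // Step 1: do not stop; remember the file and keep going.
                    erroneousFiles.add(event.commitLogFile);
                }
            }
        }
        return relocations;
    }
}
```

In this sketch, a file whose change events all publish cleanly ends up under "success/", while a file with any failed change event ends up under "error/" once its EOF marker is polled. Removing the file name from the set at EOF time keeps the set from growing without bound. If the set were shared between threads in the real connector, a concurrent set would be needed instead of HashSet.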