While testing AWS Aurora PostgreSQL failover, the task often ends up in a failed state. AWS claims a 30-second failover, but Debezium has a hard-coded 10-second wait before restarting after a retriable exception, so the restart may be attempted before the failover has completed. In testing with local Debezium code changes that allow a longer wait time, the failover completes, the connection to PostgreSQL is restored, and Debezium picks up where it left off.
If a second failover occurs before the first connector restart happens, the task can also end up in a failed state. Testing with local Debezium code changes that retry the connection has helped with this scenario.
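
The retry idea can be sketched roughly as follows. This is not the local Debezium change itself (Debezium is written in Java); Python is used only as a compact illustration. The names `connect_with_retries`, `RetriableConnectionError`, `max_attempts`, and `delay_seconds` are hypothetical, and the default values are assumptions chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriableConnectionError(Exception):
    """Hypothetical stand-in for a retriable connection failure."""


def connect_with_retries(
    connect: Callable[[], T],
    max_attempts: int = 6,        # assumed value, not a Debezium default
    delay_seconds: float = 10.0,  # assumed value for the wait between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except RetriableConnectionError:
            if attempt == max_attempts:
                # Out of attempts: surface the error so the caller can fail.
                raise
            # Give the failover time to progress before the next attempt.
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

With these assumed values, six attempts spaced 10 seconds apart allow up to 50 seconds of waiting before giving up, which is longer than the 30-second failover AWS claims and leaves some room for a second failover during the first recovery.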