Debezium / DBZ-6939

Vitess connector should retry on not found errors


    • Type: Feature Request
    • Resolution: Done
    • Priority: Major
    • 2.4.0.CR1
    • None
    • vitess-connector
    • None

      Feature request or enhancement

      Which use case/requirement will be addressed by the proposed feature?

      If a tablet in a specific cell is not found, the connector should retry rather than kill the task, because the tablet may be only temporarily down or not discoverable. A killed task stays stopped until someone intervenes manually, even if the tablet has since recovered on the Vitess side, so not retrying causes unnecessary downtime, lag, and catch-up time.

      Exception observed that should be retried but currently is not:

       

      2023-09-15 20:01:21,680 ERROR  Vitess|prod.byuser|streaming  Producer failure   [io.debezium.pipeline.ErrorHandler]
      io.grpc.StatusRuntimeException: NOT_FOUND: tablet: cell:"us_east_1e" uid:300240074 is either down or nonexistent
              at io.grpc.Status.asRuntimeException(Status.java:533)
              at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
              at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
              at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
              at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
              at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:616)
              at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:69)
              at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:802)
              at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:781)
              at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
              at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:829)
      2023-09-15 20:01:22,367 ERROR  ||  WorkerSourceTask{id=byuser-connector-23} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted   [org.apache.kafka.connect.runtime.WorkerTask]
      org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
              at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:72)
              at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:78)
              at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:29)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:205)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:172)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:118)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: io.grpc.StatusRuntimeException: NOT_FOUND: tablet: cell:"us_east_1e" uid:300240074 is either down or nonexistent
              at io.grpc.Status.asRuntimeException(Status.java:533)
              at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:478)
              at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
              at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
              at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
              at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:616)
              at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:69)
              at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:802)
              at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:781)
              at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
              at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
              ... 3 more
      2023-09-15 20:01:22,367 INFO   ||  Stopping down connector   [io.debezium.connector.common.BaseSourceTask] 

       

      Implementation ideas (optional)

      Expand VitessErrorHandler to retry on this error
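
To make the idea concrete, here is a minimal, self-contained sketch of the retry decision. It is not the actual VitessErrorHandler code: `NotFoundRetrySketch`, its `Code` enum, and `StatusException` are hypothetical stand-ins for the gRPC status code and `io.grpc.StatusRuntimeException` seen in the log above, so that the example compiles without gRPC or Debezium on the classpath. It walks the cause chain because, in the log, the gRPC exception appears as the cause of the `ConnectException`.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustration only: stand-in types, not the real gRPC or Debezium classes. */
final class NotFoundRetrySketch {

    /** Hypothetical local enum mirroring a few gRPC status code names. */
    enum Code { CANCELLED, UNKNOWN, NOT_FOUND, PERMISSION_DENIED, UNAVAILABLE }

    /** Hypothetical stand-in for io.grpc.StatusRuntimeException; carries only a code. */
    static final class StatusException extends RuntimeException {
        private final Code code;

        StatusException(Code code, String message) {
            super(code + ": " + message);
            this.code = code;
        }

        Code code() {
            return code;
        }
    }

    /**
     * Codes treated as transient in this sketch. Only NOT_FOUND comes from the
     * issue; a real handler would add it to whatever codes it already retries.
     */
    private static final Set<Code> RETRIABLE = EnumSet.of(Code.NOT_FOUND);

    /** Upper bound on cause-chain depth, guarding against cyclic causes. */
    private static final int MAX_DEPTH = 32;

    /** Returns true if the throwable, or any of its causes, carries a retriable code. */
    static boolean isRetriable(Throwable throwable) {
        int depth = 0;
        for (Throwable cur = throwable; cur != null && depth < MAX_DEPTH; cur = cur.getCause()) {
            if (cur instanceof StatusException
                    && RETRIABLE.contains(((StatusException) cur).code())) {
                return true;
            }
            depth++;
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable notFound = new StatusException(
                Code.NOT_FOUND, "tablet is either down or nonexistent");
        // Wrapped the way the log shows: the gRPC-style error is the cause.
        Throwable wrapped = new RuntimeException(
                "An exception occurred in the change event producer.", notFound);

        System.out.println(isRetriable(wrapped));
        System.out.println(isRetriable(new StatusException(Code.PERMISSION_DENIED, "denied")));
    }
}
```

In the connector itself the equivalent check would presumably sit wherever VitessErrorHandler decides whether an error is retriable, and would inspect the real gRPC status code rather than a local enum.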

            Assignee: Unassigned
            Reporter: Thomas Thornton (tthorn)
            Votes: 0
            Watchers: 3
