Debezium / DBZ-4389

VStream gRPC connection closed after being idle for a few minutes

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 1.8.0.CR1
    • vitess-connector
    • Steps to reproduce:
      1. Start Kafka Connect with the Vitess plugin.
      2. Optionally insert a row or two into Vitess using the mysql CLI, or do nothing.
      3. Wait for a few minutes. On a laptop this typically takes 5+ minutes; in production it takes around 10 minutes.
      4. The error occurs in the Kafka Connect worker and the connector stops working, with the task in the FAILED state. The error can be found in the Kafka Connect worker log or by querying the REST API for the task status, for example: curl -s localhost:8083/connectors/sample_single_shard_connector/status

      When the Vitess connector is started and no database changes occur that would trigger change events, the VStream gRPC connection is closed after a few minutes and the Vitess connector crashes, with the task moving to the FAILED state.

      I can always reproduce this in both my local environment and our production environment. I have observed two different error messages, but both carry the gRPC status code UNAVAILABLE.

      On a local laptop, where Vitess runs in a Docker container:

      Caused by: java.lang.RuntimeException: io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason
          at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:74)
          ... 8 more
      Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: Network closed for unknown reason

      In the production environment, where an Envoy proxy is used:

      Caused by: java.lang.RuntimeException: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
      Received Rst Stream
          at io.debezium.connector.vitess.VitessStreamingChangeEventSource.execute(VitessStreamingChangeEventSource.java:74)
          ... 9 more
      Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: HTTP/2 error code: NO_ERROR
      Received Rst Stream
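
      Both traces have the same shape: a plain RuntimeException whose cause is a StatusRuntimeException with status UNAVAILABLE. Any code that wants to recognize this failure therefore has to walk the cause chain rather than inspect only the top-level exception. The following is a minimal sketch of that walk. So that it runs without gRPC on the classpath, `Code` and `StatusException` are hypothetical stand-ins for `io.grpc.Status.Code` and `io.grpc.StatusRuntimeException`; the class and method names are likewise illustrative and not part of the connector.

```java
import java.util.Objects;

final class UnavailableCauseCheck {

    /** Hypothetical stand-in for io.grpc.Status.Code, reduced to two values. */
    enum Code { OK, UNAVAILABLE }

    /** Hypothetical stand-in for io.grpc.StatusRuntimeException. */
    static class StatusException extends RuntimeException {
        final Code code;

        StatusException(Code code, String description) {
            super(code + ": " + description);
            this.code = Objects.requireNonNull(code);
        }
    }

    /** Returns true if any exception in the cause chain carries the UNAVAILABLE code. */
    static boolean hasUnavailableCause(Throwable error) {
        int depth = 0;
        // The depth bound guards against a pathological, cyclic cause chain.
        for (Throwable t = error; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof StatusException && ((StatusException) t).code == Code.UNAVAILABLE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Same nesting as in the traces above: RuntimeException wrapping the status exception.
        Throwable wrapped = new RuntimeException(
                new StatusException(Code.UNAVAILABLE, "Network closed for unknown reason"));
        System.out.println(hasUnavailableCause(wrapped));
        System.out.println(hasUnavailableCause(new RuntimeException("other failure")));
    }
}
```

      Running `main` prints `true` for the wrapped UNAVAILABLE error and `false` for the unrelated one.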

      I'm working on two proposed fixes:

      • Enable keepalive pings for the VStream gRPC connection. I tested a 1-minute interval, which prevents the crash. The interval can be made configurable; setting it is recommended but not required.
      • Implement a VitessErrorHandler and mark a StatusRuntimeException with the UNAVAILABLE status code as retriable, as recommended by the gRPC status code documentation: https://grpc.github.io/grpc/core/md_doc_statuscodes.html
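
      As a rough illustration of the keepalive idea, the sketch below builds a grpc-java channel with keepalive pings enabled. It is not the connector's actual change: the class name, the helper method, its parameters, and the use of a plaintext channel are assumptions made for the example, and it needs the grpc-java library on the classpath. Only the `ManagedChannelBuilder` calls are standard grpc-java API.

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.concurrent.TimeUnit;

final class VStreamKeepAliveSketch {

    /**
     * Hypothetical helper: builds a channel that sends an HTTP/2 keepalive ping
     * once the connection has seen no read activity for the given interval.
     */
    static ManagedChannel newChannel(String host, int port, long keepAliveMinutes) {
        return ManagedChannelBuilder.forAddress(host, port)
                // Assumption: an unencrypted endpoint, as in a local Docker setup.
                .usePlaintext()
                // Interval of inactivity after which a keepalive ping is sent.
                .keepAliveTime(keepAliveMinutes, TimeUnit.MINUTES)
                // How long to wait for the ping acknowledgement before
                // treating the connection as dead (value chosen for illustration).
                .keepAliveTimeout(20, TimeUnit.SECONDS)
                .build();
    }
}
```

      With the 1-minute interval mentioned above, a call would look like `newChannel(host, port, 1)`. Passing the interval as a parameter mirrors the suggestion to make it configurable.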

            Assignee: Unassigned
            Reporter: Shichao An (shichaoan)