Agroal / AG-274

Database is available, but db pool is stuck when creating new connections


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: 2.8
    • Component: pool
    • Severity: Important

      I'm currently trying to integrate TiDB with Keycloak.
      TiDB is a MySQL-compatible, horizontally scalable, fault-tolerant database.

      While testing the HA failover capabilities of TiDB I ran into some problems.
      The TiDB feature I explicitly wanted to test is that killing components inside the TiDB cluster only causes exceptions in currently running connections. From the cluster's perspective it is always possible to open a fresh connection through the TiDB cluster endpoint.

      So my expectation is that if I kill a TiDB cluster component, a currently running request to Keycloak will fail, because its database connection fails. But it should then be immediately possible for Keycloak to acquire a new connection.

      The test scenario is ~100 HTTP requests/sec hitting Keycloak (via several Keycloak Benchmark clients). The majority of those requests use a database connection.

      This expectation is not met. In summary: when the database was down for a short moment, in some situations the connection pool did not recover as quickly as the database itself.

      At the beginning I set:
      poolInitialSize: 75
      poolMaxSize: 75
      poolMinSize: 75
      url: jdbc:mysql://tiproxy:6000/keycloak?sslmode=verify-identity&ssl-ca=/opt/keycloak/.tidb/ca.crt&ssl-key=/opt/keycloak/.tidb/tls.key&ssl-cert=/opt/keycloak/.tidb/tls.crt

      I described the original error in more detail here: https://github.com/keycloak/keycloak/issues/42256
      I tried to mitigate the observed problems by additionally setting:
      QUARKUS_DATASOURCE_JDBC_ACQUISITION_TIMEOUT="5S"
      QUARKUS_DATASOURCE_JDBC_VALIDATE_ON_BORROW="true"

      That fixed some of the error behavior, but connections still occasionally got stuck during validation, so I also set:
      QUARKUS_DATASOURCE_JDBC_VALIDATION_QUERY_TIMEOUT="5S"

      But now there is the behavior that occasionally everything is stuck for ~2 minutes, with no idle connections left in the connection pool. The important point is that TiDB accepts new connections perfectly well during this time (I tested this with a Python script that runs in parallel and successfully acquires a new connection every second; a Java equivalent is sketched below).
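
      For reference, a minimal sketch of that parallel liveness probe (the original was a Python script; this is a Java equivalent under the same assumptions, and the credentials are placeholders):

      import java.sql.Connection;
      import java.sql.DriverManager;

      public class LivenessProbe {
          public static void main(String[] args) throws Exception {
              // Same endpoint the pool uses; user/password are placeholders
              String url = "jdbc:mysql://tiproxy:6000/keycloak";
              while (true) {
                  try (Connection c = DriverManager.getConnection(url, "keycloak", "<password>")) {
                      System.out.println("fresh connection OK at " + java.time.Instant.now());
                  } catch (Exception e) {
                      System.out.println("connect failed: " + e.getMessage());
                  }
                  Thread.sleep(1000); // one attempt per second
              }
          }
      }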

      I started investigating and found a few things in my specific configuration:

      When an HTTP thread acquires a connection, the connection is validated. (This can no longer get stuck, because VALIDATION_QUERY_TIMEOUT is now set.) If validation fails, the connection is removed from the connection pool. A sketch of what this boils down to follows below.
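
      Roughly what validate-on-borrow with a validation query timeout amounts to (my own sketch, not Agroal's actual implementation; the validation query is an assumption):

      import java.sql.Connection;
      import java.sql.SQLException;
      import java.sql.Statement;

      public class BorrowValidation {
          static boolean isValid(Connection connection) {
              try (Statement statement = connection.createStatement()) {
                  statement.setQueryTimeout(5);  // VALIDATION_QUERY_TIMEOUT="5S"
                  statement.execute("SELECT 1"); // assumed validation query
                  return true;
              } catch (SQLException e) {
                  return false; // caller removes the connection from the pool
              }
          }
      }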

      Then the HTTP thread tries to acquire another connection (loop here).
      During this acquire call, the HTTP thread checks whether the configured pool size is no longer met (because a connection was removed). If it is not met, a request to create a new connection is pushed to the PriorityScheduledExecutor (here or here).

      This PriorityScheduledExecutor runs with a SINGLE thread, which means it creates new connections sequentially, even when multiple HTTP threads each need a new connection immediately. That takes some time, which is not nice, but it is not the current problem. The problem I observed occurs when one connection-creation call gets stuck: then no new connections are created until the stuck call completes, as all connections are created by that single thread (see the sketch below).
      And exactly that happens from time to time.
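
      A minimal sketch (my own example, not Agroal code) of why a single-threaded executor serializes connection creation and why one blocked call stalls everything behind it:

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;

      public class SingleThreadStallDemo {
          public static void main(String[] args) throws Exception {
              ExecutorService executor = Executors.newSingleThreadExecutor();
              executor.submit(() -> {
                  try {
                      Thread.sleep(120_000); // stands in for a socket read that never returns
                  } catch (InterruptedException ignored) {
                  }
              });
              executor.submit(() -> System.out.println("second connection created"));
              // The second task runs only after ~2 minutes, which is exactly
              // the stall window observed in the pool.
              executor.shutdown();
          }
      }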

      I found the config property loginTimeout. It cannot be set through Keycloak or Quarkus, but I managed to inject it with a custom build. That didn't help, though.
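
      For reference, the standard JDBC hook for this is presumably something along these lines (a hypothetical sketch, not the actual custom-build patch; whether a given driver honors it is driver-specific):

      import java.sql.DriverManager;

      public class LoginTimeoutPatch {
          public static void main(String[] args) {
              // Global JDBC login timeout, in seconds, applied to subsequent
              // connection attempts made through DriverManager.
              DriverManager.setLoginTimeout(5);
          }
      }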

      I took a thread dump when the situation occurred, and the "agroal" thread had the following stack trace:

      "agroal-11@18610" tid=0x24 nid=NA runnable
        java.lang.Thread.State: RUNNABLE
      	  at sun.nio.ch.SocketDispatcher.read0(SocketDispatcher.java:-1)
      	  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
      	  at sun.nio.ch.NioSocketImpl.tryRead(NioSocketImpl.java:256)
      	  at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:307)
      	  at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:346)
      	  at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:796)
      	  at java.net.Socket$SocketInputStream.read(Socket.java:1099)
      	  at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:489)
      	  at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:483)
      	  at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70)
      	  at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1461)
      	  at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1066)
      	  at java.io.FilterInputStream.read(FilterInputStream.java:119)
      	  at com.mysql.cj.protocol.FullReadInputStream.readFully(FullReadInputStream.java:64)
      	  at com.mysql.cj.protocol.a.SimplePacketReader.readHeaderLocal(SimplePacketReader.java:81)
      	  at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:63)
      	  at com.mysql.cj.protocol.a.SimplePacketReader.readHeader(SimplePacketReader.java:45)
      	  at com.mysql.cj.protocol.a.NativeProtocol.readMessage(NativeProtocol.java:576)
      	  at com.mysql.cj.protocol.a.NativeProtocol.checkErrorMessage(NativeProtocol.java:762)
      	  at com.mysql.cj.protocol.a.NativeProtocol.checkErrorMessage(NativeProtocol.java:740)
      	  at com.mysql.cj.protocol.a.NativeProtocol.checkErrorMessage(NativeProtocol.java:157)
      	  at com.mysql.cj.protocol.a.NativeAuthenticationProvider.proceedHandshakeWithPluggableAuthentication(NativeAuthenticationProvider.java:471)
      	  at com.mysql.cj.protocol.a.NativeAuthenticationProvider.connect(NativeAuthenticationProvider.java:215)
      	  at com.mysql.cj.protocol.a.NativeProtocol.connect(NativeProtocol.java:1428)
      	  at com.mysql.cj.NativeSession.connect(NativeSession.java:133)
      	  at com.mysql.cj.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:935)
      	  at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:805)
      	  - locked <0x4e14> (a com.mysql.cj.jdbc.ConnectionImpl)
      	  at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:438)
      	  at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:241)
      	  at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:189)
      	  at io.agroal.pool.ConnectionFactory.createConnection(ConnectionFactory.java:225)
      	  at io.agroal.pool.ConnectionPool$CreateConnectionTask.call(ConnectionPool.java:625)
      	  at io.agroal.pool.ConnectionPool$CreateConnectionTask.call(ConnectionPool.java:606)
      	  at java.util.concurrent.FutureTask.run(FutureTask.java:317)
      	  at io.agroal.pool.util.PriorityScheduledExecutor.beforeExecute(PriorityScheduledExecutor.java:75)
      	  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
      	  at java.lang.Thread.runWith(Thread.java:1596)
      	  at java.lang.Thread.run(Thread.java:1583)
      

      This part of the stack trace looked interesting:
      sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:307)

      if (timeout > 0) {
          // read with timeout
          n = timedRead(fd, b, off, len, MILLISECONDS.toNanos(timeout));
      } else {
          // read, no timeout
          n = tryRead(fd, b, off, len);
          while (IOStatus.okayToRetry(n) && isOpen()) {
              park(fd, Net.POLLIN);
              n = tryRead(fd, b, off, len);
          }
      }
      

      When the connection-creation code runs, the timeout is 0, so the read waits indefinitely for the socket to deliver something. That evidently happens from time to time, and it results in the stuck situation described above.
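
      The difference is easy to demonstrate in isolation (my own sketch; host and port are taken from the setup above). With SO_TIMEOUT left at 0 a read on a silent socket blocks forever, while a positive value turns the hang into a SocketTimeoutException:

      import java.io.InputStream;
      import java.net.Socket;
      import java.net.SocketTimeoutException;

      public class SoTimeoutDemo {
          public static void main(String[] args) throws Exception {
              try (Socket socket = new Socket("tiproxy", 6000)) {
                  socket.setSoTimeout(5000); // 0 would mean "block indefinitely"
                  InputStream in = socket.getInputStream();
                  byte[] buf = new byte[4096];
                  in.read(buf); // the MySQL-protocol greeting arrives immediately
                  try {
                      // The server now waits for our handshake reply and sends
                      // nothing more; with soTimeout == 0 this read would hang forever.
                      in.read(buf);
                  } catch (SocketTimeoutException e) {
                      System.out.println("read timed out after 5s instead of hanging");
                  }
              }
          }
      }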

      The timeout is set within the com.mysql.cj.protocol.a.NativeSocketConnection class
      in the connect method by looking up the socketTimeout property:

      
      int socketTimeout = propSet.getIntegerProperty(PropertyKey.socketTimeout).getValue();
      if (socketTimeout != 0) {
          try {
              this.mysqlSocket.setSoTimeout(socketTimeout);
          } catch (Exception ex) {
              /* Ignore if the platform does not support it */
          }
      }
      

      socketTimeout can be set within the JDBC connection URL.

      To be sure, I also set connectTimeout.

      Therefore I configured the following URL:
      jdbc:mysql://tiproxy:6000/keycloak?socketTimeout=5000&connectTimeout=5000&sslmode=verify-identity&ssl-ca=/opt/keycloak/.tidb/ca.crt&ssl-key=/opt/keycloak/.tidb/tls.key&ssl-cert=/opt/keycloak/.tidb/tls.crt

      That resolves the occasionally stuck connection creation, or at least limits the stall to no more than 5 seconds.
      I am not sure what side effects this has on the execution of normal SQL statements by the application that later uses the created connection.

      Ideally, it would be possible to specify a dedicated Agroal config option that sets the socketTimeout but only takes effect during connection creation (if that is possible?). A programmatic sketch follows below.
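
      For illustration, this is roughly how the socketTimeout is passed down today as a plain JDBC property via Agroal's programmatic configuration (a sketch from memory of the io.agroal.api supplier API; the proposed option would instead apply the timeout only while the connection is being established):

      import io.agroal.api.AgroalDataSource;
      import io.agroal.api.configuration.supplier.AgroalDataSourceConfigurationSupplier;

      public class PoolSetup {
          public static void main(String[] args) throws Exception {
              AgroalDataSourceConfigurationSupplier configuration = new AgroalDataSourceConfigurationSupplier()
                      .connectionPoolConfiguration(cp -> cp
                              .initialSize(75).minSize(75).maxSize(75)
                              .connectionFactoryConfiguration(cf -> cf
                                      .jdbcUrl("jdbc:mysql://tiproxy:6000/keycloak")
                                      // Applies to the whole connection lifetime, not just creation:
                                      .jdbcProperty("socketTimeout", "5000")
                                      .jdbcProperty("connectTimeout", "5000")));
              try (AgroalDataSource dataSource = AgroalDataSource.from(configuration)) {
                  // use dataSource.getConnection() ...
              }
          }
      }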

              Assignee: Luis Barreiro
              Reporter: Dennis Kniep