  1. Red Hat Data Grid
  2. JDG-1739

HotRod client leaking socket connections during java.net.UnknownHostException


Details

    • Bug
    • Resolution: Done
    • Critical
    • JDG 7.2 ER5
    • JDG 7.1.0 GA
    • HotRod Java client
    • None
    • ER5

      1. Set up a cluster of JDG servers.
      2. In a Hot Rod client, set the property infinispan.client.hotrod.server_list to a list of JDG server IPs with a space between each entry (see description).
      3. Turn on TRACE logging for org.infinispan to see the UnknownHostException in the console.
      4. Observe the open file count growing with the command: "lsof -p <EAP PID> | grep 'identify protocol' | wc -l"

    • JDG Sprint #10

    Description

      The customer set the property datagrid.hosts with a space after each separator in the server list, like this:

      <property name="datagrid.hosts" value="10.111.111.1:11222; 10.111.111.2:11222; 10.111.111.2:11222; 10.111.111.3:11222; 10.111.111.4:11222;"/>
      

      This led to an exception that is only visible with TRACE logging turned on:

      10:47:29,378 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Timer-13) Could not connect to server:  10.111.111.4:11222: java.net.UnknownHostException
      
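      The following is a minimal sketch of how the spaces can end up inside the host name. It is not the client's actual parsing code: the class ServerListSketch and the method hosts are hypothetical names used only for illustration, and the splitting logic is a simplified assumption. It shows that splitting the list on ";" alone keeps the leading space in every entry after the first, while trimming each entry removes it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Hot Rod client.
class ServerListSketch {

    /** Splits a ';'-separated "host:port" list and returns the host parts. */
    static List<String> hosts(String serverList, boolean trim) {
        List<String> hosts = new ArrayList<>();
        for (String entry : serverList.split(";")) {
            String candidate = trim ? entry.trim() : entry;
            if (candidate.trim().isEmpty()) {
                continue; // skip blank remnants between or after separators
            }
            int colon = candidate.lastIndexOf(':');
            hosts.add(colon < 0 ? candidate : candidate.substring(0, colon));
        }
        return hosts;
    }

    public static void main(String[] args) {
        String value = "10.111.111.1:11222; 10.111.111.2:11222; 10.111.111.3:11222;";
        // Without trimming, the second and third hosts keep their leading space.
        System.out.println(hosts(value, false));
        // With trimming, every host is a clean IP literal.
        System.out.println(hosts(value, true));
    }
}
```

      A host string that starts with a space is no longer a valid IP literal, which is consistent with the UnknownHostException seen in the trace log above.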

      Because host validation failed, the Hot Rod client kept retrying the connection to a host it could not resolve. On each attempt the TcpTransport class created a socket that the JVM did not release, so the open file count kept growing:

      00:00:01       totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
      ....
      16:30:02       649704       251         9         0         0       143
      16:40:01       649955       248         9         0         0       172
      16:50:01       650127       246         9         0         0        27   --> when it reached the nofile limits
      17:00:01       650120       244         9         0         0        24
      17:10:01       650124       242         9         0         0        26
      17:20:01       650124       234         9         0         0        27
      17:30:01       650121       218         9         0         0        27
      17:40:01       650135       217         9         0         0        25
      

      There were 649683 sockets held by the Java process:

      $ cat lsof | grep 31226 | grep sock | wc -l
      649683
      

      and 649681 of them were reported as "can't identify protocol":

      $ cat lsof | grep identify | grep sock | grep 31226 | wc -l
      649681
      

      Our suggestions are:

      1. Correct the addServers method of the ConfigurationBuilder class (line 96) to strip the spaces, or allow for spaces in the regexp ADDRESS_PATTERN.
      2. Set socket and socketChannel to null in the "finally" block of the TcpTransport constructor (lines 58-66). Without this, the OS leaves the socket files open until the nofile limit is reached or the process shuts down.
      3. The "could not connect" message should be logged at WARN rather than TRACE: log.tracef(e, "Could not connect to server: %s", serverAddress); (line 75). It should be visible in the console logs so that operations staff are warned of the problem. The JDG server may be unreachable, which causes trouble for the environment.
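      As an illustration of suggestions 2 and 3, here is a minimal sketch of releasing a socket when the connect attempt fails and reporting the failure at a visible log level. It is not the TcpTransport code. The class ConnectSketch and the method connectOrClose are hypothetical, java.util.logging stands in for the client's own logger, and the sketch calls close() explicitly rather than setting the fields to null, on the assumption that closing is what actually frees the file descriptor. The unresolved address "unresolved-host.invalid" is a made-up stand-in for the badly parsed host.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch for illustration only; not the Hot Rod client's TcpTransport.
class ConnectSketch {
    private static final Logger LOG = Logger.getLogger(ConnectSketch.class.getName());

    /** Tries to connect; on failure logs a warning and closes the socket. */
    static boolean connectOrClose(Socket socket, InetSocketAddress address, int timeoutMillis) {
        try {
            socket.connect(address, timeoutMillis);
            return true;
        } catch (IOException e) {
            // WARNING instead of a trace-level message, so operators see the failure.
            LOG.log(Level.WARNING, "Could not connect to server: " + address, e);
            try {
                socket.close(); // release the descriptor instead of leaking it
            } catch (IOException closeFailure) {
                e.addSuppressed(closeFailure);
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Socket socket = new Socket();
        // An unresolved address makes connect() fail with UnknownHostException.
        InetSocketAddress address =
                InetSocketAddress.createUnresolved("unresolved-host.invalid", 11222);
        boolean connected = connectOrClose(socket, address, 1000);
        System.out.println("connected=" + connected + ", closed=" + socket.isClosed());
    }
}
```

      Whether the real fix belongs in a catch block, a finally block, or the caller depends on how the transport is constructed; the sketch only shows the close-on-failure idea.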

      Please see the linked GSS ticket for more information about this matter.


            People

              rh-ee-galder Galder Zamarreño
              rhn-support-zanini Ricardo Zanini Fernandes