-
Bug
-
Resolution: Done
-
Critical
-
JDG 7.1.0 GA
-
None
The customer set the property datagrid.hosts with a space after each semicolon in the server list, like this:
<property name="datagrid.hosts" value="10.111.111.1:11222; 10.111.111.2:11222; 10.111.111.2:11222; 10.111.111.3:11222; 10.111.111.4:11222;"/>
This led to an exception (visible only with trace logging turned on):
10:47:29,378 TRACE [org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport] (Timer-13) Could not connect to server: 10.111.111.4:11222: java.net.UnknownHostException
The Hot Rod client keeps trying to connect to an unknown host because this host validation fails. The TcpTransport class creates the socket, but the JVM was not releasing it, so the count of open files kept growing:
00:00:01 totsck tcpsck udpsck rawsck ip-frag tcp-tw
....
16:30:02 649704 251 9 0 0 143
16:40:01 649955 248 9 0 0 172
16:50:01 650127 246 9 0 0 27 --> when it reached the nofile limits
17:00:01 650120 244 9 0 0 24
17:10:01 650124 242 9 0 0 26
17:20:01 650124 234 9 0 0 27
17:30:01 650121 218 9 0 0 27
17:40:01 650135 217 9 0 0 25
There were 649683 sockets held by the java process:
$ cat lsof | grep 31226 | grep sock | wc -l
649683
and 649681 of them are reported as "can't identify protocol":
$ cat lsof | grep identify | grep sock | grep 31226 | wc -l
649681
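The leak described above matches a common pattern: a channel is opened, the connect attempt fails, and nothing closes the channel afterwards. The following is a minimal sketch of closing the channel on the failure path. It is not the actual TcpTransport code; the class name `ConnectSketch` and the method `connectOrClose` are hypothetical, and it closes the channel explicitly rather than setting references to null.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;
import java.nio.channels.UnresolvedAddressException;

// Hypothetical helper, not part of the Hot Rod client.
final class ConnectSketch {

    // Tries to connect the given channel. If the attempt fails, the channel
    // is closed so that its file descriptor is released immediately instead
    // of lingering until garbage collection or process exit.
    static boolean connectOrClose(SocketChannel channel, InetSocketAddress address)
            throws IOException {
        try {
            return channel.connect(address);
        } catch (IOException | UnresolvedAddressException e) {
            // SocketChannel.connect throws UnresolvedAddressException when the
            // address could not be resolved (for example " 10.111.111.4").
            channel.close();
            return false;
        }
    }
}
```

With this shape, each failed retry releases its descriptor, so repeated attempts against a bad address would not accumulate open sockets.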
Our suggestions are:
1. Correct the addServers method of the ConfigurationBuilder class (line 96) to strip the spaces, or allow for spaces in the ADDRESS_PATTERN regexp.
2. Set socket and socketChannel to null in the "finally" block of the TcpTransport constructor (lines 58-66). Without this, the OS leaves the socket files open until the nofile limit is reached or the process shuts down.
3. The "could not connect" error message should be logged at WARN rather than TRACE level: log.tracef(e, "Could not connect to server: %s", serverAddress); (line 75). It should be clearly visible in the console logs to warn operations staff of the problem, because the JDG server may be out of reach and cause trouble for the environment.
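The first suggestion can be illustrated with a minimal sketch of whitespace-tolerant parsing for a semicolon-separated server list. This is not the ConfigurationBuilder implementation and does not use ADDRESS_PATTERN; the class `ServerListParser` and its `parse` method are hypothetical. As simplifying assumptions, it requires an explicit port on every entry and does not handle IPv6 literals.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser, shown only to illustrate trimming each entry.
final class ServerListParser {

    // Splits "host:port; host:port;" into addresses, ignoring surrounding
    // whitespace and empty entries such as the one after a trailing ';'.
    static List<InetSocketAddress> parse(String servers) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String entry : servers.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException(
                        "Expected host:port but got '" + trimmed + "'");
            }
            String host = trimmed.substring(0, colon);
            // Integer.parseInt throws NumberFormatException for a bad port.
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            // Unresolved on purpose: no DNS lookup happens while parsing.
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }
}
```

Applied to the customer's value, this would yield five entries with no leading space in any host name. Rejecting malformed entries with an exception, rather than skipping them, keeps a configuration mistake from going unnoticed.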
Please see the linked GSS ticket for more information on this matter.