The customer encounters situations, when at server start or after some time one of their server starts to write very big error logs very fast, because of some jgroups issue. This can only be stopped by a server restart.
They use AWS based on jgroups for communication
Here are two example log messages:
server start:
[2020-09-29 06:29:01.625 +0200] ERROR <*host name*> ES0 appserver0 [] [] org.jgroups.protocols.FILE_PING [] [] [] [] "ip-<**-**-**-**><*host name*>-startStop-1" JGRP000187: failed reading line of input stream java.lang.IllegalArgumentException: Invalid UUID string: at org.jgroups.util.UUID.fromString(UUID.java:139) at org.jgroups.protocols.Discovery.read(Discovery.java:455) at org.jgroups.protocols.FILE_PING.read(FILE_PING.java:269) at org.jgroups.protocols.FILE_PING.read(FILE_PING.java:263) at org.jgroups.protocols.FILE_PING.readAll(FILE_PING.java:239) at org.jgroups.protocols.FILE_PING.findMembers(FILE_PING.java:124) at org.jgroups.protocols.Discovery.invokeFindMembers(Discovery.java:217) at org.jgroups.protocols.Discovery.findMembers(Discovery.java:244) at org.jgroups.protocols.Discovery.down(Discovery.java:383) at org.jgroups.protocols.FILE_PING.down(FILE_PING.java:119) at org.jgroups.protocols.MERGE3.down(MERGE3.java:274) at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:375) at org.jgroups.protocols.FD_ALL2.down(FD_ALL2.java:214) at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:102) at org.jgroups.protocols.pbcast.NAKACK2.down(NAKACK2.java:555) at org.jgroups.protocols.UNICAST3.down(UNICAST3.java:595) at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:350) at org.jgroups.protocols.pbcast.ClientGmsImpl.joinInternal(ClientGmsImpl.java:72) at org.jgroups.protocols.pbcast.ClientGmsImpl.join(ClientGmsImpl.java:41) at org.jgroups.protocols.pbcast.GMS.down(GMS.java:1054) at org.jgroups.protocols.FlowControl.down(FlowControl.java:300) at org.jgroups.protocols.FRAG2.down(FRAG2.java:138) at org.jgroups.protocols.RSVP.down(RSVP.java:103) at org.jgroups.protocols.pbcast.StreamingStateTransfer.down(StreamingStateTransfer.java:183) at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:908) at org.jgroups.JChannel.down(JChannel.java:627) at org.jgroups.JChannel._connect(JChannel.java:855) at org.jgroups.JChannel.connect(JChannel.java:352) at org.jgroups.JChannel.connect(JChannel.java:343) at com.intershop.beehive.messaging.internal.jgroups.JGroupsMessenger.createJGroupsObjAndReceiver(JGroupsMessenger.java:230) at com.intershop.beehive.messaging.internal.jgroups.JGroupsMessenger.createResourcesAndReceivers(JGroupsMessenger.java:183) at com.intershop.beehive.messaging.capi.AbstractMessenger.initialize(AbstractMessenger.java:154) at com.intershop.beehive.messaging.capi.MessengerProvider.createMessengerInstance(MessengerProvider.java:133) at com.intershop.beehive.messaging.capi.MessengerProvider.createMessenger(MessengerProvider.java:68) at com.intershop.beehive.core.internal.event.EventMgrImpl.onStartupHook(EventMgrImpl.java:168) at com.intershop.beehive.core.internal.environment.ServerEnvironment.onStartupHook(ServerEnvironment.java:1126) at com.intershop.beehive.core.internal.environment.ServerEnvironment.startup(ServerEnvironment.java:2064) at com.intershop.beehive.core.internal.environment.ApplicationContextListener.contextInitialized(ApplicationContextListener.java:47) at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4939) at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5434) at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1559) at org.apache.catalina.core.ContainerBase$StartChild.call(ContainerBase.java:1549) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)
issue happening later:
[2020-09-27 03:55:14.342 +0200] ERROR <*host name*> ES0 appserver0 [] [] org.jgroups.protocols.FILE_PING [] [] [] [] "Timer runner-2,eventChannel,ip-<**-**-**-**>-21247" JGRP000187: failed reading line of input stream java.lang.IllegalArgumentException: Invalid UUID string: at org.jgroups.util.UUID.fromString(UUID.java:139) at org.jgroups.protocols.Discovery.read(Discovery.java:455) at org.jgroups.protocols.FILE_PING.read(FILE_PING.java:269) at org.jgroups.protocols.FILE_PING.read(FILE_PING.java:263) at org.jgroups.protocols.FILE_PING.readAll(FILE_PING.java:239) at org.jgroups.protocols.FILE_PING.findMembers(FILE_PING.java:124) at org.jgroups.protocols.Discovery.invokeFindMembers(Discovery.java:217) at org.jgroups.protocols.Discovery.findMembers(Discovery.java:244) at org.jgroups.protocols.Discovery.down(Discovery.java:387) at org.jgroups.protocols.FILE_PING.down(FILE_PING.java:119) at org.jgroups.protocols.MERGE3$InfoSender.run(MERGE3.java:412) at org.jgroups.util.TimeScheduler3$Task.run(TimeScheduler3.java:328) at org.jgroups.util.TimeScheduler3$RecurringTask.run(TimeScheduler3.java:362) at org.jgroups.util.TimeScheduler3.submitToPool(TimeScheduler3.java:244) at org.jgroups.util.TimeScheduler3.run(TimeScheduler3.java:200) at java.lang.Thread.run(Thread.java:748)
In the stack traces the original IPs/ host names of the servers were replaced by <*-**>/<*host name>.
In both cases the message is written very fast very often and always from the same thread.
We have error log examples of these sizes:
9,1 GB (server startup) - 2453051 occurrences from [2020-09-29 06:19:01.013 +0200] until [2020-09-29 06:29:01.983 +0200] (~10 min)
450 MB ((the timer issue above) - 349586 occurrences from [2020-09-27 03:55:14.342 +0200] until [2020-09-27 04:10:20.213 +0200] (~15 min)
Steps to repeat:
- this happens occasionally. no steps to repeat available
Investigations:
Especially the second stack trace consists of jgroups classes only.
With the given sources I analyzed how these messages could be created.
message JGRP000187 is defined here: JGroups/conf/jg-messages.properties
FailedReadingLineOfInputStream = JGRP000187: failed reading line of input stream
Look around this line from the stack trace
---------------------------------------------------------------------------
at org.jgroups.util.UUID.fromString(UUID.java:139)
---------------------------------------------------------------------------
/*136*/ public static UUID fromString(String name) { /*137*/ String[] components = name.split("-"); /*138*/ if (components.length != 5) /*139*/ throw new IllegalArgumentException("Invalid UUID string: "+name);
The exception is thrown when the provided string cannot be split into 5 parts by '-".
We see that the string name is logged appended to the message "Invalid UUID string: "
In our error message example no value can be seen for name so it must be an empty string that was given to the method UUID.fromString(String name). An empty string obviously does not have 5 parts when split by '-'
Where is the UUID str given to org.jgroups.util.UUID.fromString?
Look around this line from the stack trace
---------------------------------------------------------------------------
at org.jgroups.protocols.Discovery.read(Discovery.java:455)
---------------------------------------------------------------------------
/*437*/ protected List<PingData> read(InputStream in) { /*438*/ List<PingData> retval=null; /*439*/ try { /*440*/ while(true) { /*441*/ try { /*442*/ String name_str=Util.readToken(in); /*443*/ String uuid_str=Util.readToken(in); /*444*/ String addr_str=Util.readToken(in); /*445*/ String coord_str=Util.readToken(in); /*446*/ if(name_str == null || uuid_str == null || addr_str == null || coord_str == null) /*447*/ break; /*448*/ /*449*/ UUID uuid=null; /*450*/ try { /*451*/ long tmp=Long.parseLong(uuid_str); /*452*/ uuid=new UUID(0, tmp); /*453*/ } /*454*/ catch(Throwable t) { /*455*/ uuid=UUID.fromString(uuid_str); /*456*/ } /*457*/ /*458*/ PhysicalAddress phys_addr=new IpAddress(addr_str); /*459*/ boolean is_coordinator=coord_str.trim().equals("T") || coord_str.trim().equals("t"); /*460*/ /*461*/ if(retval == null) /*462*/ retval=new ArrayList<>(); /*463*/ retval.add(new PingData(uuid, true, name_str, phys_addr).coord(is_coordinator)); /*464*/ } /*465*/ catch(Throwable t) { /*466*/ log.error(Util.getMessage("FailedReadingLineOfInputStream"), t); /*467*/ } /*468*/ } /*469*/ return retval; /*470*/ } /*471*/ finally { /*472*/ Util.close(in); /*473*/ } /*474*/ }
Here from an input stream four values are read for the different servers in the jgroups cluster: name_str, uuid_str, addr_str and coord_str
This happens in a while(true) loop.
We can leave this loop only by a break.
This break happens at line 447 when the condition in 446 is met:
if(name_str == null || uuid_str == null || addr_str == null || coord_str == null)
So if at least one of the four strings read before would be null we could leave this loop.
We can see that the catch block simply logs the message but does not inerrupt the outer while loop.
The string uuid_str is read into the local variable in line 443 by String uuid_str=Util.readToken(in)
Lets have a look at Util.readToken:
/*1774*/ /** Returns whitespace-separated strings from the input stream, or null if the end of the stream has been reached */ /*1775*/ public static String readToken(InputStream in) { /*1776*/ StringBuilder sb=new StringBuilder(); /*1777*/ boolean first=true; /*1778*/ int ch; /*1779*/ while(true) { /*1780*/ try { /*1781*/ ch=in.read(); /*1782*/ if(ch == -1) /*1783*/ return sb.length() > 0? sb.toString() : null; /*1784*/ if(Character.isWhitespace(ch)) { /*1785*/ if(first) /*1786*/ continue; /*1787*/ break; /*1788*/ } /*1789*/ sb.append((char)ch); /*1790*/ first=false; /*1791*/ } /*1792*/ catch(IOException e) { /*1793*/ break; /*1794*/ } /*1795*/ } /*1796*/ return sb.toString(); /*1797*/ }
A new instance of StringBuilder is created at line 1776.
The boolean variable first is set to true in line 1777
Then a while(true) loop is started.
It can only be left by a break.
In the try block we try to read from the given input stream in.
1. If the read character is -1 ( end of the stream ) the while loop is left and a value is returned.
If the string builder sb already has a length > 0 it will be converted to a string and returned. If sb is still empty we return a null.
2. If the character is a white space character and we are still at the beginning of the reading (first is true) we will continue until we find a none-whitespace character or the end of the stream. If first is no longer true when a white space character is read the break will bring us to the end of the loop.
Here all read content is returned as a string converted from sb.
3. The character that was read is not a -1 and not a whitespace: We append it to sb and set first to false.
4. In case of an IOException the while loop is left, no log message is written.
Here all read content is returned as a string converted from sb.
The only possibility for this method to return an empty string (not null) would be: We encounter an IOException at the first reading from in or after only white space characters were read.
Then the code on line 1796 would return an empty string.
Now have a look at the code read of Discovery again:
Reaching the end of the input stream in Util.readToken(in) would be the condition to end reading from the input stream and to return all found PingData as a list.
With an empty string for every read string variable the break will never be reached and the while loop will not be left, writing an error message at the caught exception on the trial to convert the empty string in uuid_str to a valid UUID.
So in case the provided input stream throws an IOException on every trial to read from it, we would be in an endless loop.
We can have a look at which input stream we are in here
Look around this line from the stack trace
---------------------------------------------------------------------------
at org.jgroups.protocols.FILE_PING.readAll(FILE_PING.java:239)
---------------------------------------------------------------------------
/*256*/ protected void readAll(List<Address> members, String clustername, Responses responses) { /*257*/ File dir=new File(root_dir, clustername); /*253*/ if(!dir.exists()) /*254*/ dir.mkdir(); /*230*/ /*231*/ File[] files=dir.listFiles(filter); // finds all files ending with '.list' /*232*/ for(File file: files) { /*233*/ List<PingData> list=null; /*234*/ // implementing a simple spin lock doing a few attempts to read the file /*235*/ // this is done since the file may be written in concurrency and may therefore not be readable /*236*/ for(int i=0; i < 3; i++) { /*237*/ if(file.exists()) { /*238*/ try { /*239*/ if((list=read(file)) != null) /*240*/ break; /*241*/ } /*242*/ catch(Exception e) { /*243*/ } /*244*/ } /*245*/ Util.sleep(50); /*246*/ } /*247*/ /*248*/ if(list == null) { /*249*/ log.warn("failed reading " + file.getAbsolutePath()); /*250*/ continue; /*251*/ } /*252*/ for(PingData data: list) { /*253*/ if(members == null || members.contains(data.getAddress())) /*254*/ responses.addResponse(data, true); /*255*/ if(local_addr != null && !local_addr.equals(data.getAddress())) /*256*/ addDiscoveryResponseToCaches(data.getAddress(), data.getLogicalName(), data.getPhysicalAddr()); /*257*/ } /*258*/ } /*259*/ }
Here we read the ping data from a file in a directory that was scanned for all files with the extension .list.
In our application server that would be one of these directories:
[IS_SHARE]/system/jgroups/appserverSyncChannel
[IS_SHARE]/system/jgroups/cacheSyncChannel
[IS_SHARE]/system/jgroups/ormSyncChannel
[IS_SHARE]/system/jgroups/tcmSyncChannel
So my assumption is that our error occurs when somehow a file that just existed is not reachable anymore (if deleted or unreachable because of network issues) and the input stream for it gets unusable, throwing IOExceptions on every trial to read from it.
My idea how to fix that would be to change the return value of Util readToken on line 1796 like this:
old: return sb.toString(); new: return sb.length() > 0? sb.toString() : null;
Then we would not return an empty value on an IOException and could break the loop in Discovery for a better error handling.
I already had a look at the code in the most recent version 5.0.4: All involved methods do look still the same.
I assume we should file a bug report to the jgroups community.
We suspect an endless loop in the jgroups code.
The used jgroups version is 4.2.0.
The warn log does not contain any relevant messages for jgroups at this time
I can provide the original logs for analysis