When the FILE_PING protocol is used, it periodically prints the following warning in the log:
2012-03-19 16:20:41,057 [ Timer-5,<ADDR>] WARN [org.jgroups.protocols.FILE_PING] failed reading 83dc9dfe-8dd4-eff2-4474-d57dbaa96143.node: removing it
This is most likely because all members write to the same directory at arbitrary times, and reads are not synchronized with those writes.
Hence, given a long enough run, a reader will eventually pick up a partially written file, which then looks corrupt.
This occurs more often the slower the shared file system is (e.g. a slow NFS mount).
I will upload a patch that makes two modifications to the FILE_PING class.
1) Writing to a file is done in two steps.
First we write to a temporary file so that the "readAll" method does not pick up a half-written file.
Then we do a semi-atomic move of the tmp file to the proper node file.
2) Reading the node files retries a few times if a file cannot be read.
This provides a simple retry mechanism for the case where a file is half written and therefore not yet readable.
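
The two changes above can be sketched roughly as follows. This is an illustration only, not the code from the patch: the class name NodeFileIo, the method names, the ".tmp" suffix and the retry parameters are all hypothetical, and it uses java.nio.file, which may differ from what FILE_PING itself uses. It also retries only on I/O errors; a real reader would presumably retry on a failed parse of the node data as well, and would need to skip the ".tmp" files when listing the directory.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of JGroups.
final class NodeFileIo {

    // Step 1: write to a temporary sibling file, then move it into place,
    // so that a concurrent reader never sees a half-written node file.
    static void writeAtomically(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data);
        try {
            // Whether an existing target is replaced is implementation
            // specific; on POSIX file systems this is typically a rename().
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback when the file system cannot move atomically;
            // this is the "semi-atomic" case.
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Step 2: try to read the file up to 'attempts' times, pausing between
    // tries. Returns null if every attempt fails.
    static byte[] readWithRetry(Path file, int attempts, long pauseMillis)
            throws InterruptedException {
        for (int i = 1; i <= attempts; i++) {
            try {
                return Files.readAllBytes(file);
            } catch (IOException e) {
                if (i < attempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        return null;
    }
}
```

Because the temporary file lives in the same directory as the target, the move stays within one file system, which is what gives the rename a chance of being atomic.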