Before the ISPN-12598 fix, each client operation could decide to switch to another cluster or back to the initial server list after max-retries + 1 transport errors (e.g. a connection or socket timeout). This meant a client with max-retries == 0 would attempt to switch after every transport error, causing a pseudo-infinite cycle of back-and-forth switching.
After the ISPN-12598 fix, a client operation only tries to switch to another cluster or to the initial server list after it has marked all the servers as failed. Now we have the opposite problem: if a client has max-retries < cluster size, a single operation can never mark all the servers as failed, so it will never switch.
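The arithmetic behind the second problem can be made concrete. This is a hypothetical helper, not Infinispan code; the name `canMarkAllServersFailed` is illustrative:

```java
// Hypothetical helper illustrating the retry arithmetic; not Infinispan code.
public class RetrySwitchMath {
    // An operation contacts one server initially plus one per retry,
    // so it can mark at most maxRetries + 1 distinct servers as failed.
    static boolean canMarkAllServersFailed(int maxRetries, int clusterSize) {
        return maxRetries + 1 >= clusterSize;
    }

    public static void main(String[] args) {
        // With max-retries 2 and a 4-server cluster, an operation reaches
        // at most 3 servers, so the "all servers failed" switch never fires.
        System.out.println(canMarkAllServersFailed(2, 4)); // prints false
    }
}
```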
The solution is to move the tracking of failed servers from the individual operation level (RetryOnFailureOperation) to the remote cache manager level (e.g. into ChannelFactory), and to decide globally when to switch:
- Log an error when the initial connection to a server fails (e.g. it times out because the server requires encryption and the client doesn't have it)
- Define a server connection as failed, and close it, when at least one operation is waiting for a server response on that connection and no response has arrived for more than socketTimeout millis
- When a server drops to 0 connections, start counting connection attempts against max-retries
- When the count of failed connection attempts reaches max-retries, mark the server as failed
- Only attempt to re-connect to a failed server when there's a new topology update that includes it
- Or at least prevent the client from trying to open more than one connection to the same server at a time
- When all servers are marked as failed, try to switch to another cluster or to the initial server list
- Again, prevent any new connection attempts while a switch is in progress
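The rules above could be sketched at the remote cache manager level roughly as follows. This is a minimal sketch under stated assumptions: `ServerFailureTracker`, `onConnectionFailed`, and `onTopologyUpdate` are hypothetical names, not the actual Infinispan API.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical cache-manager-level failure tracking (e.g. held by
// ChannelFactory); names and shapes are illustrative only.
public class ServerFailureTracker {
    private final int maxRetries;
    private final Map<String, AtomicInteger> failedAttempts = new ConcurrentHashMap<>();
    private final Set<String> failedServers = ConcurrentHashMap.newKeySet();
    private volatile boolean switchInProgress;

    public ServerFailureTracker(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // Called when a connection attempt to a server with 0 live connections
    // fails. Returns true when every known server is now failed, i.e. the
    // caller should switch to another cluster or to the initial server list.
    public synchronized boolean onConnectionFailed(String server, Set<String> allServers) {
        if (switchInProgress) {
            return false; // no new connection attempts while a switch is running
        }
        int attempts = failedAttempts
                .computeIfAbsent(server, s -> new AtomicInteger())
                .incrementAndGet();
        if (attempts >= maxRetries) {
            failedServers.add(server); // count reached max-retries: server failed
        }
        if (!allServers.isEmpty() && failedServers.containsAll(allServers)) {
            switchInProgress = true;
            return true;
        }
        return false;
    }

    // A topology update that includes a failed server makes it eligible
    // for reconnection again, and clears any in-progress switch flag.
    public synchronized void onTopologyUpdate(Set<String> serversInTopology) {
        for (String server : serversInTopology) {
            if (failedServers.remove(server)) {
                failedAttempts.remove(server);
            }
        }
        switchInProgress = false;
    }
}
```

Keeping the counters in one shared object, rather than per operation, is what lets a client with small max-retries still accumulate enough failures across operations to mark every server failed and trigger the switch.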