[ISPN-12652] Cluster is broken after one node is down

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version: 13.0.0.Final
    • Affects Version: 9.4.14.Final
    • Component: None
    • Labels: None

    Description

      We have a 3-node cluster: app1, app2 and app3. App1 went down ungracefully because of a hardware issue. After that, app2 and app3 started failing with errors like the following:
       
      ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p23-t1) ISPN000136: Error executing command RemoveCommand on Cache 'fs.war', writing keys [SessionCreationMetaDataKey(PGARVVdjGKfifzrVfyd7HAllbrwaRG7wLhKha1On)]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 422657 from app1
          at org.infinispan@9.4.14.Final//org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
          at org.infinispan@9.4.14.Final//org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
          at org.infinispan@9.4.14.Final//org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
       
      As a result, the two remaining nodes (app2 and app3) could not serve user requests until app1 recovered. Is this expected behaviour? Shouldn't Infinispan detect that a node is down, remove it from the cluster, and notify app2 and app3? I know there is a VERIFY_SUSPECT protocol for this, but that did not happen.
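
      For illustration, the expected behaviour can be modelled with a toy sketch: a node that stays silent longer than a timeout is suspected, the suspicion is double-checked with a direct probe, and only then is the node dropped from the view. This is not JGroups or Infinispan code. The class name FailureDetectorSketch, the 10-second timeout, and the always-failing probe are hypothetical and chosen only for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

/** Toy model of heartbeat-based failure detection (hypothetical, not JGroups code). */
class FailureDetectorSketch {
    private final long timeoutMillis;
    // Direct probe used to double-check a suspect; returns true if the node still answers.
    private final Predicate<String> verifyProbe;
    // Last time (in milliseconds) a heartbeat was seen from each member.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    FailureDetectorSketch(long timeoutMillis, Predicate<String> verifyProbe) {
        this.timeoutMillis = timeoutMillis;
        this.verifyProbe = verifyProbe;
    }

    /** Records that a heartbeat from the given node arrived at the given time. */
    void heartbeat(String node, long nowMillis) {
        lastHeartbeat.put(node, nowMillis);
    }

    /**
     * Drops every node that has been silent for longer than the timeout
     * and also fails the verification probe, then returns the surviving view.
     */
    Set<String> check(long nowMillis) {
        lastHeartbeat.entrySet().removeIf(e ->
                nowMillis - e.getValue() > timeoutMillis && !verifyProbe.test(e.getKey()));
        return new TreeSet<>(lastHeartbeat.keySet());
    }

    public static void main(String[] args) {
        // Assumption: the probe always fails, as it would for a crashed host.
        FailureDetectorSketch fd = new FailureDetectorSketch(10_000, node -> false);
        fd.heartbeat("app1", 0);
        fd.heartbeat("app2", 0);
        fd.heartbeat("app3", 0);
        // app1 crashes; only app2 and app3 keep sending heartbeats.
        fd.heartbeat("app2", 9_000);
        fd.heartbeat("app3", 9_000);
        System.out.println(fd.check(12_000)); // prints [app2, app3]
    }
}
```

      In this model app2 and app3 would continue with a two-member view once app1 is excluded, instead of timing out on every write that targets app1.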
       

          People

            Assignee: Pedro Ruivo (pruivo@redhat.com)
            Reporter: Dmitry Kruglikov (coldsun1987@gmail.com) (Inactive)
            Votes: 11
            Watchers: 7
