  1. JBoss Enterprise Application Platform
  2. JBEAP-26515

EAP 8 on OpenShift - slow topology update on ARM and s390x due to jgroups socket connection errors

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • None
    • 8.0.0.GA-CR3
    • Clustering
    • False
    • None
    • False
    • Known Issue

      We noticed a new condition through an EAP 8 + RHDG interoperability test (EJB distributed timers) on OpenShift.
      The issue is now consistently reproducible on AWS ARM-based clusters, and the scenario is the following:

      1. A two-member (A, B) EAP 8 cluster has EJB timers configured to be persisted by Infinispan (provided by the EAP cluster itself, not a remote one)
      2. A timer is created, and B starts executing it
      3. During the timer execution, B is non-gracefully terminated (the pod is deleted)

      We would expect A to take over while OpenShift starts C to compensate for the deleted pod, but instead the takeover only happens once C is ready. Is this expected? One ready survivor remains, so why does it not take over the timer execution immediately?

      As stated, this only happens on clusters where our EAP application service instances take a long time to process topology updates.

      By looking at the logs, we could see that:

      • a message tracing the removal (i.e. {{ member has left the cluster}}) is logged only ~40 seconds after the pod has been deleted
      • the newly started pod boots up and immediately begins to output traces like:
      09:51:09,653 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-7,ee,eap-distributed-ejb-timers-app-1-kvvmw) 10.131.0.69:7600: failed connecting to 10.131.0.68:7600: java.net.SocketTimeoutException: Connect timed out
      09:51:09,653 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-7,ee,eap-distributed-ejb-timers-app-1-kvvmw) 10.131.0.69:7600: removed connection to 10.131.0.68:7600
      09:51:09,653 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-7,ee,eap-distributed-ejb-timers-app-1-kvvmw) JGRP000036: eap-distributed-ejb-timers-app-1-kvvmw: exception sending bundled msgs: java.net.SocketTimeoutException: Connect timed out
      09:51:09,653 TRACE [org.jgroups.protocols.TCP] (TQ-Bundler-7,ee,eap-distributed-ejb-timers-app-1-kvvmw) 10.131.0.69:7600: connecting to 10.131.0.68:7600 
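
For anyone triaging similar logs, the traces above can be reduced to a list of failed connection attempts per peer. The following is only a minimal sketch: the helper names (`failed_connects`, `count_by_peer`) are hypothetical, and the regular expression assumes the log lines have exactly the layout quoted in this report.

```python
import re
from collections import Counter

# Matches the "failed connecting to" TRACE lines in the layout quoted above.
# Assumption: timestamp, level, logger, thread name and message appear in this order.
FAILED_CONNECT = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2},\d{3}) TRACE \[org\.jgroups\.protocols\.TCP\] "
    r"\((?P<thread>[^)]*)\) (?P<local>\S+): failed connecting to (?P<peer>\S+): "
    r"(?P<error>.+)$"
)


def failed_connects(lines):
    """Yield (time, local, peer, error) for every failed connection attempt."""
    for line in lines:
        match = FAILED_CONNECT.search(line.rstrip())
        if match:
            yield match.group("time", "local", "peer", "error")


def count_by_peer(lines):
    """Count failed connection attempts per target peer address."""
    return Counter(peer for _, _, peer, _ in failed_connects(lines))


# The four trace lines from this report, used as sample input.
THREAD = "(TQ-Bundler-7,ee,eap-distributed-ejb-timers-app-1-kvvmw)"
PREFIX = "09:51:09,653 TRACE [org.jgroups.protocols.TCP] " + THREAD + " "
sample = [
    PREFIX + "10.131.0.69:7600: failed connecting to 10.131.0.68:7600: "
             "java.net.SocketTimeoutException: Connect timed out",
    PREFIX + "10.131.0.69:7600: removed connection to 10.131.0.68:7600",
    PREFIX + "JGRP000036: eap-distributed-ejb-timers-app-1-kvvmw: exception sending "
             "bundled msgs: java.net.SocketTimeoutException: Connect timed out",
    PREFIX + "10.131.0.69:7600: connecting to 10.131.0.68:7600",
]

events = list(failed_connects(sample))
per_peer = count_by_peer(sample)
```

Run against a full pod log, `count_by_peer` shows which peer address the new pod keeps retrying, and the timestamps in `failed_connects` show how long the retries persist.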

      This affects EAP 8 CR3 and is filed as critical, targeting EAP 8.0.z GA, since it does not block the timers feature per se or violate the Jakarta EJB timers specification.

              pferraro@redhat.com Paul Ferraro
              fburzigo Fabio Burzigotti