  1. WildFly WIP
  2. WFWIP-435

Messaging Broker HA Live / Backup Pairs with shared Store on EFS: startup fails


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Blocker
    • Artemis
    • None
      • Create 2 EC2 instances using the Red Hat JBoss EAP AMI (RHEL-7-JBEAP-7.4.0_HVM_GA-20210909-x86_64-0-Access2-GP2)
      • Attach the same EFS storage to both nodes
      • Configure the nodes using the attached scripts
    • Workaround Exists

      Restart both EAP instances that compose the HA Live / Backup pair.


      Scenario:

      This scenario is inspired by High Availability - Shared Store and is an attempt to replicate that setup on AWS using AWS EFS as storage:

      • we have 2 EC2 instances created from the Red Hat AMI (RHEL-7-JBEAP-7.4.0_HVM_GA-20210909-x86_64-0-Access2-GP2); both EC2 instance types must support multi-attach (e.g. t3.medium)
      • the first instance is configured as Live node (LIVE.standalone-ec2-full-ha.xml)
      • the second instance is configured as Backup node (BACKUP.standalone-ec2-full-ha.xml)
      • both instances use shared storage on an external AWS EFS file system, which is mounted on both EC2 instances using the NFSv4 protocol; note that this is possible since both EC2 instance types support multi-attach

      This scenario presents two main flaws:

      • startup fails
      • EFS is slower than other storage solutions such as EBS

      startup fails

      The Slave was started at 09:50:20 and the Master at 09:52:33, so the Slave was started a little over 2 minutes (2 min 13 s) before the Master node.
      Note that the error also occurs if the start sequence is reversed.
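
      The gap between the two start times quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the helper name gap_seconds is made up for illustration, and both timestamps are assumed to fall on the same day.

```python
from datetime import datetime


def gap_seconds(first, second, fmt="%H:%M:%S"):
    """Return the number of seconds between two same-day wall-clock times."""
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return delta.total_seconds()


# Start times as reported in this issue.
slave_start = "09:50:20"
master_start = "09:52:33"

gap = gap_seconds(slave_start, master_start)
minutes, seconds = divmod(int(gap), 60)

if __name__ == "__main__":
    print(f"Slave started {minutes} min {seconds} s before Master ({gap:.0f} s)")
```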

      When the Live/Backup pair is first started, both nodes log the following errors and it is not possible to send messages to or receive messages from the Master node:

      Slave:

      2022-01-21 09:52:54,398 INFO  [org.infinispan.CLUSTER] (thread-7,ejb,ip-172-31-22-146) ISPN000094: Received new cluster view for channel ejb: [ip-172-31-22-146|1] (2) [ip-172-31-22-146, ip-172-31-18-82]
      2022-01-21 09:52:54,399 INFO  [org.infinispan.CLUSTER] (thread-7,ejb,ip-172-31-22-146) ISPN100000: Node ip-172-31-18-82 joined the cluster
      2022-01-21 09:52:54,544 WARN  [org.apache.activemq.artemis.core.server] (Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@3faf32ff)) AMQ222137: Unable to announce backup, retrying: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219012: Timed out waiting to receive initial broadcast from cluster]
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.executeDiscovery(ServerLocatorImpl.java:767)
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:655)
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:549)
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:528)
          at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:267)
          at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
          at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
          at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
      

      Master:

      2022-01-21 09:52:56,287 INFO  [org.infinispan.CLUSTER] (ServerService Thread Pool -- 87) ISPN000094: Received new cluster view for channel ejb: [ip-172-31-22-146|1] (2) [ip-172-31-22-146, ip-172-31-18-82]
      2022-01-21 09:52:56,293 INFO  [org.infinispan.CLUSTER] (ServerService Thread Pool -- 84) ISPN000079: Channel ejb local address is ip-172-31-18-82, physical addresses are [172.31.18.82:7600]
      2022-01-21 09:52:56,297 INFO  [org.infinispan.CLUSTER] (ServerService Thread Pool -- 85) ISPN000079: Channel ejb local address is ip-172-31-18-82, physical addresses are [172.31.18.82:7600]
      2022-01-21 09:52:56,307 INFO  [org.infinispan.CLUSTER] (ServerService Thread Pool -- 87) ISPN000079: Channel ejb local address is ip-172-31-18-82, physical addresses are [172.31.18.82:7600]
      2022-01-21 09:52:56,358 INFO  [org.apache.activemq.artemis.core.server] (ServerService Thread Pool -- 88) AMQ221034: Waiting indefinitely to obtain live lock
      2022-01-21 09:53:06,358 WARN  [org.apache.activemq.artemis.core.server] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@5adf17ff)) AMQ222137: Unable to announce backup, retrying: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219012: Timed out waiting to receive initial broadcast from cluster]
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.executeDiscovery(ServerLocatorImpl.java:767)
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:655)
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:549)
          at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:528)
          at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:267)
          at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
          at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
          at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
      

      Note that the logs show that a cluster is formed but, nevertheless, the Broker doesn't start.
      Complete logs are in the attached MASTER-server.log and SLAVE-server.log.
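
      To compare the attached logs, it can help to count how often each message code appears per log level. The following is a minimal sketch, not a tool used for this report: the function names are hypothetical, the line layout is assumed to match the excerpts above (timestamp, level, category in square brackets, then the message), and the thread name in the sample WARN line is abbreviated.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpts in this issue:
#   2022-01-21 09:53:06,358 WARN  [category] (thread) AMQ222137: message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<category>[^\]]+)\]\s+(?P<rest>.*)$"
)
# Message codes such as AMQ222137 or ISPN000094, followed by a colon.
CODE_RE = re.compile(r"\b((?:AMQ|ISPN)\d{6}):")


def parse_line(line):
    """Parse one log line; return None for stack frames and continuation lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    code = CODE_RE.search(match.group("rest"))  # leftmost code on the line
    return {
        "timestamp": match.group("ts"),
        "level": match.group("level"),
        "category": match.group("category"),
        "code": code.group(1) if code else None,
    }


def count_codes(lines, level=None):
    """Count message codes, optionally restricted to one log level."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and entry["code"] and (level is None or entry["level"] == level):
            counts[entry["code"]] += 1
    return counts


# Lines from the Master excerpt; the WARN thread name is shortened here.
SAMPLE = [
    "2022-01-21 09:52:56,358 INFO  [org.apache.activemq.artemis.core.server] "
    "(ServerService Thread Pool -- 88) AMQ221034: Waiting indefinitely to obtain live lock",
    "2022-01-21 09:53:06,358 WARN  [org.apache.activemq.artemis.core.server] "
    "(Thread-0) AMQ222137: Unable to announce backup, retrying: "
    "ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT "
    "message=AMQ219012: Timed out waiting to receive initial broadcast from cluster]",
    "    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl"
    ".executeDiscovery(ServerLocatorImpl.java:767)",
]

if __name__ == "__main__":
    for code, n in count_codes(SAMPLE, level="WARN").items():
        print(code, n)
```

      To run it against a real file, pass an open file object instead of SAMPLE, for example count_codes(open("MASTER-server.log"), level="WARN").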

      Restarting the EAP instance on both the Master and Slave nodes solves the issue.
      Complete logs are in the attached MASTER-AFTER_RESTART-server.log and SLAVE-AFTER_RESTART-server.log.

      EFS is slower than other storage solutions such as EBS

      Using a Java client external to AWS, we are now able to send messages to and receive messages from the Master node.

      In terms of performance, it takes 30 seconds to send 200 messages and another 34 seconds to receive them:

      Fri Jan 21 13:43:59 CET 2022 - Sending 200 messages ...
      Fri Jan 21 13:44:29 CET 2022 - Sent 200 messages.
      Fri Jan 21 13:44:32 CET 2022 - Receiving messages ...
      Fri Jan 21 13:45:06 CET 2022 - Received 200 messages.
      

      If, instead of EFS, we use the default EC2 instance storage, which is EBS (not multi-attached), it takes 20 seconds to send 200 messages and another 21 seconds to receive them:

      Fri Jan 21 13:57:53 CET 2022 - Sending 200 messages ...
      Fri Jan 21 13:58:13 CET 2022 - Sent 200 messages.
      Fri Jan 21 13:58:16 CET 2022 - Receiving messages ...
      Fri Jan 21 13:58:37 CET 2022 - Received 200 messages.
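
      The two runs above can be put side by side by turning the timestamps into elapsed seconds and messages per second. This is a sketch for the arithmetic only: the names RUNS, summarize and slowdown are made up for illustration, and only the time-of-day part of each timestamp is used because all of them fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S"
MESSAGES = 200

# (start, end) times copied from the client output in this issue.
RUNS = {
    "EFS": {
        "send": ("13:43:59", "13:44:29"),
        "receive": ("13:44:32", "13:45:06"),
    },
    "EBS": {
        "send": ("13:57:53", "13:58:13"),
        "receive": ("13:58:16", "13:58:37"),
    },
}


def elapsed_seconds(start, end):
    """Seconds between two same-day wall-clock times."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


def summarize(runs, messages=MESSAGES):
    """Return elapsed seconds and throughput for every storage/phase pair."""
    summary = {}
    for storage, phases in runs.items():
        summary[storage] = {}
        for phase, (start, end) in phases.items():
            secs = elapsed_seconds(start, end)
            summary[storage][phase] = {
                "seconds": secs,
                "msg_per_sec": messages / secs,
            }
    return summary


def slowdown(summary, phase):
    """How many times longer the EFS run took than the EBS run for one phase."""
    return summary["EFS"][phase]["seconds"] / summary["EBS"][phase]["seconds"]


if __name__ == "__main__":
    result = summarize(RUNS)
    for storage, phases in result.items():
        for phase, stats in phases.items():
            print(f"{storage} {phase}: {stats['seconds']:.0f} s, "
                  f"{stats['msg_per_sec']:.2f} msg/s")
    for phase in ("send", "receive"):
        print(f"EFS/EBS {phase} time ratio: {slowdown(result, phase):.2f}")
```

      With these figures, EFS takes 1.5 times as long as EBS to send and about 1.62 times as long to receive. Each figure comes from a single run of 200 messages.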
      

      questions

      Is it worth fixing the startup issue and providing support for this scenario on AWS?

        1. BACKUP.standalone-ec2-full-ha.xml
          40 kB
          Tommaso Borgato
        2. LIVE.standalone-ec2-full-ha.xml
          40 kB
          Tommaso Borgato

              ehugonne1@redhat.com Emmanuel Hugonnet
              tborgato@redhat.com Tommaso Borgato