Bug
Resolution: Unresolved
Blocker
None
Workaround Exists
Scenario:
This scenario is inspired by High Availability - Shared Store and is an attempt to replicate that setup on AWS using AWS EFS as storage:
- we have 2 EC2 instances created from the Red Hat AMI (RHEL-7-JBEAP-7.4.0_HVM_GA-20210909-x86_64-0-Access2-GP2); both EC2 instance types must support multi attach (e.g. t3.medium)
- the first instance is configured as Live node (LIVE.standalone-ec2-full-ha.xml)
- the second instance is configured as Backup node (BACKUP.standalone-ec2-full-ha.xml)
- both instances use shared storage on an external AWS EFS file system which is mounted on both EC2 instances using the NFS4 protocol; note this is possible since both EC2 instance types support multi attach
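Before starting either node it can help to confirm that the shared store directory really sits on the NFS mount on both instances. The following is a minimal sketch, not part of the original setup: the class name and the directory argument are hypothetical, and it assumes a Linux JDK, where `FileStore.type()` typically reports the mount type (for example `nfs4`); the exact string is implementation specific.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper: reports the file system type backing a directory.
final class SharedStoreCheck {

    // Returns the file system type string reported by the JDK for the given directory.
    static String fileSystemType(Path dir) throws IOException {
        FileStore store = Files.getFileStore(dir);
        return store.type();
    }

    // Assumption: NFS mounts are reported with a type starting with "nfs" (e.g. "nfs4").
    static boolean looksLikeNfs(String type) {
        return type != null && type.toLowerCase(Locale.ROOT).startsWith("nfs");
    }

    public static void main(String[] args) throws IOException {
        // Pass the shared store directory as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        String type = fileSystemType(dir);
        System.out.println(dir.toAbsolutePath() + " -> " + type
                + (looksLikeNfs(type) ? " (NFS)" : " (not NFS)"));
    }
}
```

Running it on both the Live and the Backup instance against the same directory should report an NFS type on each if the EFS mount is in place.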
This scenario presents two main flaws:
- startup fails
- EFS is slower than other storage solutions such as EBS
startup fails
The slave is started at 09:50:20 and the master at 09:52:33, so the slave is started a little over 2 minutes (2 min 13 s) before the master node;
Note that the same error occurs if the start sequence is reversed;
When the Live/Backup pair is first started, the nodes log the following errors and it is not possible to send messages to or receive messages from the Master node:
Slave:
2022-01-21 09:52:54,398 INFO [org.infinispan.CLUSTER] (thread-7,ejb,ip-172-31-22-146) ISPN000094: Received new cluster view for channel ejb: [ip-172-31-22-146|1] (2) [ip-172-31-22-146, ip-172-31-18-82]
2022-01-21 09:52:54,399 INFO [org.infinispan.CLUSTER] (thread-7,ejb,ip-172-31-22-146) ISPN100000: Node ip-172-31-18-82 joined the cluster
2022-01-21 09:52:54,544 WARN [org.apache.activemq.artemis.core.server] (Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@3faf32ff)) AMQ222137: Unable to announce backup, retrying: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219012: Timed out waiting to receive initial broadcast from cluster]
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.executeDiscovery(ServerLocatorImpl.java:767)
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:655)
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:549)
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:528)
    at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:267)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Master:
2022-01-21 09:52:56,287 INFO [org.infinispan.CLUSTER] (ServerService Thread Pool -- 87) ISPN000094: Received new cluster view for channel ejb: [ip-172-31-22-146|1] (2) [ip-172-31-22-146, ip-172-31-18-82]
2022-01-21 09:52:56,293 INFO [org.infinispan.CLUSTER] (ServerService Thread Pool -- 84) ISPN000079: Channel ejb local address is ip-172-31-18-82, physical addresses are [172.31.18.82:7600]
2022-01-21 09:52:56,297 INFO [org.infinispan.CLUSTER] (ServerService Thread Pool -- 85) ISPN000079: Channel ejb local address is ip-172-31-18-82, physical addresses are [172.31.18.82:7600]
2022-01-21 09:52:56,307 INFO [org.infinispan.CLUSTER] (ServerService Thread Pool -- 87) ISPN000079: Channel ejb local address is ip-172-31-18-82, physical addresses are [172.31.18.82:7600]
2022-01-21 09:52:56,358 INFO [org.apache.activemq.artemis.core.server] (ServerService Thread Pool -- 88) AMQ221034: Waiting indefinitely to obtain live lock
2022-01-21 09:53:06,358 WARN [org.apache.activemq.artemis.core.server] (Thread-0 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@5adf17ff)) AMQ222137: Unable to announce backup, retrying: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219012: Timed out waiting to receive initial broadcast from cluster]
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.executeDiscovery(ServerLocatorImpl.java:767)
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:655)
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:549)
    at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.connect(ServerLocatorImpl.java:528)
    at org.apache.activemq.artemis.core.server.cluster.BackupManager$BackupConnector$1.run(BackupManager.java:267)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Note that the logs show that a cluster is formed (both nodes appear in the cluster view for channel ejb) but the Broker nevertheless doesn't start;
Complete logs in attached MASTER-server.log and SLAVE-server.log;
Restarting the EAP instance on Master and Slave nodes solves the issue;
Complete logs in attached MASTER-AFTER_RESTART-server.log and SLAVE-AFTER_RESTART-server.log;
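To compare the attached logs before and after the restart, the ActiveMQ Artemis message codes (such as AMQ222137 and AMQ219012 above) can be counted per file. This is only a sketch: the class name is hypothetical and it assumes the codes always have the form `AMQ` followed by six digits, as in the excerpts above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts occurrences of AMQ message codes in server.log lines.
final class AmqCodeCounter {

    // Assumption: codes look like "AMQ" followed by exactly six digits.
    private static final Pattern AMQ_CODE = Pattern.compile("\\bAMQ\\d{6}\\b");

    // Returns a sorted map from message code to number of occurrences.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = AMQ_CODE.matcher(line);
            while (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AmqCodeCounter MASTER-server.log
        if (args.length == 0) {
            System.err.println("usage: AmqCodeCounter <server.log>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        count(lines).forEach((code, n) -> System.out.println(code + " " + n));
    }
}
```

Running it against MASTER-server.log and MASTER-AFTER_RESTART-server.log gives a quick view of which warnings disappear after the restart.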
EFS is slower than other storage solutions such as EBS
Using a Java client external to AWS we are now able to send/receive messages to/from the Master node;
Looking at performance, it takes 30 seconds to send 200 messages and another 34 seconds to receive 200 messages:
Fri Jan 21 13:43:59 CET 2022 - Sending 200 messages ...
Fri Jan 21 13:44:29 CET 2022 - Sent 200 messages.
Fri Jan 21 13:44:32 CET 2022 - Receiving messages ...
Fri Jan 21 13:45:06 CET 2022 - Received 200 messages.
If, instead of EFS, we use the default EC2 instance storage (not multi attached) which is EBS, it takes 20 seconds to send 200 messages and another 21 seconds to receive 200 messages:
Fri Jan 21 13:57:53 CET 2022 - Sending 200 messages ...
Fri Jan 21 13:58:13 CET 2022 - Sent 200 messages.
Fri Jan 21 13:58:16 CET 2022 - Receiving messages ...
Fri Jan 21 13:58:37 CET 2022 - Received 200 messages.
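The two runs above can be turned into throughput figures for an easier comparison. The sketch below only redoes the arithmetic from the timestamps in the client output; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper: converts the client timestamps above into messages per second.
final class ThroughputComparison {

    // start and end are wall-clock times in HH:mm:ss format taken from the client output.
    static double messagesPerSecond(int messages, String start, String end) {
        long seconds = Duration.between(LocalTime.parse(start), LocalTime.parse(end)).getSeconds();
        return (double) messages / seconds;
    }

    public static void main(String[] args) {
        double efsSend = messagesPerSecond(200, "13:43:59", "13:44:29");
        double efsReceive = messagesPerSecond(200, "13:44:32", "13:45:06");
        double ebsSend = messagesPerSecond(200, "13:57:53", "13:58:13");
        double ebsReceive = messagesPerSecond(200, "13:58:16", "13:58:37");

        System.out.printf("EFS send: %.2f msg/s, receive: %.2f msg/s%n", efsSend, efsReceive);
        System.out.printf("EBS send: %.2f msg/s, receive: %.2f msg/s%n", ebsSend, ebsReceive);
        System.out.printf("EBS/EFS ratio, send: %.2f, receive: %.2f%n",
                ebsSend / efsSend, ebsReceive / efsReceive);
    }
}
```

With the timings reported above this works out to roughly 6.7 msg/s (send) and 5.9 msg/s (receive) on EFS versus 10 msg/s and 9.5 msg/s on EBS, i.e. EBS is about 1.5 to 1.6 times faster in this single test run.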
questions
Is it worth fixing the startup issue and providing support for this scenario on AWS?