JBoss Enterprise Application Platform / JBEAP-26171

(8.0.z) OOM Error after node restart, in a 4-node cluster

      Download and unzip JMeter

      Run the following on, e.g., your laptop or on a node you will use as the client:

      echo '===================================================='
      echo 'prepare jmeter...'
      echo '===================================================='
      wget https://dlcdn.apache.org//jmeter/binaries/apache-jmeter-5.6.2.zip
      unzip -q apache-jmeter-5.6.2.zip
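
      As an optional sanity check (a sketch, assuming the archive unpacked into apache-jmeter-5.6.2), verify the launcher runs before moving on:

      ./apache-jmeter-5.6.2/bin/jmeter --version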
      

      Prepare a 4-node cluster

      You will need another 4 hosts, one for each EAP instance.

      We used 4 VMs with the following properties:

      • RAM: 4GB
      • VCPUs: 2 VCPU
      • Disk: 40GB

      Configure each node as follows:

      echo '===================================================='
      echo 'prepare eap...'
      echo '===================================================='
      export WILDFLY_ZIP=jboss-eap-8.0.0.GA-CR2.2.zip
      export WILDFLY_DIR=jboss-eap-8.0
      
      rm -rdf $WILDFLY_DIR
      unzip -q $WILDFLY_ZIP
      
      1. Create a user for the EJB clients:
        ./jboss-eap-8.0/bin/add-user.sh -u joe -p secret-Passw0rd -a
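
        A quick way to verify the user was created (a sketch, assuming the default standalone layout; add-user.sh -a writes to the application realm):

        # the new user should appear in the application realm properties file
        grep '^joe=' ./jboss-eap-8.0/standalone/configuration/application-users.properties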
        

      Deploy clusterbench on every node of the cluster

      Use the attached [^clusterbench-ee10.ear] (or build it from https://github.com/clusterbench/clusterbench):

      cp clusterbench-ee10.ear ./jboss-eap-8.0/standalone/deployments/
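
      Once the servers are started (next step), the deployment scanner should mark the EAR as deployed; a minimal check, assuming the default deployment-scanner settings:

      # a .deployed marker file appears next to the EAR on successful deployment
      ls -l ./jboss-eap-8.0/standalone/deployments/clusterbench-ee10.ear.deployed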
      

      Start the 4 WF instances

      Node1:

      NODE_IP=$(ip -o route get to 8.8.8.8 | sed -n 's/.*src \([0-9.]\+\).*/\1/p')
      ./jboss-eap-8.0/bin/standalone.sh --server-config=standalone-ha.xml -Djboss.default.multicast.address=230.0.0.190 -b=$NODE_IP -bprivate=$NODE_IP -Djboss.node.name=wildfly1
      

      Node2:

      NODE_IP=$(ip -o route get to 8.8.8.8 | sed -n 's/.*src \([0-9.]\+\).*/\1/p')
      ./jboss-eap-8.0/bin/standalone.sh --server-config=standalone-ha.xml -Djboss.default.multicast.address=230.0.0.190 -b=$NODE_IP -bprivate=$NODE_IP -Djboss.node.name=wildfly2
      

      Node3:

      NODE_IP=$(ip -o route get to 8.8.8.8 | sed -n 's/.*src \([0-9.]\+\).*/\1/p')
      ./jboss-eap-8.0/bin/standalone.sh --server-config=standalone-ha.xml -Djboss.default.multicast.address=230.0.0.190 -b=$NODE_IP -bprivate=$NODE_IP -Djboss.node.name=wildfly3
      

      Node4:

      NODE_IP=$(ip -o route get to 8.8.8.8 | sed -n 's/.*src \([0-9.]\+\).*/\1/p')
      ./jboss-eap-8.0/bin/standalone.sh --server-config=standalone-ha.xml -Djboss.default.multicast.address=230.0.0.190 -b=$NODE_IP -bprivate=$NODE_IP -Djboss.node.name=wildfly4
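
      Before starting the load, it can be useful to confirm that all 4 instances joined the same cluster; a minimal check, assuming the default server.log location (the exact log message wording may vary between versions):

      # run on any node; the latest view should list wildfly1..wildfly4
      grep -i 'new cluster view' ./jboss-eap-8.0/standalone/log/server.log | tail -3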
      

      Start the JMeter client

      Find the IPs of the nodes where you run WF and store them in the following shell variables:

      NODE1_IP=...
      NODE2_IP=...
      NODE3_IP=...
      NODE4_IP=...
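
      One way to populate these from the client machine is to reuse the same src-address lookup used to bind the servers (a sketch; assumes SSH access to the nodes, and user@node1 is a placeholder for your own host):

      NODE1_IP=$(ssh user@node1 "ip -o route get to 8.8.8.8 | sed -n 's/.*src \([0-9.]\+\).*/\1/p'")
      # ...and likewise for NODE2_IP, NODE3_IP, NODE4_IP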
      

      Use the attached clustering-jmeter-samplers-jar-with-dependencies.jar, which contains the Java class that performs the remote EJB invocations (you can also build it yourself from https://github.com/tommaso-borgato/clustering-jmeter-samplers), together with the TestPlanEJB.jmx JMeter test plan:

      ./apache-jmeter-5.6.2/bin/jmeter -n \
      -t TestPlanEJB.jmx \
      -Jjmeter.save.saveservice.output_format=csv \
      -Jjmeter.save.saveservice.default_delimiter="," \
      -Jjmeter.save.saveservice.autoflush=true \
      -l jmeter_results-perf.csv \
      -Jhost=$NODE1_IP,$NODE2_IP,$NODE3_IP,$NODE4_IP \
      -Jport=8080,8080,8080,8080 \
      -Jpath=/clusterbench/session \
      -Jusername=joe -Jpassword=secret-Passw0rd -Jusers=4000 -Jrampup=60 -Jremote.prog=0 \
      -Jjmeter.save.saveservice.timestamp_format='yyyy/M/dd HH:mm:ss' \
      -Lorg.jboss.eapqe.clustering.jmeter=DEBUG \
      -Juser.classpath=clustering-jmeter-samplers-jar-with-dependencies.jar
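
      Since the failure mode is memory growth on the server side, it helps to record the RSS of the EAP java process on node 1 while the test runs; a minimal sketch to run on node 1 (the pgrep pattern is an assumption about the server command line and may need adjusting):

      MYPID=$(pgrep -f org.jboss.as.standalone | head -1)
      # append a timestamped RSS sample (in kB) every 30 seconds until the process dies
      while kill -0 "$MYPID" 2>/dev/null; do
        echo "$(date '+%H:%M:%S') $(ps -o rss= -p "$MYPID")" >> rss-node1.log
        sleep 30
      done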
      

      Restart the first node

      After the 4 nodes are up and the JMeter client has successfully started the 4000 clients, stop (CTRL+C in the terminal) and restart the first node (the coordinator); then wait some time and, eventually, you will observe the error: memory grows to approximately 4073012 kB and then the java process is OOM-killed.

      NOTE: the error does not show up if you run the test locally (4 nodes and the client on the same machine)


      Scenario: we have a 4-node cluster where we deploy a clustered application ([^clusterbench-ee10.ear]) containing a stateful EJB named RemoteStatefulSB, bound under the following JNDI names:

      	java:global/clusterbench-ee10/clusterbench-ee10-ejb/RemoteStatefulSBImpl!org.jboss.test.clusterbench.ejb.stateful.RemoteStatefulSB
      	java:app/clusterbench-ee10-ejb/RemoteStatefulSBImpl!org.jboss.test.clusterbench.ejb.stateful.RemoteStatefulSB
      	java:module/RemoteStatefulSBImpl!org.jboss.test.clusterbench.ejb.stateful.RemoteStatefulSB
      	java:jboss/exported/clusterbench-ee10/clusterbench-ee10-ejb/RemoteStatefulSBImpl!org.jboss.test.clusterbench.ejb.stateful.RemoteStatefulSB
      	ejb:clusterbench-ee10/clusterbench-ee10-ejb/RemoteStatefulSBImpl!org.jboss.test.clusterbench.ejb.stateful.RemoteStatefulSB?stateful
      	java:global/clusterbench-ee10/clusterbench-ee10-ejb/RemoteStatefulSBImpl
      	java:app/clusterbench-ee10-ejb/RemoteStatefulSBImpl
      	java:module/RemoteStatefulSBImpl
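
      On the server side, these bindings should appear in server.log when the EAR deploys; a quick way to confirm (a sketch, assuming the default log location; the exact wording may differ slightly between versions):

      grep -A 10 "JNDI bindings for session bean named 'RemoteStatefulSBImpl'" ./jboss-eap-8.0/standalone/log/server.log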
      

      We have a remote EJB client application performing remote invocations on the cluster nodes, using the remote+http protocol:

      remote+http://10.0.99.245:8080,remote+http://10.0.99.224:8080,remote+http://10.0.99.222:8080,remote+http://10.0.97.56:8080
      

      The client application creates 4000 sessions; invocations are repeated every 4 seconds.

      Everything works fine if we don't restart any node.

      As soon as we shut down and restart the EAP instance on the first node, EAP memory starts to grow and keeps growing (see the attached screenshots) until, eventually, the java process is killed by the OS:

      $ sudo dmesg | tail -7
      [12671.708228] [   7229]   600  7229    88440      188   200704        0             0 gio
      [12671.709661] [   8476]   600  8476    55691      129    81920        0             0 standalone.sh
      [12671.711250] [   8627]   600  8627  1018253   723463  6549504        0             0 java
      [12671.712720] [   9163]     0  9163    54274       18    69632        0             0 sleep
      [12671.714234] [   9186]   600  9186    65958      141   155648        0             0 top
      [12671.715742] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-600.slice/session-4.scope,task=java,pid=8627,uid=600
      [12671.719219] Out of memory: Killed process 8627 (java) total-vm:4073012kB, anon-rss:2893852kB, file-rss:0kB, shmem-rss:0kB, UID:600 pgtables:6396kB oom_score_adj:0
      

      We attached the memory dumps taken with, e.g.:

      jmap -dump:live,format=b,file=$JMAP_FILE $MYPID
      

      before the java process was OOM-killed: jmap-8627-4435.zip
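
      For reference, this is how the $MYPID and $JMAP_FILE variables used above can be set while reproducing (a sketch; the pgrep pattern and file naming are assumptions), plus a lighter-weight way to sample the heap between full dumps:

      MYPID=$(pgrep -f org.jboss.as.standalone | head -1)
      JMAP_FILE=jmap-$MYPID-$(date +%H%M).hprof
      # top classes by live instance count, as a quick alternative to a full dump
      jmap -histo:live $MYPID | head -30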

      Attachments:

        1. clustering-jmeter-samplers-jar-with-dependencies.jar (19.45 MB)
        2. clustering-jmeter-samplers-jar-with-dependencies-eap8.jar (19.43 MB)
        3. clustering-jmeter-samplers-jar-with-dependencies-ee8.jar (20.29 MB)
        4. jmap-19824-4325.png (211 kB)
        5. jmap-19824-4325.zip (129.43 MB)
        6. jmap-19824-4406.png (200 kB)
        7. jmap-19824-4406.zip (142.37 MB)
        8. jmap-8627-4234.png (205 kB)
        9. jmap-8627-4314.png (206 kB)
        10. jmap-8627-4354.png (209 kB)
        11. jmap-8627-4354.zip (131.50 MB)
        12. jmap-8627-4435.png (195 kB)
        13. jmap-8627-4435.zip (144.28 MB)
        14. jmeterBugTriggerFoundOnClientDiscovery.log.zip (28.82 MB)
        15. log4j2.xml (5 kB)
        16. logs.zip (101 kB)
        17. oom-after-node1-restart.png (173 kB)
        18. OOM-README.md (11 kB)
        19. TestPlanEJB.jmx (4 kB)

              Flavia Rainone (flaviarnn)
              Tommaso Borgato (tborgato@redhat.com)