  1. JBoss Enterprise Application Platform
  2. JBEAP-3675

(7.0.z) Redistribution loses large messages when server with HA is restarted

    Details

    • Target Release:
    • Steps to Reproduce:
      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      git checkout refactoring_modules
      groovy -DEAP_VERSION=7.0.0.ER6 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      export JOURNAL_DIRECTORY_A=$WORKSPACE/journal-A
      export JOURNAL_DIRECTORY_B=$WORKSPACE/journal-B
      export JOURNAL_DIRECTORY_C=$WORKSPACE/journal-C
      export JOURNAL_DIRECTORY_D=$WORKSPACE/journal-D
      
      cd ../jboss-hornetq-testsuite/
      
      mvn clean test -Dtest=ReplicatedColocatedClusterFailoverTestCase#testFailbackWithMdbsShutdown -DfailIfNoTests=false -Deap=7x | tee log
      
    • Affects:
      Release Notes
    • Release Notes Docs Status:
      Documented as Known Issue
    • Sprint:
      EAP 7.0.2

      Description

      Scenario: We have two nodes in a (manually created) colocated replicated topology. Both nodes contain InQueue and OutQueue.

      1. We send 2000 messages (a mix of large and normal) to InQueue on node 1.
      2. On each node we deploy an MDB which resends messages from InQueue to OutQueue.
      3. While the messages are being resent, we cleanly shut down node 2 and, after some time, start it again.
      4. We receive messages from OutQueue on node 1 and check that the number of received messages equals the number of sent messages.
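
      The check in step 4 can be sketched as a set difference between sent and received message IDs. This is only an illustration and is not taken from the test suite: the function name `find_lost` and the `msg-N` identifiers are hypothetical, and a real check would use whatever unique identifier the test attaches to each message (for example the `_AMQ_DUPL_ID` property shown in [1]).

```python
def find_lost(sent_ids, received_ids):
    """Return the IDs that were sent but never received, in sending order."""
    received = set(received_ids)
    return [msg_id for msg_id in sent_ids if msg_id not in received]


# Hypothetical data: 2000 sent messages, two of which never arrive.
sent = [f"msg-{n}" for n in range(2000)]
missing = {"msg-17", "msg-42"}
received = [m for m in sent if m not in missing]

lost = find_lost(sent, received)
print(f"sent={len(sent)} received={len(received)} lost={lost}")
```

      Comparing identifiers rather than only counts also shows which messages are missing, which helps when matching them against the messages stuck in the store-and-forward queue.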

      Expectation: all messages are resent.

      Actual state: some messages are not resent and are lost.

      Customer impact: large messages might get lost in a colocated HA topology with a replicated journal if one of the servers is cleanly shut down.

      As you can see in [1] and [2], the lost messages are stuck in the sf.my-cluster queue of node 2 and the corresponding large message files have zero length. The bodies of the lost messages are in largemessages1, see [3].

      Race condition which causes the loss of messages

      1. Node 2 decides to redistribute message-1 to node 1.
      2. It creates a copy of message-1 with a new messageID (message-2), and message-1 is considered delivered.
      3. In the meantime node 2 is shutting down, so the redistribution of message-2 to node 1 fails.
      4. After that, the backup on node 1 becomes live and continues redistributing message-2 to the live server on node 1.
      5. The backup knows about message-2 but does not have the body of this message, so it sends only the header packet and waits for an acknowledgement from the live server. The live server receives the header packet and waits for the chunk packets. Both servers wait for each other.
      6. Node 2 is started again. The live server on node 2 synchronizes with the backup on node 1 and thus receives message-2 with a body of zero length.
      7. Again, node 2 sends only the header packet and waits for an acknowledgement, while node 1 receives the header packet and waits for the chunks.
      8. Message-2 is stuck in the sf.my-cluster queue and its body is lost.
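
      The mutual wait in steps 5 and 7 can be illustrated with a small model. This is a deliberately simplified sketch and not the broker's actual code: the `LargeMessage` class, the `transfer` function, and the chunk size are assumptions made for illustration. The only behaviour taken from the description above is that the sender emits a header followed by whatever body bytes it has, and the receiver acknowledges only after the body it was promised has arrived.

```python
from dataclasses import dataclass

CHUNK_SIZE = 100 * 1024  # assumed chunk size, for illustration only


@dataclass
class LargeMessage:
    message_id: int
    declared_size: int  # size announced in the header (cf. _AMQ_LARGE_SIZE)
    body: bytes         # bytes actually present in the sender's .msg file


def transfer(msg, chunk_size=CHUNK_SIZE):
    """Model one header/chunk exchange and return 'delivered' or 'stuck'."""
    # The sender always emits the header, then one packet per body chunk.
    chunks = [msg.body[off:off + chunk_size]
              for off in range(0, len(msg.body), chunk_size)]
    # The receiver acknowledges only once the declared number of bytes arrived.
    received = sum(len(chunk) for chunk in chunks)
    if received == msg.declared_size:
        return "delivered"
    # No further chunks will ever come and no acknowledgement will be sent:
    # the sender waits for the ack, the receiver waits for chunks.
    return "stuck"


# Message 1162 from listing [1]: once with its body, once with an empty file.
intact = LargeMessage(1162, 409605, b"x" * 409605)
truncated = LargeMessage(1162, 409605, b"")
print(transfer(intact), transfer(truncated))
```

      In this model the truncated message can never leave the "stuck" state, no matter how often the transfer is retried, which matches the observation that the message stays in the sf.my-cluster queue across restarts.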

      [1]

      [standalone@localhost:9990 runtime-queue=sf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b] :list-messages
      {
          "outcome" => "success",
          "result" => [
              {
                  "address" => "jms.queue.InQueue",
                  "color" => "GREEN",
                  "count" => 136,
                  "messageID" => 1162,
                  "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      20
                  ],
                  "counter" => 137,
                  "type" => 3,
                  "priority" => 4,
                  "userID" => "ID:a5b8a3ea-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "durable" => true,
                  "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "expiration" => 0,
                  "_AMQ_DUPL_ID" => "d56d32d3-9678-498a-8376-da1658497cc91457085939341",
                  "timestamp" => 1457085939341L,
                  "_AMQ_LARGE_SIZE" => 409605
              },
              {
                  "address" => "jms.queue.InQueue",
                  "color" => "RED",
                  "count" => 139,
                  "messageID" => 1169,
                  "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      20
                  ],
                  "counter" => 140,
                  "type" => 6,
                  "priority" => 4,
                  "userID" => "ID:a5da83cd-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "durable" => true,
                  "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "expiration" => 0,
                  "_AMQ_DUPL_ID" => "2d2f958d-bb0e-4e2e-9c4b-413cbb4550fc1457085939563",
                  "timestamp" => 1457085939563L,
                  "_AMQ_LARGE_SIZE" => 409615
              },
              {
                  "address" => "jms.queue.InQueue",
                  "color" => "RED",
                  "count" => 137,
                  "messageID" => 1184,
                  "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      20
                  ],
                  "counter" => 138,
                  "type" => 2,
                  "priority" => 4,
                  "userID" => "ID:a5d839db-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "durable" => true,
                  "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "expiration" => 0,
                  "_AMQ_DUPL_ID" => "f58aae80-f118-46c6-a19a-c8e90fec7bc51457085939548",
                  "timestamp" => 1457085939548L,
                  "_AMQ_LARGE_SIZE" => 409617
              },
              {
                  "address" => "jms.queue.InQueue",
                  "color" => "GREEN",
                  "count" => 138,
                  "messageID" => 1189,
                  "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      20
                  ],
                  "counter" => 139,
                  "type" => 5,
                  "priority" => 4,
                  "userID" => "ID:a5d9725c-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "durable" => true,
                  "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "expiration" => 0,
                  "_AMQ_DUPL_ID" => "8dbb9b33-f654-4556-885d-9201c443f2821457085939556",
                  "timestamp" => 1457085939556L,
                  "_AMQ_LARGE_SIZE" => 413163
              },
              {
                  "address" => "jms.queue.InQueue",
                  "color" => "RED",
                  "count" => 145,
                  "messageID" => 1192,
                  "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      20
                  ],
                  "counter" => 146,
                  "type" => 4,
                  "priority" => 4,
                  "userID" => "ID:a5dc5893-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "durable" => true,
                  "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "expiration" => 0,
                  "_AMQ_DUPL_ID" => "763ec86d-010e-455a-95bd-58ca9dd7bf7f1457085939575",
                  "timestamp" => 1457085939575L,
                  "_AMQ_LARGE_SIZE" => 204800
              },
              {
                  "address" => "jms.queue.InQueue",
                  "color" => "GREEN",
                  "count" => 146,
                  "messageID" => 1195,
                  "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      20
                  ],
                  "counter" => 147,
                  "type" => 3,
                  "priority" => 4,
                  "userID" => "ID:a5fba064-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "durable" => true,
                  "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "expiration" => 0,
                  "_AMQ_DUPL_ID" => "2587ff38-91c2-4192-846c-8d796ffd84bb1457085939780",
                  "timestamp" => 1457085939780L,
                  "_AMQ_LARGE_SIZE" => 409605
              },
              {
                  "address" => "jms.queue.InQueue",
                  "color" => "RED",
                  "count" => 147,
                  "messageID" => 1228,
                  "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      0,
                      20
                  ],
                  "counter" => 148,
                  "type" => 2,
                  "priority" => 4,
                  "userID" => "ID:a61b3655-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "durable" => true,
                  "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b",
                  "expiration" => 0,
                  "_AMQ_DUPL_ID" => "b84bebbb-94b8-426a-a848-05a0a078acec1457085939987",
                  "timestamp" => 1457085939987L,
                  "_AMQ_LARGE_SIZE" => 409617
              }
          ]
      }
      

      [2]

      ls -l largemessages
      total 0
      -rw-rw-r--. 1 eduda eduda 0 mar  4 11:07 1162.msg
      -rw-rw-r--. 1 eduda eduda 0 mar  4 11:07 1169.msg
      -rw-rw-r--. 1 eduda eduda 0 mar  4 11:07 1184.msg
      -rw-rw-r--. 1 eduda eduda 0 mar  4 11:07 1189.msg
      -rw-rw-r--. 1 eduda eduda 0 mar  4 11:07 1192.msg
      -rw-rw-r--. 1 eduda eduda 0 mar  4 11:07 1195.msg
      -rw-rw-r--. 1 eduda eduda 0 mar  4 11:07 1228.msg
      

      [3]

      ls -l largemessages1
      total 2624
      -rw-rw-r--. 1 eduda eduda 409605 mar  4 11:07 1162.msg
      -rw-rw-r--. 1 eduda eduda 409615 mar  4 11:07 1169.msg
      -rw-rw-r--. 1 eduda eduda 409617 mar  4 11:07 1184.msg
      -rw-rw-r--. 1 eduda eduda 413163 mar  4 11:07 1189.msg
      -rw-rw-r--. 1 eduda eduda 204800 mar  4 11:07 1192.msg
      -rw-rw-r--. 1 eduda eduda 409605 mar  4 11:07 1195.msg
      -rw-rw-r--. 1 eduda eduda 409617 mar  4 11:07 1228.msg
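
      The comparison between [2] and [3] can be automated with a short script. This is a minimal sketch, assuming two large-message directories laid out as in the listings above (files named `<messageID>.msg`); the function name `find_truncated` is hypothetical.

```python
from pathlib import Path


def find_truncated(live_dir, replica_dir):
    """List .msg files that are empty in live_dir but non-empty in replica_dir.

    Returns a sorted list of (file name, size in replica_dir) tuples.
    """
    truncated = []
    for msg_file in sorted(Path(live_dir).glob("*.msg")):
        counterpart = Path(replica_dir) / msg_file.name
        if (msg_file.stat().st_size == 0
                and counterpart.is_file()
                and counterpart.stat().st_size > 0):
            truncated.append((msg_file.name, counterpart.stat().st_size))
    return truncated


# Example call, assuming the directory names used in [2] and [3]:
# for name, size in find_truncated("largemessages", "largemessages1"):
#     print(f"{name}: body lost, {size} bytes still present in the other directory")
```

      For the listings above, such a check would flag all seven files, since each is empty in largemessages and non-empty in largemessages1.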
      


          Attachments

          1. server1.log.7z
            9.17 MB
          2. server2.log.7z
            1.42 MB
          3. test-suite.log.zip
            1.22 MB


                People

                • Assignee:
                  martyn-taylor Martyn Taylor
                  Reporter:
                  eduda Erich Duda
                • Votes:
                  0
                  Watchers:
                  11

                  Dates

                  • Created:
                    Updated:
                    Resolved: