- Bug
- Resolution: Done
- Critical
- 7.0.0.ER6
- Release Notes
- Documented as Known Issue
Scenario: Two nodes run in a (manually created) colocated replicated topology. Both nodes contain InQueue and OutQueue.
- We send 2000 messages (a mix of large and normal messages) to InQueue on node 1.
- On each node we deploy an MDB which resends messages from InQueue to OutQueue.
- While the messages are being resent, we cleanly shut down node 2 and, after some time, start it again.
- We receive messages from OutQueue on node 1 and check whether the number of received messages equals the number of sent messages.
Expectation: All messages are resent.
Actual state: Some messages are not resent and are lost.
Customer impact: Large messages might get lost in a colocated HA topology with a replicated journal if one of the servers is cleanly shut down.
As shown in [1] and [2], the lost messages are stuck in the sf.my-cluster queue of node 2 and the corresponding large message files have zero length. The bodies of the lost messages are in largemessages1, see [3].
Race condition which causes the loss of messages:
- Node 2 decides to redistribute message-1 to node 1.
- It creates a copy of message-1 with a new messageID (message-2), and message-1 is considered delivered.
- In the meantime node 2 is shutting down, so the redistribution of message-2 to node 1 fails.
- After that, the backup on node 1 becomes active and continues redistributing message-2 to the live server on node 1.
- The backup knows about message-2 but does not have the body of this message. It sends only the header packet and waits for an acknowledgement from the live server. The live server receives the header packet and waits for chunk packets. Both servers wait for each other.
- Node 2 is started again. The live server on node 2 synchronizes with the backup on node 1 and thus receives message-2 with a body of zero length.
- Again, node 2 sends only the header packet and waits for an acknowledgement, while node 1 receives the header packet and waits for chunks.
- Message-2 is stuck in the sf.my-cluster queue and its body is lost.
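The mutual wait described in the steps above can be illustrated with a toy model. This is only a sketch of the reasoning, not Artemis code: the names `LargeMessage`, `packets_to_send`, `receiver_state` and `redistribute`, the chunk size, and the rule that the acknowledgement is sent only once all declared bytes have arrived are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LargeMessage:
    message_id: int
    declared_size: int  # size announced in the header (cf. _AMQ_LARGE_SIZE)
    body: bytes         # bytes actually available to the sending server


def packets_to_send(msg, chunk_size=102400):
    """Header packet first, then one packet per body chunk.

    chunk_size is an arbitrary value chosen for this sketch.
    """
    packets = [("header", msg.declared_size)]
    for offset in range(0, len(msg.body), chunk_size):
        packets.append(("chunk", msg.body[offset:offset + chunk_size]))
    return packets


def receiver_state(packets):
    """The receiver is complete only when all declared bytes have arrived."""
    expected = None
    received = 0
    for kind, payload in packets:
        if kind == "header":
            expected = payload
        else:
            received += len(payload)
    if expected is not None and received >= expected:
        return "complete"
    return "waiting for chunks"


def redistribute(msg):
    """Return (sender state, receiver state) after the sender has sent everything it has."""
    receiver = receiver_state(packets_to_send(msg))
    # Assumption: the receiver acknowledges only a complete message.
    sender = "done" if receiver == "complete" else "waiting for acknowledgement"
    return sender, receiver
```

With a full body, `redistribute` ends in `("done", "complete")`. With a zero-length body and an unchanged declared size, as for message-2 above, only the header is sent and both sides stay in their waiting state.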
[1]
[standalone@localhost:9990 runtime-queue=sf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b] :list-messages
{
    "outcome" => "success",
    "result" => [
        { "address" => "jms.queue.InQueue", "color" => "GREEN", "count" => 136, "messageID" => 1162, "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [ 0, 0, 0, 0, 0, 0, 0, 20 ], "counter" => 137, "type" => 3, "priority" => 4, "userID" => "ID:a5b8a3ea-e1f0-11e5-a3fa-7f78e6f9d09b", "durable" => true, "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b", "expiration" => 0, "_AMQ_DUPL_ID" => "d56d32d3-9678-498a-8376-da1658497cc91457085939341", "timestamp" => 1457085939341L, "_AMQ_LARGE_SIZE" => 409605 },
        { "address" => "jms.queue.InQueue", "color" => "RED", "count" => 139, "messageID" => 1169, "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [ 0, 0, 0, 0, 0, 0, 0, 20 ], "counter" => 140, "type" => 6, "priority" => 4, "userID" => "ID:a5da83cd-e1f0-11e5-a3fa-7f78e6f9d09b", "durable" => true, "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b", "expiration" => 0, "_AMQ_DUPL_ID" => "2d2f958d-bb0e-4e2e-9c4b-413cbb4550fc1457085939563", "timestamp" => 1457085939563L, "_AMQ_LARGE_SIZE" => 409615 },
        { "address" => "jms.queue.InQueue", "color" => "RED", "count" => 137, "messageID" => 1184, "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [ 0, 0, 0, 0, 0, 0, 0, 20 ], "counter" => 138, "type" => 2, "priority" => 4, "userID" => "ID:a5d839db-e1f0-11e5-a3fa-7f78e6f9d09b", "durable" => true, "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b", "expiration" => 0, "_AMQ_DUPL_ID" => "f58aae80-f118-46c6-a19a-c8e90fec7bc51457085939548", "timestamp" => 1457085939548L, "_AMQ_LARGE_SIZE" => 409617 },
        { "address" => "jms.queue.InQueue", "color" => "GREEN", "count" => 138, "messageID" => 1189, "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [ 0, 0, 0, 0, 0, 0, 0, 20 ], "counter" => 139, "type" => 5, "priority" => 4, "userID" => "ID:a5d9725c-e1f0-11e5-a3fa-7f78e6f9d09b", "durable" => true, "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b", "expiration" => 0, "_AMQ_DUPL_ID" => "8dbb9b33-f654-4556-885d-9201c443f2821457085939556", "timestamp" => 1457085939556L, "_AMQ_LARGE_SIZE" => 413163 },
        { "address" => "jms.queue.InQueue", "color" => "RED", "count" => 145, "messageID" => 1192, "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [ 0, 0, 0, 0, 0, 0, 0, 20 ], "counter" => 146, "type" => 4, "priority" => 4, "userID" => "ID:a5dc5893-e1f0-11e5-a3fa-7f78e6f9d09b", "durable" => true, "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b", "expiration" => 0, "_AMQ_DUPL_ID" => "763ec86d-010e-455a-95bd-58ca9dd7bf7f1457085939575", "timestamp" => 1457085939575L, "_AMQ_LARGE_SIZE" => 204800 },
        { "address" => "jms.queue.InQueue", "color" => "GREEN", "count" => 146, "messageID" => 1195, "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [ 0, 0, 0, 0, 0, 0, 0, 20 ], "counter" => 147, "type" => 3, "priority" => 4, "userID" => "ID:a5fba064-e1f0-11e5-a3fa-7f78e6f9d09b", "durable" => true, "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b", "expiration" => 0, "_AMQ_DUPL_ID" => "2587ff38-91c2-4192-846c-8d796ffd84bb1457085939780", "timestamp" => 1457085939780L, "_AMQ_LARGE_SIZE" => 409605 },
        { "address" => "jms.queue.InQueue", "color" => "RED", "count" => 147, "messageID" => 1228, "_AMQ_ROUTE_TOsf.my-cluster.6ef15b5a-e1f0-11e5-b678-65948414801b" => [ 0, 0, 0, 0, 0, 0, 0, 20 ], "counter" => 148, "type" => 2, "priority" => 4, "userID" => "ID:a61b3655-e1f0-11e5-a3fa-7f78e6f9d09b", "durable" => true, "__AMQ_CID" => "a165c4ff-e1f0-11e5-a3fa-7f78e6f9d09b", "expiration" => 0, "_AMQ_DUPL_ID" => "b84bebbb-94b8-426a-a848-05a0a078acec1457085939987", "timestamp" => 1457085939987L, "_AMQ_LARGE_SIZE" => 409617 }
    ]
}
[2]
ls -l largemessages
total 0
-rw-rw-r--. 1 eduda eduda 0 mar 4 11:07 1162.msg
-rw-rw-r--. 1 eduda eduda 0 mar 4 11:07 1169.msg
-rw-rw-r--. 1 eduda eduda 0 mar 4 11:07 1184.msg
-rw-rw-r--. 1 eduda eduda 0 mar 4 11:07 1189.msg
-rw-rw-r--. 1 eduda eduda 0 mar 4 11:07 1192.msg
-rw-rw-r--. 1 eduda eduda 0 mar 4 11:07 1195.msg
-rw-rw-r--. 1 eduda eduda 0 mar 4 11:07 1228.msg
[3]
ls -l largemessages1
total 2624
-rw-rw-r--. 1 eduda eduda 409605 mar 4 11:07 1162.msg
-rw-rw-r--. 1 eduda eduda 409615 mar 4 11:07 1169.msg
-rw-rw-r--. 1 eduda eduda 409617 mar 4 11:07 1184.msg
-rw-rw-r--. 1 eduda eduda 413163 mar 4 11:07 1189.msg
-rw-rw-r--. 1 eduda eduda 204800 mar 4 11:07 1192.msg
-rw-rw-r--. 1 eduda eduda 409605 mar 4 11:07 1195.msg
-rw-rw-r--. 1 eduda eduda 409617 mar 4 11:07 1228.msg
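The manual comparison of [2] and [3] could be automated with a small helper like the one below. It is a minimal sketch, assuming both large message directories are reachable on the local file system; the function name `find_empty_bodies` is hypothetical and not part of any server tooling.

```python
from pathlib import Path


def find_empty_bodies(journal_dir, other_dir):
    """Return (file name, size in other_dir) for every zero-length *.msg file
    in journal_dir that has a non-empty file of the same name in other_dir."""
    hits = []
    for path in sorted(Path(journal_dir).glob("*.msg")):
        if path.stat().st_size != 0:
            continue  # this body is present, nothing to report
        twin = Path(other_dir) / path.name
        if twin.is_file() and twin.stat().st_size > 0:
            hits.append((path.name, twin.stat().st_size))
    return hits
```

Called as `find_empty_bodies("largemessages", "largemessages1")` from the directory that contains both folders, it would list the files whose body is missing in the first directory but still present in the second.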
- is blocked by: WFLY-6846 Redistribution loses large messages when server with HA is restarted (Closed)
- is cloned by: JBEAP-3675 (7.0.z) Redistribution loses large messages when server with HA is restarted (Verified)
- is incorporated by: JBEAP-5256 (7.1.0) Upgrade Artemis from 1.1.0.SP17 to 1.1.0.SP18 (Verified)