AMQ Broker / ENTMQBR-7771

[LTS] large-messages/ folder starts to collect duplicate messages after a broker failover, which are never removed


    • Linked issue: ARTEMIS-4193

      This problem can be reproduced using a two-node broker mesh, where both nodes run on the same host. The producer and consumer applications can run on the same host. Although originally reported on OpenShift, the problem reproduces easily on a single, local machine.

      The producer and consumer applications use Camel. Despite a great deal of effort, I have not been able to reproduce the problem using simple, JMS-only applications. Similarly, although the sequence of broker failover steps appears to be redundant, I have been unable to reproduce the problem without this failover, or in any circumstances where there is only a single broker.

      1. Install AMQ 7.10.2 and apply patch 3238 (which is supposed to fix the 'remote did not respond to a drain request' message).

      2. Create a two-node mesh, with all other settings default. I will call the two nodes 'node1' and 'node2'. (A configuration sketch follows these steps.)

      3. Optionally, as a sanity check, ensure that a message produced to node1's AMQP port can be collected from node2's AMQP port – that is, that the network connectors are operational. (A minimal client sketch follows these steps.)

      4. Unpack `amqp-pubsub-local-20230215.zip`. This gives a producer application and a consumer application. Both are Spring Boot apps.

      5. In each of the producer and consumer apps, edit `src/main/resources/application.properties` to set the correct ports in the failover URL for the brokers. The brokers must be listed in the same order in both applications, because only one broker will be running at a time. I am listing the AMQP ports in the order 'node1, node2'. (A sketch of the relevant property follows these steps.)

      6. Ensure node1 and node2 brokers are both running, although the producer and consumer will initially connect to only one broker.

      7. Start the producer and consumer apps using `mvn compile spring-boot:run`.

      8. When the apps show that a few hundred messages have been produced and consumed, stop node1.

      9. Producer and consumer will switch over to node2, and continue.

      10. After another few hundred messages have been produced and consumed, start node1, then stop node2. The producer and consumer will switch over to node1.

      Now node1 is running, and node2 is not.

      11. Allow the producer to run until all 10 000 messages have been sent. At this point, the consumer will probably not have received all 10 000, because some will be stuck on node2, which is down.

      12. Start node2. Wait until the consumer app reports that it has received 10 000 messages.

      At this point, `artemis queue stat` shows no messages on any queue of either broker. However, I find at least a few files in the `large-messages/` folder on both brokers. These are never removed.

      Note: it probably isn't necessary to use 10 000 messages – I see the problem if I only send 2 000.
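
      For reference, the two-node mesh in step 2 amounts to something like the following cluster-connection stanza in each broker's `broker.xml`. This is only a sketch of a default clustered configuration – the connector names, ports, and load-balancing setting are placeholders, not my exact values:

      ```xml
      <!-- Sketch of node1's broker.xml (inside <core>); names and ports are placeholders -->
      <connectors>
         <connector name="node1">tcp://localhost:61616</connector>
         <connector name="node2">tcp://localhost:61617</connector>
      </connectors>
      <cluster-connections>
         <cluster-connection name="my-cluster">
            <connector-ref>node1</connector-ref>
            <message-load-balancing>ON_DEMAND</message-load-balancing>
            <static-connectors>
               <connector-ref>node2</connector-ref>
            </static-connectors>
         </cluster-connection>
      </cluster-connections>
      ```

      node2's `broker.xml` mirrors this, with the connector references swapped.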
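
      The sanity check in step 3 can be done with any AMQP client. A minimal sketch using the Qpid JMS client – the queue name and AMQP ports are placeholders, and depend on how the brokers were created:

      ```java
      import javax.jms.*;
      import org.apache.qpid.jms.JmsConnectionFactory;

      // Produce one message to node1's AMQP port, then consume it from node2's.
      // If the message arrives, redistribution across the mesh is working.
      public class MeshSanityCheck {
          public static void main(String[] args) throws Exception {
              send("amqp://localhost:5672");     // node1 AMQP port (placeholder)
              receive("amqp://localhost:5772");  // node2 AMQP port (placeholder)
          }

          static void send(String url) throws JMSException {
              Connection c = new JmsConnectionFactory(url).createConnection();
              try {
                  Session s = c.createSession(false, Session.AUTO_ACKNOWLEDGE);
                  s.createProducer(s.createQueue("sanity.test")).send(s.createTextMessage("ping"));
              } finally {
                  c.close();
              }
          }

          static void receive(String url) throws JMSException {
              Connection c = new JmsConnectionFactory(url).createConnection();
              try {
                  c.start();
                  Session s = c.createSession(false, Session.AUTO_ACKNOWLEDGE);
                  Message m = s.createConsumer(s.createQueue("sanity.test")).receive(10_000);
                  System.out.println(m != null ? "mesh OK" : "no message – check cluster config");
              } finally {
                  c.close();
              }
          }
      }
      ```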
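
      For step 5, the entries being edited look something like the following. The property name here assumes the AMQ Hub AMQP 1.0 JMS Spring Boot starter, and the ports are placeholders – the attached apps may use different property names:

      ```properties
      # Failover URL with the AMQP ports listed in node1, node2 order.
      # Both the producer and consumer apps must list the brokers in the same order.
      amqphub.amqp10jms.remote-url=failover:(amqp://localhost:5672,amqp://localhost:5772)
      ```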


      This problem affects large messages published using an AMQP client, and consumed also using an AMQP client.

      It only (so far as I can tell) affects broker mesh operation – I cannot reproduce it on a single broker. In addition, it only reproduces when there has been a fail-over between the brokers, with the consumer connected.

      My test produces 10 000 messages to a single queue, and consumes 10 000 messages. Both the producer and the consumer are at all times connected to the same broker, but the brokers are shut down alternately during the test, so both producer and consumer transfer from one broker to the other.

      At the end of the test, `artemis queue stat` reveals no messages in any queue on either broker, and yet there will always be at least a few files in `large-messages/` in one or both brokers. In the worst cases, I have seen over a thousand files stranded in `large-messages/`.

      I must stress that no messages are lost – the test does not complete until all messages are sent and received. The test applications keep track of the number of messages, so there is no doubt. The problem is that, over time, huge numbers of files can accumulate in `large-messages/`, and they are not removed.

      They can't be removed manually, because some of these files will correspond to real messages that are waiting to be consumed; there is no easy way to tell which files correspond to real messages and which do not.

      People: Clebert Suconic (csuconic@redhat.com), Kevin Boone (rhn-support-kboone), Tiago Bueno