AMQ Broker / ENTMQBR-6416

Missed Acknowledgements for MQTT Messages Published with QoS 2 (EXACTLY_ONCE)


Details


      I was able to reproduce this in a local environment by setting up a relatively vanilla 2-node broker cluster (not sure this is strictly necessary). Memory was set to 4G min/max on both brokers. To see the issue in the logs, enable TRACE logging for MQTT:

      logger.org.apache.activemq.artemis.core.protocol.mqtt.level=TRACE
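
      For context, a minimal sketch of where this entry lives, assuming a broker version that still uses the JBoss LogManager based etc/logging.properties (the existing loggers list is longer; only the MQTT additions are shown):

      # etc/logging.properties: append the MQTT category to the existing loggers list ...
      loggers=...,org.apache.activemq.artemis.core.protocol.mqtt
      # ... and raise its level to TRACE
      logger.org.apache.activemq.artemis.core.protocol.mqtt.level=TRACE

      The change takes effect on broker restart, and the trace output then shows up in log/artemis.log.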
      

      To reproduce, set up the broker cluster, then extract the attached consumer and producer applications locally.

      On the node1 broker (or even both), try modifying the network to introduce some packet delay and loss (as root):

      [root@node1 ~]# tc qdisc add dev eth0 root netem delay 600ms 200ms loss 5% 25% distribution normal
      

      This may not be strictly necessary, as I was able to reproduce the issue once or twice without it, but reproduction was more consistent with it. If possible, use a multi-homed host so the broker can be configured to use the modified interface while ssh/scp can use the other interface.
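
      When finished (or between attempts), the netem rule can be removed again; a small sketch, assuming the same interface as above:

      [root@node1 ~]# tc qdisc del dev eth0 root
      [root@node1 ~]# tc qdisc show dev eth0

      The second command simply confirms the interface is back on its default qdisc.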

      1. Modify the consumer jndi.properties to update the remote.address property to correspond to the node2 broker in the cluster and start the consumer.

      2. Modify the producer jndi.properties to update the remote.address property to correspond to the node1 broker in the cluster and start the producer.

      3. Wait until the producer seems to stop producing messages. If the producer exits cleanly, it probably means the issue didn't reproduce.

      4. Check the producer log with grep to compare the number of inbound PUBLISH events against the number of outbound PUBREC events (a combined check is sketched after these steps):

      cat ../log/artemis.log | grep PUBLISH | grep ' IN <<' | wc -l
      cat ../log/artemis.log | grep PUBREC | grep ' OUT >>' | wc -l
      

      If these numbers are unequal, there should be a hung thread in the producer application. If the broker is restarted, the producer should resume production and finish the batch.
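
      For convenience, the two counts from step 4 can be compared in one short script; this is just a sketch, assuming the same log location and trace patterns as above:

      #!/bin/sh
      # Count inbound PUBLISH vs. outbound PUBREC trace events and flag any mismatch.
      LOG=../log/artemis.log
      pub=$(grep PUBLISH "$LOG" | grep -c ' IN <<')
      rec=$(grep PUBREC "$LOG" | grep -c ' OUT >>')
      echo "PUBLISH in : $pub"
      echo "PUBREC  out: $rec"
      [ "$pub" -eq "$rec" ] || echo "Mismatch: $((pub - rec)) PUBLISH(es) without a PUBREC"

      If the counts differ, a thread dump of the producer (for example with jstack, assuming the attached producer is a plain Java process) should show the publishing thread blocked waiting for the missing acknowledgement.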


    Description

      When publishing messages to the broker from an external client (.NET MQTT) with QoS 2, the broker sometimes fails to acknowledge a message with a PUBREC, resulting in a timeout and resend on the client side and breaking SLAs. After enabling TRACE logging for org.apache.activemq.artemis.core.protocol.mqtt we observed that the number of PUBRECs logged is also lower than the number of inbound PUBLISH events logged:

      cat artemis.log | grep -a PUBLISH | grep -a ' IN ' | wc -l
      cat artemis.log | grep -a PUBREC | grep -a ' OUT ' | wc -l
      

      Since the broker's own trace shows the inbound PUBLISH arriving but no corresponding PUBREC going out, this appears to be the broker failing to acknowledge, rather than the acknowledgement being lost as a dropped packet.
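
      To watch the handshake live while reproducing, rather than counting after the fact, the same trace categories can be followed as they are written; a sketch, assuming the log file name used above:

      tail -F artemis.log | grep -a --line-buffered -E 'PUBLISH|PUBREC'

      Every inbound PUBLISH should be followed by an outbound PUBREC for the same packet id; a PUBLISH with no matching PUBREC is the failure described here.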


            People

              rhn-support-jbertram Justin Bertram
              rhn-support-dhawkins Duane Hawkins
              Votes: 0
              Watchers: 5
