AMQ Streams / ENTMQST-5502

Streams MM2 offset synchronization is erratic in 2.5.


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • Affects Version: 2.5.0.GA
    • Component: kafka-broker
    • Severity: Important

      We have noticed a number of oddities in the way that offsets are synchronized by MM2 that were not present in Streams 2.4. We suspect that these oddities might be the result of the work done in KAFKA-14666. These problems can be reproduced quite easily using two Streams clusters in different namespaces on OpenShift, provided the namespaces are allowed to communicate using local services. Otherwise, routes have to be used, which means configuring TLS and certificates and complicates the set-up considerably.

      We have set up MM2 using the attached `mm2.yaml`.
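
      For context, here is a minimal sketch of the kind of `KafkaMirrorMaker2` resource involved. This is illustrative rather than the actual attachment; the names, namespaces, and bootstrap addresses are made up (AMQ Streams 2.5 ships Kafka 3.5.0):

      ```yaml
      apiVersion: kafka.strimzi.io/v1beta2
      kind: KafkaMirrorMaker2
      metadata:
        name: my-mm2
        namespace: target-ns
      spec:
        version: 3.5.0
        replicas: 1
        connectCluster: "target"
        clusters:
          - alias: "source"
            bootstrapServers: source-kafka-bootstrap.source-ns.svc:9092
          - alias: "target"
            bootstrapServers: target-kafka-bootstrap.target-ns.svc:9092
        mirrors:
          - sourceCluster: "source"
            targetCluster: "target"
            topicsPattern: ".*"
            groupsPattern: ".*"
            sourceConnector:
              config:
                replication.factor: 1
                offset-syncs.topic.replication.factor: 1
            checkpointConnector:
              config:
                checkpoints.topic.replication.factor: 1
                # Needed for MM2 to write translated consumer group
                # offsets into the target cluster
                sync.group.offsets.enabled: "true"
                sync.group.offsets.interval.seconds: 30
      ```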

      Here is one way to reproduce one of the problems, though we have seen others in our testing, as has the customer. Illustrative versions of the commands are sketched after the list.

      1. Set the number of MM2 replicas to zero, so it isn't running.

      2. Send a million messages to a specific topic using `kafka-console-producer.sh`.

      3. Start a consumer on that topic with `kafka-console-consumer.sh`, specifying a consumer group ID.

      4. At the same time, monitor the consumer group using `kafka-consumer-groups.sh`. Stop the consumer when roughly half the messages have been consumed, i.e., when the utility reports something like `log length=1000000, last offset=500000, lag=500000`.

      5. Stop the producer and the consumer.

      6. Start MM2 by increasing its replica count to 1. Wait a little while (perhaps a minute or so) for the synchronization to happen; it can be seen in the MM2 log.

      7. Run `kafka-consumer-groups.sh` on the target system for the same consumer group. In our tests we consistently saw `length=1000000, last offset=1000000, lag=0`; that is, the offset appears to be at the end of the log on the target system.

      8. Try to consume messages from the target using `kafka-console-consumer.sh`. For the consumer group ID we have been using, no messages are consumed, presumably because the offset is at the end of the log.
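
      For concreteness, the commands we ran look roughly like the following. The bootstrap addresses, topic name, and group name are illustrative, and `seq` is just a convenient way to generate a million distinct messages:

      ```sh
      # Steps 1 and 6: scale MM2 down (and later back up) by patching the resource
      oc -n target-ns patch kafkamirrormaker2 my-mm2 \
        --type merge -p '{"spec":{"replicas":0}}'   # later: "replicas":1

      # Step 2: produce a million messages to the source topic
      seq 1 1000000 | ./bin/kafka-console-producer.sh \
        --bootstrap-server source-kafka-bootstrap.source-ns.svc:9092 \
        --topic test-topic

      # Step 3: consume with a specific group ID; interrupt it about half-way
      ./bin/kafka-console-consumer.sh \
        --bootstrap-server source-kafka-bootstrap.source-ns.svc:9092 \
        --topic test-topic --group test-group --from-beginning > /dev/null

      # Steps 4 and 7: describe the group's position and lag (against the
      # source while consuming, against the target after synchronization)
      ./bin/kafka-consumer-groups.sh \
        --bootstrap-server source-kafka-bootstrap.source-ns.svc:9092 \
        --describe --group test-group

      # Step 8: try to consume from the target with the same group ID. Note
      # that, depending on the replication policy in mm2.yaml, the mirrored
      # topic may be named test-topic or source.test-topic.
      ./bin/kafka-console-consumer.sh \
        --bootstrap-server target-kafka-bootstrap.target-ns.svc:9092 \
        --topic test-topic --group test-group
      ```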

      We appreciate that the same messages will not necessarily appear at the same offsets in the mirrored Kafka cluster. However, we would expect consumers to start consuming from (more or less) the same place. In all our tests, consumers do not get the messages they expect; often the offsets are wildly wrong.

      We tried setting `offset.lag.max` to zero, but it made no difference.
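
      For the record, that setting was applied to the source connector, along the lines of this fragment of the sketch above:

      ```yaml
      sourceConnector:
        config:
          # offset.lag.max is the maximum offset drift allowed before an
          # offset sync is emitted; 0 should make offset translation as
          # precise as the mechanism allows
          offset.lag.max: 0
      ```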

      Arguably, replicating a cluster with no producers or consumers running is not a realistic operation. But we saw problems when the producers and consumers were running as well; they are just harder to quantify.

       

    • Assignee: Unassigned
    • Reporter: Kevin Boone (rhn-support-kboone)