  1. Kogito
  2. KOGITO-4971

Data index task events error management might produce unreliable scenarios?


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component/s: Data Index

    Description

      Hi cnicolai,
      sorry for the long explanation below, but I believe there is some behavior here that would make sense to take a look at.

      I originally ran into the scenario below because of a random error that I can't reproduce right now.
      However, by forcing the code to fail (e.g. if (processId.equals("A")) { throw new RuntimeException("emulated error"); }), the problem is reproducible in terms of what happens when the processing of an event fails.
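      A minimal sketch of how the failure can be forced (the handler, event type and getter below are illustrative placeholders, not the actual data-index code):

          import java.util.concurrent.CompletableFuture;
          import java.util.concurrent.CompletionStage;

          public class FailingTaskEventHandler {

              public CompletionStage<Void> onTaskEvent(TaskEvent event) {
                  if ("ProcessA".equals(event.getProcessId())) {
                      // Emulate the Infinispan/indexing error for events coming from ProcessA.
                      throw new RuntimeException("emulated indexing failure");
                  }
                  return CompletableFuture.completedFuture(null); // normal indexing path (elided)
              }

              // Placeholder event type, only so the sketch is self-contained.
              public static class TaskEvent {
                  private final String processId;
                  public TaskEvent(String processId) { this.processId = processId; }
                  public String getProcessId() { return processId; }
              }
          }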

      This is the scenario:

      1) I have ProcessA that generates TaskA, and ProcessB that generates TaskB

      2) data index is up and running

      3) An instance of ProcessA is created, a task instance TaskAInstance1 is created, and the corresponding TaskAInstance1Event arrives at the data-index

      4) An Infinispan error occurs (or any other error while the event is being processed)

      5) the event loop ends up at this line (see the sketch after the scenario for the general pattern):

      https://github.com/kiegroup/kogito-apps/blob/8e04ea41dcc99df454fa8dcc958ee64618f8d51d/data-index/data-index-service/data-index-service-common/src/main/java/org/kie/kogito/index/messaging/ReactiveMessagingEventConsumer.java#L118

      6) the just generated TaskAInstance1 information is not indexed. If you query the index there are no TaskA instances.

      7) after some milliseconds, lines like these are shown in the log:

      2021-04-19 10:55:22,648 WARN [io.sma.rea.mes.kafka] (vert.x-eventloop-thread-4) SRMSG18231: The amount of received messages without acking is too high for topic partition 'TopicPartition{topic=kogito-usertaskinstances-events, partition=0}', amount 1. The last committed offset was -1. It means that the Kafka connector received Kafka Records that have neither be acked nor nacked in a timely fashion. The connector accumulates records in memory, but that buffer reached its maximum capacity or the deadline for ack/nack expired. The connector cannot commit as a record processing has not completed.
      2021-04-19 10:55:22,650 WARN [io.sma.rea.mes.kafka] (vert.x-eventloop-thread-0) SRMSG18228: A failure has been reported for Kafka topics '[kogito-processinstances-events]': io.smallrye.reactive.messaging.kafka.commit.KafkaThrottledLatestProcessedCommit$TooManyMessagesWithoutAckException: Too Many Messages without acknowledgement in topic kogito-processinstances-events (partition:0), last committed offset is -1

      2021-04-19 10:55:22,654 INFO [org.apa.kaf.cli.con.int.AbstractCoordinator] (vert.x-kafka-consumer-thread-0) [Consumer clientId=kafka-consumer-kogito-usertaskdomain-events, groupId=kogito-data-index-usertaskdomain] Member kafka-consumer-kogito-usertaskdomain-events-09e0e289-b926-4f0a-9773-744d6913ca8f sending LeaveGroup request to coordinator localhost:9092 (id: 2147483647 rack: null) due to the consumer is being closed

      8) the data-index continues operating well, in the sense that you can execute queries, etc., with no problems. But information about TaskAInstance1 is not present.

      9) now I've created a couple of instances of ProcessB, which generate two events on the runtime side: TaskBInstance1Event and TaskBInstance2Event

      10) These events don't arrive at the data-index; no processing is done.

      11) users can still query the data index, etc. It can basically continue running for an arbitrary amount of time with no problem. Information about the tasks above is not present.

      12) at some point I shut the data-index down

      13) I start the data-index again

      14) Now the TaskAInstance1Event, TaskBInstance1Event and TaskBInstance2Event events arrive again.

      (with the forced error, TaskAInstance1Event fails again. However, this time TaskBInstance1Event and TaskBInstance2Event are processed)

      So this time users can query the data-index and get information about TaskBInstance1 and TaskBInstance2. But no information about TaskAInstance1 is present.

      15) the data index continues operating well.

      16) the same Kafka logs as above are printed

      17) after the Kafka timeouts, the data-index receives no more events.
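      To make steps 5) to 7) more concrete, here is a minimal sketch of the general pattern involved. This is not the actual kogito-apps consumer, just an assumed shape of a SmallRye Reactive Messaging consumer whose failure path neither acks nor nacks the record; once the throttled commit strategy's deadline for unacknowledged records expires (throttled.unprocessed-record-max-age.ms), the connector reports SRMSG18231/SRMSG18228 and stops delivering new records, while queries keep working:

          import java.util.concurrent.CompletableFuture;
          import java.util.concurrent.CompletionStage;
          import org.eclipse.microprofile.reactive.messaging.Incoming;
          import org.eclipse.microprofile.reactive.messaging.Message;

          public class TaskEventsConsumerSketch {

              @Incoming("kogito-usertaskinstances-events")
              public CompletionStage<Void> onUserTaskInstanceEvent(Message<String> message) {
                  try {
                      indexEvent(message.getPayload()); // may throw, e.g. on an Infinispan failure
                      return message.ack();
                  } catch (Exception e) {
                      // The failure is only logged; the message is neither acked nor nacked,
                      // which is what later produces the 'Too Many Messages without
                      // acknowledgement' warnings and stops event consumption.
                      System.err.println("Error processing event: " + e.getMessage());
                      return CompletableFuture.completedFuture(null);
                  }
              }

              private void indexEvent(String payload) {
                  // storage/indexing logic elided in this sketch
              }
          }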

      I believe that the "problem" we have here is that the data-index is "failing silently" when the event processing fails, and thus continues operating "well" in the sense that an external component that executes queries can still keep querying, but the data-index has somehow entered a corrupt status, since after the failure point no more events are received.

      Does it make sense?

      Not sure exactly what the best path is here, but I think this is something that could make sense to have a look at.

      Off the top of my head, maybe it could make sense to combine one of the Kafka nacking alternatives (possibly just fail fast, or the dead-letter queue approach) and let the index fail fast in these situations, so that external applications can know that the data-index got into a corrupt status. This could be combined with a negative health check maybe, to let external monitoring mechanisms be aware of this situation too; see the sketches below.
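      For illustration, the SmallRye Kafka connector already supports per-channel failure strategies, so nacking with fail-fast or a dead-letter queue would roughly be a matter of configuration like the following (the channel name is assumed here and would have to match the data-index application.properties):

          # fail: stop processing on the first nacked record (instead of failing silently)
          mp.messaging.incoming.kogito-usertaskinstances-events.failure-strategy=fail

          # or: route failed records to a dead-letter topic and keep consuming
          # mp.messaging.incoming.kogito-usertaskinstances-events.failure-strategy=dead-letter-queue
          # mp.messaging.incoming.kogito-usertaskinstances-events.dead-letter-queue.topic=kogito-usertaskinstances-events-dlq

      And the negative health check could be as simple as a MicroProfile Health readiness check that flips to DOWN once an event-processing failure has been recorded (class and method names below are just illustrative):

          import java.util.concurrent.atomic.AtomicBoolean;
          import javax.enterprise.context.ApplicationScoped;
          import org.eclipse.microprofile.health.HealthCheck;
          import org.eclipse.microprofile.health.HealthCheckResponse;
          import org.eclipse.microprofile.health.Readiness;

          @Readiness
          @ApplicationScoped
          public class EventIndexingHealthCheck implements HealthCheck {

              // Set by the event consumer when processing an event fails.
              private final AtomicBoolean indexingFailed = new AtomicBoolean(false);

              public void reportFailure() {
                  indexingFailed.set(true);
              }

              @Override
              public HealthCheckResponse call() {
                  return indexingFailed.get()
                          ? HealthCheckResponse.down("data-index-event-consumption")
                          : HealthCheckResponse.up("data-index-event-consumption");
              }
          }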

      wdyt?


            People

              Assignee: Cristiano Nicolai (cnicolai@redhat.com)
              Reporter: Walter Medvedeo (wmedvede@redhat.com)
              Dominik Hanak
