Uploaded image for project: 'AMQ Streams'
  1. AMQ Streams
  2. ENTMQST-3273

UserOperator timeouts request(s) for creation of Kafka Users when more are created at once

XMLWordPrintable

    • Icon: Task Task
    • Resolution: Done
    • Icon: Major Major
    • 2.4.0.GA
    • Future.GA
    • user-operator

      There seems to be some problem with UserOperator/Kafka while creating more KafkaUsers at once.
      The scenario is simple. Create 100 KUs from template below.

      $ cat kafka-user-template.yaml
      apiVersion: kafka.strimzi.io/v1beta2
      kind: KafkaUser
      metadata: 
        labels: 
          strimzi.io/cluster: user-cluster-name
        name: my-user-000
        namespace: user-cluster-test
      spec: 
        authentication: 
          type: scram-sha-512
      status: 
        secret: my-user-000
        username: my-user-000
      
      # Create users
      for i in $(seq 0 99); do cp kafka-user-template.yaml kafkauser-$(printf "%.3d\n" $i).yaml; done;
      
      # make correct numbers in files
      for i in $(seq 0 99); do gsed -i "/000/s/000/$(printf "%.3d\n" $i)/" kafkauser-$(printf "%.3d\n" $i).yaml; done;
      
      # Create users, wait for ready state after creation
      for i in $(seq 0 99); do 
        number=$(printf "%.3d\n" $i);
        kubectl apply -f kafkauser-$number.yaml -n user-cluster-test;
        kubectl wait --for=condition=ready kafkauser/my-user-$number -n user-cluster-test;
      done;
      
      # Teardown
      for i in $(seq 0 99); do 
        number=$(printf "%.3d\n" $i);
        kubectl delete -f kafkauser-$number.yaml -n user-cluster-test;
      done;
      

      The problem seems to be when UO timeouts KU creation request. As another request comes in to create another KU (which timeouts as well), it might overwhelm UO after some time.
      Effectively putting all present KafkaUsers from Ready state to not Ready state after some time.

      This scenario bug was triggered even when creating such users in bulk of 20s. After 1st bulk, all 20 users were in ready state. After Creating user ~32-34 timeouts kicked in all KU become not-ready.

      This is the only relevant debug log I found while watching UO logs:

      2021-09-13 10:22:46 DEBUG NetworkClient:880 - [AdminClient clientId=adminclient-1] Received DESCRIBE_CLIENT_QUOTAS response from node 0 for request with header RequestHeader(apiKey=DESCRIBE_CLIENT_QUOTAS, apiVersion=1, clientId=adminclient-1, correlationId=3089): DescribeClientQuotasResponseData(throttleTimeMs=0, errorCode=0, errorMessage='', entries=[])
      2021-09-13 10:22:46 DEBUG KafkaAdminClient:1099 - [AdminClient clientId=adminclient-1] Sending DescribeClientQuotasRequestData(components=[ComponentData(entityType='user', matchType=0, match='CN=my-user-402420895-684675850-11')], strict=false) to user-cluster-name-kafka-0.user-cluster-name-kafka-brokers.user-cluster-test.svc:9091 (id: 0 rack: null). correlationId=3090
      2021-09-13 10:22:46 DEBUG NetworkClient:522 - [AdminClient clientId=adminclient-1] Sending DESCRIBE_CLIENT_QUOTAS request with header RequestHeader(apiKey=DESCRIBE_CLIENT_QUOTAS, apiVersion=1, clientId=adminclient-1, correlationId=3090) and timeout 1365 to node 0: DescribeClientQuotasRequestData(components=[ComponentData(entityType='user', matchType=0, match='CN=my-user-402420895-684675850-11')], strict=false)
      2021-09-13 10:22:46 DEBUG QuotasOperator:58 - Reconciliation #1115(watch) KafkaUser(user-cluster-test/my-user-402420895-684675850-54): No expected quotas and no existing quotas -> NoOp
      2021-09-13 10:22:46 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=alterUserScramCredentials, deadlineMs=1631528566317, tries=1, nextAllowedTryMs=1631528566418) timed out at 1631528566318 after 1 attempt(s)
      java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: alterUserScramCredentials
      	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
      	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
      	at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
      	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
      	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
      	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
      	at java.lang.Thread.run(Thread.java:829) [?:?]
      2021-09-13 10:22:46 DEBUG KafkaAdminClient:997 - [AdminClient clientId=adminclient-1] Timed out 1 call(s) with assigned nodes.
      2021-09-13 10:22:46 ERROR AbstractOperator:240 - Reconciliation #1115(watch) KafkaUser(user-cluster-test/my-user-402420895-684675850-54): createOrUpdate failed
      org.apache.kafka.common.errors.TimeoutException: Call(callName=alterUserScramCredentials, deadlineMs=1631528566317, tries=1, nextAllowedTryMs=1631528566418) timed out at 1631528566318 after 1 attempt(s)
      Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: alterUserScramCredentials
      2021-09-13 10:22:46 DEBUG CustomResource:172 - Calling CustomResource#setKind doesn't do anything because the Kind is computed and shouldn't be changed
      2021-09-13 10:22:46 DEBUG StatusDiff:42 - Ignoring Status diff {"op":"replace","path":"/conditions/0/lastTransitionTime","value":"2021-09-13T10:22:46.318848Z"}
      2021-09-13 10:22:46 DEBUG StatusDiff:46 - Status differs: {"op":"replace","path":"/conditions/0/message","value":"Call(callName=alterUserScramCredentials, deadlineMs=1631528566317, tries=1, nextAllowedTryMs=1631528566418) timed out at 1631528566318 after 1 attempt(s)"}
      

      Important note1: This is not happening on OCP. Only on minikube. (Haven't tested other kube distributions).

      Note2: I have noticed some timeout happened while leader voting happened (despite having only 1 ZK and 1 K)
      Note3: Disabled some tests in systemtest/src/test/java/io/strimzi/systemtest/operators/user/UserST.java

            Unassigned Unassigned
            mtoth@redhat.com Michal Toth
            Lukas Kral Lukas Kral
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: