-
Task
-
Resolution: Done
-
Major
-
Future.GA
-
False
-
False
There seems to be some problem with UserOperator/Kafka while creating more KafkaUsers at once.
The scenario is simple. Create 100 KUs from template below.
$ cat kafka-user-template.yaml apiVersion: kafka.strimzi.io/v1beta2 kind: KafkaUser metadata: labels: strimzi.io/cluster: user-cluster-name name: my-user-000 namespace: user-cluster-test spec: authentication: type: scram-sha-512 status: secret: my-user-000 username: my-user-000
# Create users for i in $(seq 0 99); do cp kafka-user-template.yaml kafkauser-$(printf "%.3d\n" $i).yaml; done; # make correct numbers in files for i in $(seq 0 99); do gsed -i "/000/s/000/$(printf "%.3d\n" $i)/" kafkauser-$(printf "%.3d\n" $i).yaml; done; # Create users, wait for ready state after creation for i in $(seq 0 99); do number=$(printf "%.3d\n" $i); kubectl apply -f kafkauser-$number.yaml -n user-cluster-test; kubectl wait --for=condition=ready kafkauser/my-user-$number -n user-cluster-test; done; # Teardown for i in $(seq 0 99); do number=$(printf "%.3d\n" $i); kubectl delete -f kafkauser-$number.yaml -n user-cluster-test; done;
The problem seems to be when UO timeouts KU creation request. As another request comes in to create another KU (which timeouts as well), it might overwhelm UO after some time.
Effectively putting all present KafkaUsers from Ready state to not Ready state after some time.
This scenario bug was triggered even when creating such users in bulk of 20s. After 1st bulk, all 20 users were in ready state. After Creating user ~32-34 timeouts kicked in all KU become not-ready.
This is the only relevant debug log I found while watching UO logs:
2021-09-13 10:22:46 DEBUG NetworkClient:880 - [AdminClient clientId=adminclient-1] Received DESCRIBE_CLIENT_QUOTAS response from node 0 for request with header RequestHeader(apiKey=DESCRIBE_CLIENT_QUOTAS, apiVersion=1, clientId=adminclient-1, correlationId=3089): DescribeClientQuotasResponseData(throttleTimeMs=0, errorCode=0, errorMessage='', entries=[]) 2021-09-13 10:22:46 DEBUG KafkaAdminClient:1099 - [AdminClient clientId=adminclient-1] Sending DescribeClientQuotasRequestData(components=[ComponentData(entityType='user', matchType=0, match='CN=my-user-402420895-684675850-11')], strict=false) to user-cluster-name-kafka-0.user-cluster-name-kafka-brokers.user-cluster-test.svc:9091 (id: 0 rack: null). correlationId=3090 2021-09-13 10:22:46 DEBUG NetworkClient:522 - [AdminClient clientId=adminclient-1] Sending DESCRIBE_CLIENT_QUOTAS request with header RequestHeader(apiKey=DESCRIBE_CLIENT_QUOTAS, apiVersion=1, clientId=adminclient-1, correlationId=3090) and timeout 1365 to node 0: DescribeClientQuotasRequestData(components=[ComponentData(entityType='user', matchType=0, match='CN=my-user-402420895-684675850-11')], strict=false) 2021-09-13 10:22:46 DEBUG QuotasOperator:58 - Reconciliation #1115(watch) KafkaUser(user-cluster-test/my-user-402420895-684675850-54): No expected quotas and no existing quotas -> NoOp 2021-09-13 10:22:46 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=alterUserScramCredentials, deadlineMs=1631528566317, tries=1, nextAllowedTryMs=1631528566418) timed out at 1631528566318 after 1 attempt(s) java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: alterUserScramCredentials at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?] at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?] at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?] at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?] at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?] at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?] at java.lang.Thread.run(Thread.java:829) [?:?] 2021-09-13 10:22:46 DEBUG KafkaAdminClient:997 - [AdminClient clientId=adminclient-1] Timed out 1 call(s) with assigned nodes. 2021-09-13 10:22:46 ERROR AbstractOperator:240 - Reconciliation #1115(watch) KafkaUser(user-cluster-test/my-user-402420895-684675850-54): createOrUpdate failed org.apache.kafka.common.errors.TimeoutException: Call(callName=alterUserScramCredentials, deadlineMs=1631528566317, tries=1, nextAllowedTryMs=1631528566418) timed out at 1631528566318 after 1 attempt(s) Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: alterUserScramCredentials 2021-09-13 10:22:46 DEBUG CustomResource:172 - Calling CustomResource#setKind doesn't do anything because the Kind is computed and shouldn't be changed 2021-09-13 10:22:46 DEBUG StatusDiff:42 - Ignoring Status diff {"op":"replace","path":"/conditions/0/lastTransitionTime","value":"2021-09-13T10:22:46.318848Z"} 2021-09-13 10:22:46 DEBUG StatusDiff:46 - Status differs: {"op":"replace","path":"/conditions/0/message","value":"Call(callName=alterUserScramCredentials, deadlineMs=1631528566317, tries=1, nextAllowedTryMs=1631528566418) timed out at 1631528566318 after 1 attempt(s)"}
Important note1: This is not happening on OCP. Only on minikube. (Haven't tested other kube distributions).
Note2: I have noticed some timeout happened while leader voting happened (despite having only 1 ZK and 1 K)
Note3: Disabled some tests in systemtest/src/test/java/io/strimzi/systemtest/operators/user/UserST.java
- relates to
-
ENTMQST-1837 Use Kafka Admin API for managing SCRAM users in User Operator
- Closed