Bug | Resolution: Duplicate | Normal | 4.14 | Quality / Stability / Reliability | Moderate
Description of problem
Provisioning a load balancer for a Kubernetes service object with type: LoadBalancer can take upwards of 5 minutes on GCP when many LBs are provisioned simultaneously. Deleting the services can also take several minutes.
Version-Release number of selected component (if applicable)
I have seen this impacting 4.14 CI jobs, and I have reproduced the issue with 4.14.0-0.nightly-2023-08-11-055332.
How reproducible
I can reproduce this issue reliably when creating 10 or more LBs at once.
Steps to Reproduce
1. Create 10 or more services with type: LoadBalancer in parallel.
2. Watch the services' statuses.
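For step 2, the statuses can also be watched interactively with the stock watch flag on oc get; the EXTERNAL-IP column stays <pending> until the cloud load balancer is ready:
oc -n test get services -w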
To do this programmatically, I defined some shell functions:
setup() {
    echo "Creating namespace..."
    oc create namespace test
}

cleanup() {
    local start=$SECONDS
    echo "Deleting namespace..."
    oc delete namespace test
    echo "Namespace was deleted in $((SECONDS-start)) seconds."
}

lb_test() {
    local lb_name=$1
    echo "Creating service $lb_name..."
    oc create -f - <<-EOF
apiVersion: v1
kind: Service
metadata:
  name: $lb_name
  namespace: test
spec:
  ports:
  - port: 80
  type: LoadBalancer
EOF
    local start=$SECONDS
    while [[ -z "$(oc -n test get "services/$lb_name" -o 'jsonpath={.status.loadBalancer.ingress[*].ip}')" ]]; do
        sleep 1
    done
    echo "Service load-balancer for $lb_name was provisioned in $((SECONDS-start)) seconds."
}
See "Actual results" for usage.
Actual results
For some of the services, provisioning the load balancer can take over 5 minutes. Deleting the namespace also takes several minutes. For example:
$ setup && for i in {1..10} ; do lb_test "test$i" & done ; wait
Creating namespace...
namespace/test created
[1] 3810082
[2] 3810083
Creating service test1...
[3] 3810084
Creating service test2...
[4] 3810086
Creating service test3...
Creating service test4...
[5] 3810088
[6] 3810091
Creating service test5...
Creating service test6...
[7] 3810092
Creating service test7...
[8] 3810095
[9] 3810097
Creating service test8...
[10] 3810098
Creating service test9...
Creating service test10...
service/test1 created
service/test3 created
service/test10 created
service/test2 created
service/test4 created
service/test7 created
service/test5 created
service/test8 created
service/test9 created
service/test6 created
Service load-balancer for test1 was provisioned in 41 seconds.
[1] Done lb_test "test$i"
Service load-balancer for test3 was provisioned in 70 seconds.
Service load-balancer for test10 was provisioned in 101 seconds.
Service load-balancer for test2 was provisioned in 134 seconds.
[2] Done lb_test "test$i"
[3] Done lb_test "test$i"
[10]+ Done lb_test "test$i"
Service load-balancer for test4 was provisioned in 165 seconds.
[4] Done lb_test "test$i"
Service load-balancer for test7 was provisioned in 198 seconds.
Service load-balancer for test5 was provisioned in 228 seconds.
[5] Done lb_test "test$i"
[7] Done lb_test "test$i"
Service load-balancer for test8 was provisioned in 259 seconds.
Unable to connect to the server: dial tcp 35.226.199.120:6443: i/o timeout
Service load-balancer for test9 was provisioned in 309 seconds.
Service load-balancer for test6 was provisioned in 323 seconds.
[6] Done lb_test "test$i"
[8]- Done lb_test "test$i"
[9]+ Done lb_test "test$i"
$ cleanup
Deleting namespace...
namespace "test" deleted
Namespace was deleted in 244 seconds.
$
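As an aside, while a service is still pending, its events give a rough picture of where the time is going: the service controller emits an EnsuringLoadBalancer event when it starts provisioning and EnsuredLoadBalancer when it finishes, so inspecting one of the slow services above (test5 here, as an example) shows the gap between the two:
oc -n test describe service test5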
Expected results
Load balancers should be provisioned more quickly.
Additional info
I will provide kube-controller-manager logs in a comment; Jira limits descriptions to 65,535 characters, and including the logs here would push this description over that limit.
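For anyone collecting the same logs, a rough sketch (this assumes the standard openshift-kube-controller-manager namespace and the kube-controller-manager container in the static pods; adjust as needed):
# Dump the kube-controller-manager container log from each control-plane pod.
for pod in $(oc -n openshift-kube-controller-manager get pods -o name | grep '/kube-controller-manager-'); do
    oc -n openshift-kube-controller-manager logs "$pod" -c kube-controller-manager > "$(basename "$pod").log"
done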
This is impacting cluster-ingress-operator's e2e-gcp-operator CI job, which has 13 parallel tests that each create a LoadBalancer-type service, as well as a serial test that creates 2 more. I did some analysis of one test failure resulting from timeouts waiting for LBs to be provisioned here: https://github.com/openshift/cluster-ingress-operator/pull/970#issuecomment-1672492378.