Bug | Resolution: Duplicate | Normal | 4.14 | Quality / Stability / Reliability | Moderate
Description of problem
Provisioning a load balancer for a Kubernetes service object with type: LoadBalancer can take upwards of 5 minutes on GCP when many LBs are provisioned simultaneously. Deleting the services can also take several minutes.
Version-Release number of selected component (if applicable)
I have seen this impacting 4.14 CI jobs, and I have reproduced the issue with 4.14.0-0.nightly-2023-08-11-055332.
How reproducible
I can reproduce this issue reliably when creating 10 or more LBs at once.
Steps to Reproduce
1. Create 10 or more services with type: LoadBalancer in parallel.
2. Watch the services' statuses.
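For step 2, the statuses can also be watched interactively with the stock watch flag on oc get; the EXTERNAL-IP column stays <pending> until the cloud load balancer is ready:
oc -n test get services -w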
To do this programmatically, I defined some shell functions:
setup() {
    echo "Creating namespace..."
    oc create namespace test
}

cleanup() {
    local start=$SECONDS
    echo "Deleting namespace..."
    oc delete namespace test
    echo "Namespace was deleted in $((SECONDS-start)) seconds."
}

lb_test() {
    local lb_name=$1
    echo "Creating service $lb_name..."
    oc create -f - <<-EOF
apiVersion: v1
kind: Service
metadata:
  name: $lb_name
  namespace: test
spec:
  ports:
  - port: 80
  type: LoadBalancer
EOF
    local start=$SECONDS
    while [[ -z "$(oc -n test get "services/$lb_name" -o 'jsonpath={.status.loadBalancer.ingress[*].ip}')" ]]; do
        sleep 1
    done
    echo "Service load-balancer for $lb_name was provisioned in $((SECONDS-start)) seconds."
}
See "Actual results" for usage.
Actual results
For some of the services, provisioning the load balancer can take over 5 minutes. Deleting the namespace also takes several minutes. For example:
$ setup && for i in {1..10} ; do lb_test "test$i" & done ; wait
Creating namespace...
namespace/test created
[1] 3810082
[2] 3810083
Creating service test1...
[3] 3810084
Creating service test2...
[4] 3810086
Creating service test3...
Creating service test4...
[5] 3810088
[6] 3810091
Creating service test5...
Creating service test6...
[7] 3810092
Creating service test7...
[8] 3810095
[9] 3810097
Creating service test8...
[10] 3810098
Creating service test9...
Creating service test10...
service/test1 created
service/test3 created
service/test10 created
service/test2 created
service/test4 created
service/test7 created
service/test5 created
service/test8 created
service/test9 created
service/test6 created
Service load-balancer for test1 was provisioned in 41 seconds.
[1] Done lb_test "test$i"
Service load-balancer for test3 was provisioned in 70 seconds.
Service load-balancer for test10 was provisioned in 101 seconds.
Service load-balancer for test2 was provisioned in 134 seconds.
[2] Done lb_test "test$i"
[3] Done lb_test "test$i"
[10]+ Done lb_test "test$i"
Service load-balancer for test4 was provisioned in 165 seconds.
[4] Done lb_test "test$i"
Service load-balancer for test7 was provisioned in 198 seconds.
Service load-balancer for test5 was provisioned in 228 seconds.
[5] Done lb_test "test$i"
[7] Done lb_test "test$i"
Service load-balancer for test8 was provisioned in 259 seconds.
Unable to connect to the server: dial tcp 35.226.199.120:6443: i/o timeout
Service load-balancer for test9 was provisioned in 309 seconds.
Service load-balancer for test6 was provisioned in 323 seconds.
[6] Done lb_test "test$i"
[8]- Done lb_test "test$i"
[9]+ Done lb_test "test$i"
$ cleanup
Deleting namespace...
namespace "test" deleted
Namespace was deleted in 244 seconds.
$
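As an aside, while a service is still pending, its events give a rough picture of where the time is going: the service controller emits an EnsuringLoadBalancer event when it starts provisioning and EnsuredLoadBalancer when it finishes, so inspecting one of the slow services above (test5 here, as an example) shows the gap between the two:
oc -n test describe service test5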
Expected results
Load balancers should be provisioned more quickly.
Additional info
I will provide kube-controller-manager logs in a comment; Jira limits descriptions to 65,535 characters, and including the logs here would push this description over that limit.
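For anyone collecting the same logs, a rough sketch (this assumes the standard openshift-kube-controller-manager namespace and the kube-controller-manager container in the static pods; adjust as needed):
# Dump the kube-controller-manager container log from each control-plane pod.
for pod in $(oc -n openshift-kube-controller-manager get pods -o name | grep '/kube-controller-manager-'); do
    oc -n openshift-kube-controller-manager logs "$pod" -c kube-controller-manager > "$(basename "$pod").log"
done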
This is impacting cluster-ingress-operator's e2e-gcp-operator CI job, which has 13 parallel tests that each create a LoadBalancer-type service, as well as a serial test that creates 2 more. I did some analysis of one test failure resulting from timeouts waiting for LBs to be provisioned here: https://github.com/openshift/cluster-ingress-operator/pull/970#issuecomment-1672492378.