OpenShift Bugs / OCPBUGS-17670

GCP load-balancers can take over 5 minutes to provision when many are created in parallel


      Description of problem

      Provisioning a load balancer for a Kubernetes service object with type: LoadBalancer can take upwards of 5 minutes on GCP when many LBs are provisioned simultaneously. Deleting the services can also take several minutes.

      Version-Release number of selected component (if applicable)

      I have seen this impacting 4.14 CI jobs, and I have reproduced the issue with 4.14.0-0.nightly-2023-08-11-055332.

      How reproducible

      I can reproduce this issue reliably when creating 10 or more LBs at once.

      Steps to Reproduce

      1. Create 10 or more services with type: LoadBalancer in parallel.
      2. Watch the services' statuses.

      To do this programmatically, I defined some shell functions:

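      # Create the namespace that holds the test services.
      setup() {
        echo "Creating namespace..."
        oc create namespace test
      }

      # Delete the namespace (and the services in it) and report how long that took.
      cleanup() {
        local start=$SECONDS
        echo "Deleting namespace..."
        oc delete namespace test
        echo "Namespace was deleted in $((SECONDS-start)) seconds."
      }

      # Create one LoadBalancer-type service and report how long it takes
      # for the service to be assigned an ingress IP address.
      lb_test() {
        local lb_name=$1
        echo "Creating service $lb_name..."
        oc create -f - <<-EOF
      apiVersion: v1
      kind: Service
      metadata:
        name: $lb_name
        namespace: test
      spec:
        ports:
        - port: 80
        type: LoadBalancer
      EOF
        local start=$SECONDS
        # Poll once per second until the service status reports an ingress IP.
        while [[ -z "$(oc -n test get "services/$lb_name" -o 'jsonpath={.status.loadBalancer.ingress[*].ip}')" ]]; do
          sleep 1
        done
        echo "Service load-balancer for $lb_name was provisioned in $((SECONDS-start)) seconds."
      }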
      

      See "Actual results" for usage.
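The polling loop in lb_test has no upper bound, so it never returns if a load balancer is never provisioned. As a minimal sketch of a bounded wait, the helper below polls an arbitrary command until it prints something or a timeout expires. The name wait_for_output and its arguments are hypothetical and not part of the original reproduction script.

```shell
# Hypothetical helper (not from the original report): run a command once per
# second until it prints non-empty output, or give up after TIMEOUT seconds.
# Usage: wait_for_output TIMEOUT COMMAND [ARGS...]
# The body runs in a subshell so its variables do not leak into the caller.
wait_for_output() (
  timeout=$1
  shift
  deadline=$(( $(date +%s) + timeout ))
  until [ -n "$("$@" 2>/dev/null)" ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1   # timed out without seeing any output
    fi
    sleep 1
  done
)
```

Inside lb_test, the loop could then be written as `wait_for_output 600 oc -n test get "services/$lb_name" -o 'jsonpath={.status.loadBalancer.ingress[*].ip}'`, where the 600-second limit is an arbitrary choice for illustration.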

      Actual results

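      For some of the services, provisioning the load balancer takes over 5 minutes, and deleting the namespace takes about 4 minutes. In the run below, completions are spaced roughly 30 seconds apart, which suggests that the load balancers are provisioned one at a time rather than in parallel. For example: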

      $ setup && for i in {1..10} ; do lb_test "test$i" & done ; wait
      Creating namespace...
      namespace/test created
      [1] 3810082
      [2] 3810083
      Creating service test1...
      [3] 3810084
      Creating service test2...
      [4] 3810086
      Creating service test3...
      Creating service test4...
      [5] 3810088
      [6] 3810091
      Creating service test5...
      Creating service test6...
      [7] 3810092
      Creating service test7...
      [8] 3810095
      [9] 3810097
      Creating service test8...
      [10] 3810098
      Creating service test9...
      Creating service test10...
      service/test1 created
      service/test3 created
      service/test10 created
      service/test2 created
      service/test4 created
      service/test7 created
      service/test5 created
      service/test8 created
      service/test9 created
      service/test6 created
      Service load-balancer for test1 was provisioned in 41 seconds.
      [1]   Done                    lb_test "test$i"
      Service load-balancer for test3 was provisioned in 70 seconds.
      Service load-balancer for test10 was provisioned in 101 seconds.
      Service load-balancer for test2 was provisioned in 134 seconds.
      [2]   Done                    lb_test "test$i"
      [3]   Done                    lb_test "test$i"
      [10]+  Done                    lb_test "test$i"
      Service load-balancer for test4 was provisioned in 165 seconds.
      [4]   Done                    lb_test "test$i"
      Service load-balancer for test7 was provisioned in 198 seconds.
      Service load-balancer for test5 was provisioned in 228 seconds.
      [5]   Done                    lb_test "test$i"
      [7]   Done                    lb_test "test$i"
      Service load-balancer for test8 was provisioned in 259 seconds.
      Unable to connect to the server: dial tcp 35.226.199.120:6443: i/o timeout
      Service load-balancer for test9 was provisioned in 309 seconds.
      Service load-balancer for test6 was provisioned in 323 seconds.
      [6]   Done                    lb_test "test$i"
      [8]-  Done                    lb_test "test$i"
      [9]+  Done                    lb_test "test$i"
      $ cleanup
      Deleting namespace...
      namespace "test" deleted
      Namespace was deleted in 244 seconds.
      $ 
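To make the spacing between completions easier to read off the transcript, here is a small sketch that extracts the provisioning times and prints the gap between each pair of successive completions. The function name provision_gaps is hypothetical, and it assumes the message format printed by lb_test.

```shell
# Hypothetical helper: read lb_test output on stdin and print the number of
# seconds between successive load-balancer completions, one gap per line.
provision_gaps() {
  grep -o 'was provisioned in [0-9]* seconds' |
    awk '{ print $4 }' |
    sort -n |
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Provisioning messages copied from the run above.
provision_gaps <<'EOF'
Service load-balancer for test1 was provisioned in 41 seconds.
Service load-balancer for test3 was provisioned in 70 seconds.
Service load-balancer for test10 was provisioned in 101 seconds.
Service load-balancer for test2 was provisioned in 134 seconds.
Service load-balancer for test4 was provisioned in 165 seconds.
Service load-balancer for test7 was provisioned in 198 seconds.
Service load-balancer for test5 was provisioned in 228 seconds.
Service load-balancer for test8 was provisioned in 259 seconds.
Service load-balancer for test9 was provisioned in 309 seconds.
Service load-balancer for test6 was provisioned in 323 seconds.
EOF
# prints 29, 31, 33, 31, 33, 30, 31, 50, 14 (one per line)
```

Seven of the nine gaps fall between 29 and 33 seconds. The 50-second gap follows the API server connection timeout shown in the transcript.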
      

      Expected results

      Load balancers should be provisioned more quickly.

      Additional info

      I will provide kube-controller-manager logs in a comment, because including them here would push this description past Jira's 65,535-character limit.

      This is impacting cluster-ingress-operator's e2e-gcp-operator CI job, which has 13 parallel tests that each create a LoadBalancer-type service, as well as a serial test that creates 2 more. My analysis of one test failure caused by timeouts while waiting for LBs to be provisioned is at https://github.com/openshift/cluster-ingress-operator/pull/970#issuecomment-1672492378.
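As a rough illustration of what this means for the CI job, the sketch below estimates when the last of N parallel load balancers would become ready. It assumes the pattern from the single run above holds: about 41 seconds for the first LB and about 31 seconds (the average gap in that run) for each additional one. The function name estimate_last_lb and this linear model are assumptions for illustration, not measured behavior of the CI job.

```shell
# Hypothetical estimate: seconds until the last of COUNT parallel LBs is ready,
# assuming the first takes FIRST seconds and each later one adds GAP seconds.
# Usage: estimate_last_lb COUNT FIRST GAP
estimate_last_lb() {
  count=$1 first=$2 gap=$3
  echo $(( first + (count - 1) * gap ))
}

estimate_last_lb 10 41 31   # prints 320, close to the 323 seconds observed above
estimate_last_lb 13 41 31   # prints 413, for the 13 parallel e2e tests
```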

              joelspeed Joel Speed
              mmasters1@redhat.com Miciah Masters
              Milind Yadav Milind Yadav