OCPBUGS-20038: Many SNOs failed to complete install because "the cluster operator cluster-autoscaler is not available"


    • CLOUD Sprint 243
    • Release note:
      * Previously, some conditions during the startup process of the Cluster Autoscaler Operator caused a lock that prevented the Operator from successfully starting and marking itself available. As a result, the cluster became degraded. The issue is resolved with this release. (link:https://issues.redhat.com/browse/OCPBUGS-20038[*OCPBUGS-20038*])
    • Bug Fix
    • Done
    • 9/19: telco prioritization pending triage
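
      The release note above points at a startup-ordering problem: the operator acquires a coordination.k8s.io Lease, then starts its controllers, and only then reports itself Available on its ClusterOperator, so anything that blocks in that sequence leaves the ClusterVersion stuck on "the cluster operator cluster-autoscaler is not available". The sketch below is a minimal, illustrative reconstruction of that startup shape using client-go's leader-election helpers; it is not the operator's actual code, and the identity handling and callbacks are assumptions. The lease namespace and name come from the operator log further down; the timing values are the usual OpenShift leader-election defaults (137 s lease, 107 s renew deadline, 26 s retry period), which are consistent with the "4 retries ... 30s of clock skew ... 2m43s" startup line in that log.

      // Minimal sketch (not the real cluster-autoscaler-operator code) of a
      // "win the Lease, then start controllers, then report Available" startup sequence.
      package main

      import (
          "context"
          "log"
          "os"
          "time"

          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
          "k8s.io/client-go/tools/leaderelection"
          "k8s.io/client-go/tools/leaderelection/resourcelock"
      )

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              log.Fatal(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          // Lease observed in the operator logs: openshift-machine-api/cluster-autoscaler-operator-leader.
          lock, err := resourcelock.New(
              resourcelock.LeasesResourceLock,
              "openshift-machine-api",
              "cluster-autoscaler-operator-leader",
              client.CoreV1(),
              client.CoordinationV1(),
              resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")}, // identity choice is an assumption
          )
          if err != nil {
              log.Fatal(err)
          }

          leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
              Lock: lock,
              // Standard OpenShift defaults, consistent with the startup log line quoted in Additional info.
              LeaseDuration:   137 * time.Second,
              RenewDeadline:   107 * time.Second,
              RetryPeriod:     26 * time.Second,
              ReleaseOnCancel: true,
              Callbacks: leaderelection.LeaderCallbacks{
                  OnStartedLeading: func(ctx context.Context) {
                      // Controllers start only after the lease is won; in this sketch the
                      // ClusterOperator would be marked Available only after this point, so a
                      // block anywhere in the sequence keeps it reported as not available.
                      startControllersAndReportAvailable(ctx)
                  },
                  OnStoppedLeading: func() {
                      // Losing the lease normally means exiting so the pod restarts cleanly.
                      log.Fatal("leader lease lost")
                  },
              },
          })
      }

      // startControllersAndReportAvailable is a hypothetical stand-in for the operator's real startup work.
      func startControllersAndReportAvailable(ctx context.Context) {
          log.Println("controllers started; ClusterOperator would be set Available=True here")
          <-ctx.Done()
      }

      The leaderelection.go errors in the log excerpt under Additional info are this lease read/renew loop failing while the SNO's kube-apiserver is briefly unreachable; whether that is related to the lock described in the release note is not stated in this report.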

      This is a clone of issue OCPBUGS-18954. The following is the description of the original issue:

      Description of problem:

      While installing 3618 SNOs via ZTP using ACM 2.9, 15 clusters failed to complete installation because the cluster-autoscaler cluster operator never became available. This represents the bulk of all cluster install failures in this testbed for OCP 4.14.0-rc.0.
      
      
      # cat aci.InstallationFailed.autoscaler  | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers "
      vm00527 version         False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm00717 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm00881 version         False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm00998 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01006 version         False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01059 version         False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01155 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01930 version         False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm02407 version         False   True   16h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm02651 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03073 version         False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03258 version         False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03295 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03303 version         False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03517 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      
      

      Version-Release number of selected component (if applicable):

      Hub 4.13.11
      Deployed SNOs 4.14.0-rc.0
      ACM 2.9 - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

      How reproducible:

      15 out of 20 failures (75% of the failures)
      15 out of 3618 total attempted SNO installs (~0.4% of all installs)

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      It appears that some of the failed clusters show an error in the cluster-autoscaler-operator logs. Example:
      
      I0912 19:54:39.962897       1 main.go:15] Go Version: go1.20.5 X:strictfipsruntime
      I0912 19:54:39.962977       1 main.go:16] Go OS/Arch: linux/amd64
      I0912 19:54:39.962982       1 main.go:17] Version: cluster-autoscaler-operator v4.14.0-202308301903.p0.gb57f5a9.assembly.stream-dirty
      I0912 19:54:39.963137       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
      I0912 19:54:39.975478       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="127.0.0.1:9191"
      I0912 19:54:39.976939       1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-clusterautoscalers"
      I0912 19:54:39.976984       1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-machineautoscalers"
      I0912 19:54:39.977082       1 main.go:41] Starting cluster-autoscaler-operator
      I0912 19:54:39.977216       1 server.go:216] controller-runtime/webhook/webhooks "msg"="Starting webhook server" 
      I0912 19:54:39.977693       1 certwatcher.go:161] controller-runtime/certwatcher "msg"="Updated current TLS certificate" 
      I0912 19:54:39.977813       1 server.go:273] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=8443
      I0912 19:54:39.977938       1 certwatcher.go:115] controller-runtime/certwatcher "msg"="Starting certificate watcher" 
      I0912 19:54:39.978008       1 server.go:50]  "msg"="starting server" "addr"={"IP":"127.0.0.1","Port":9191,"Zone":""} "kind"="metrics" "path"="/metrics"
      I0912 19:54:39.978052       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
      I0912 19:54:39.982052       1 leaderelection.go:255] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
      I0912 19:54:39.983412       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ClusterAutoscaler"
      I0912 19:54:39.983462       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Deployment"
      I0912 19:54:39.983483       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Service"
      I0912 19:54:39.983501       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ServiceMonitor"
      I0912 19:54:39.983520       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.PrometheusRule"
      I0912 19:54:39.983532       1 controller.go:185]  "msg"="Starting Controller" "controller"="cluster_autoscaler_controller"
      I0912 19:54:39.986041       1 controller.go:177]  "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *v1beta1.MachineAutoscaler"
      I0912 19:54:39.986065       1 controller.go:177]  "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *unstructured.Unstructured"
      I0912 19:54:39.986072       1 controller.go:185]  "msg"="Starting Controller" "controller"="machine_autoscaler_controller"
      I0912 19:54:40.095808       1 webhookconfig.go:72] Webhook configuration status: created
      I0912 19:54:40.101613       1 controller.go:219]  "msg"="Starting workers" "controller"="cluster_autoscaler_controller" "worker count"=1
      I0912 19:54:40.102857       1 controller.go:219]  "msg"="Starting workers" "controller"="machine_autoscaler_controller" "worker count"=1
      E0912 19:58:48.113290       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
      E0912 20:02:48.135610       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
      E0913 13:49:02.118757       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
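
      The three leaderelection.go:327 errors above show the operator failing to read (renew) its Lease while the kube-apiserver is briefly unreachable ([fd02::1]:443 is most likely the cluster's IPv6 "kubernetes" service address); short apiserver outages are expected on a single-node cluster while it finishes installing. As a generic illustration only (an assumption, not the fix that was actually shipped), the sketch below shows the usual client-go pattern for wrapping a one-shot startup call in exponential backoff so that this kind of transient error is retried rather than wedging the startup sequence; the wrapped function is hypothetical.

      // Generic retry/backoff pattern (illustrative assumption, not the shipped fix):
      // tolerate transient apiserver errors such as "TLS handshake timeout" or
      // "connection refused" during a one-shot startup step.
      package main

      import (
          "context"
          "errors"
          "log"
          "net"
          "time"

          apierrors "k8s.io/apimachinery/pkg/api/errors"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/util/retry"
      )

      // ensureWebhookConfiguration is a hypothetical stand-in for a startup step that
      // talks to the kube-apiserver (for example, creating a webhook configuration).
      func ensureWebhookConfiguration(ctx context.Context) error {
          // A real implementation would make a client-go API call here.
          return nil
      }

      // retriable treats apiserver unavailability and plain network errors as transient.
      func retriable(err error) bool {
          if apierrors.IsServiceUnavailable(err) || apierrors.IsTooManyRequests(err) ||
              apierrors.IsTimeout(err) || apierrors.IsServerTimeout(err) {
              return true
          }
          var netErr net.Error
          return errors.As(err, &netErr)
      }

      func main() {
          ctx := context.Background()

          // Exponential backoff: up to six attempts, with delays of roughly 2s, 4s, 8s, 16s, 32s between them.
          backoff := wait.Backoff{
              Duration: 2 * time.Second,
              Factor:   2.0,
              Jitter:   0.1,
              Steps:    6,
          }

          if err := retry.OnError(backoff, retriable, func() error {
              return ensureWebhookConfiguration(ctx)
          }); err != nil {
              log.Fatalf("startup step failed after retries: %v", err)
          }
          log.Println("startup step completed; safe to continue and mark the operator Available")
      }

      Whether the shipped fix retries, reorders the startup sequence, or removes the lock entirely is not visible from this report; the release note only states that the locking condition no longer occurs.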
      
      
      
