OpenShift Bugs / OCPBUGS-18954

Many SNOs failed to complete install because "the cluster operator cluster-autoscaler is not available"


    • Sprint: CLOUD Sprint 242, CLOUD Sprint 243
    • Release note text:
      * Previously, some conditions during the startup process of the Cluster Autoscaler Operator caused a lock that prevented the Operator from successfully starting and marking itself available. As a result, the cluster became degraded. The issue is resolved with this release. (link:https://issues.redhat.com/browse/OCPBUGS-18954[*OCPBUGS-18954*])
    • Release note type: Bug Fix
    • Done
    • 9/19: telco prioritization pending triage

      Description of problem:

      While installing 3618 SNOs via ZTP using ACM 2.9, 15 clusters failed to complete installation because the cluster-autoscaler cluster operator never became available. This accounts for the bulk of all cluster install failures in this testbed for OCP 4.14.0-rc.0.
      
      
      # cat aci.InstallationFailed.autoscaler  | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers "
      vm00527 version         False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm00717 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm00881 version         False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm00998 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01006 version         False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01059 version         False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01155 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm01930 version         False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm02407 version         False   True   16h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm02651 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03073 version         False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03258 version         False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03295 version         False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03303 version         False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
      vm03517 version         False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
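
      To confirm the state on an individual failed SNO, the ClusterOperator conditions and the operator logs can be checked directly. A minimal sketch, using vm00527 as an example and assuming the same /root/hv-vm/kc/<vm>/kubeconfig layout as above (the container name cluster-autoscaler-operator is an assumption):

      # Example for one failed SNO; kubeconfig path follows the layout above
      export KUBECONFIG=/root/hv-vm/kc/vm00527/kubeconfig

      # Condition summary for the cluster-autoscaler ClusterOperator
      oc get clusteroperator cluster-autoscaler \
        -o jsonpath='{range .status.conditions[*]}{.type}={.status} {.message}{"\n"}{end}'

      # Logs from the cluster-autoscaler-operator deployment in openshift-machine-api
      oc -n openshift-machine-api logs deployment/cluster-autoscaler-operator \
        -c cluster-autoscaler-operator --tail=50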
      
      

      Version-Release number of selected component (if applicable):

      Hub 4.13.11
      Deployed SNOs 4.14.0-rc.0
      ACM 2.9 - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52

      How reproducible:

      15 out of 20 install failures on this testbed (75% of the failures)
      15 out of 3618 total attempted SNO installs (~0.4% of all installs)

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

      It appears that some of the failed clusters show an error in the cluster-autoscaler-operator logs. Example:
      
      I0912 19:54:39.962897       1 main.go:15] Go Version: go1.20.5 X:strictfipsruntime
      I0912 19:54:39.962977       1 main.go:16] Go OS/Arch: linux/amd64
      I0912 19:54:39.962982       1 main.go:17] Version: cluster-autoscaler-operator v4.14.0-202308301903.p0.gb57f5a9.assembly.stream-dirty
      I0912 19:54:39.963137       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
      I0912 19:54:39.975478       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="127.0.0.1:9191"
      I0912 19:54:39.976939       1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-clusterautoscalers"
      I0912 19:54:39.976984       1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-machineautoscalers"
      I0912 19:54:39.977082       1 main.go:41] Starting cluster-autoscaler-operator
      I0912 19:54:39.977216       1 server.go:216] controller-runtime/webhook/webhooks "msg"="Starting webhook server" 
      I0912 19:54:39.977693       1 certwatcher.go:161] controller-runtime/certwatcher "msg"="Updated current TLS certificate" 
      I0912 19:54:39.977813       1 server.go:273] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=8443
      I0912 19:54:39.977938       1 certwatcher.go:115] controller-runtime/certwatcher "msg"="Starting certificate watcher" 
      I0912 19:54:39.978008       1 server.go:50]  "msg"="starting server" "addr"={"IP":"127.0.0.1","Port":9191,"Zone":""} "kind"="metrics" "path"="/metrics"
      I0912 19:54:39.978052       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
      I0912 19:54:39.982052       1 leaderelection.go:255] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
      I0912 19:54:39.983412       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ClusterAutoscaler"
      I0912 19:54:39.983462       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Deployment"
      I0912 19:54:39.983483       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Service"
      I0912 19:54:39.983501       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ServiceMonitor"
      I0912 19:54:39.983520       1 controller.go:177]  "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.PrometheusRule"
      I0912 19:54:39.983532       1 controller.go:185]  "msg"="Starting Controller" "controller"="cluster_autoscaler_controller"
      I0912 19:54:39.986041       1 controller.go:177]  "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *v1beta1.MachineAutoscaler"
      I0912 19:54:39.986065       1 controller.go:177]  "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *unstructured.Unstructured"
      I0912 19:54:39.986072       1 controller.go:185]  "msg"="Starting Controller" "controller"="machine_autoscaler_controller"
      I0912 19:54:40.095808       1 webhookconfig.go:72] Webhook configuration status: created
      I0912 19:54:40.101613       1 controller.go:219]  "msg"="Starting workers" "controller"="cluster_autoscaler_controller" "worker count"=1
      I0912 19:54:40.102857       1 controller.go:219]  "msg"="Starting workers" "controller"="machine_autoscaler_controller" "worker count"=1
      E0912 19:58:48.113290       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
      E0912 20:02:48.135610       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
      E0913 13:49:02.118757       1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
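
      The errors above reference the leader-election Lease in openshift-machine-api, with the apiserver temporarily unreachable. On an affected cluster, checking that Lease and the operator pod's restart count shows whether the operator ever recovered after the apiserver came back. A minimal sketch, again using vm00527 as an example under the same kubeconfig layout:

      export KUBECONFIG=/root/hv-vm/kc/vm00527/kubeconfig

      # Lease named in the errors above; spec.renewTime shows the last successful renewal
      oc -n openshift-machine-api get lease cluster-autoscaler-operator-leader -o yaml

      # Restart count of the operator pod after the apiserver became reachable again
      oc -n openshift-machine-api get pods | grep cluster-autoscaler-operator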
      
      
      

        1. must-gather-vm00527.tar.gz
          34.80 MB
        2. must-gather-vm00717.tar-1.gz
          30.32 MB
        3. must-gather-vm00881.tar.gz
          37.20 MB
        4. must-gather-vm00998.tar.gz
          34.32 MB
        5. must-gather-vm01155.tar.gz
          30.25 MB
        6. must-gather-vm01930.tar.gz
          34.84 MB
        7. must-gather-vm01059.tar.gz
          31.54 MB
        8. must-gather-vm01006.tar.gz
          35.69 MB
        9. must-gather-vm02651.tar.gz
          33.57 MB
        10. must-gather-vm02407.tar.gz
          29.71 MB
        11. must-gather-vm03258.tar.gz
          38.07 MB
        12. must-gather-vm03073.tar.gz
          40.27 MB
        13. must-gather-vm03295.tar.gz
          28.56 MB
        14. must-gather-vm03517.tar.gz
          30.24 MB
        15. must-gather-vm03303.tar.gz
          31.27 MB
        16. vm02706-autoscaler-upgrade-fail.tar.gz
          44.99 MB

            Joel Speed
            Alex Krzos (akrzos@redhat.com)
            Zhaohua Sun