-
Bug
-
Resolution: Done-Errata
-
Normal
-
4.14
-
No
-
CLOUD Sprint 242, CLOUD Sprint 243
-
2
-
False
-
-
-
Bug Fix
-
Done
-
9/19: telco prioritization pending triage
-
Description of problem:
While installing 3618 SNOs via ZTP using ACM 2.9, 15 clusters failed to complete installation, each failing on the cluster-autoscaler operator. This represents the bulk of all cluster install failures in this testbed for OCP 4.14.0-rc.0.

# cat aci.InstallationFailed.autoscaler | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers "
vm00527 version   False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00717 version   False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00881 version   False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00998 version   False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01006 version   False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01059 version   False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01155 version   False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01930 version   False   True   17h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm02407 version   False   True   16h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm02651 version   False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03073 version   False   True   19h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03258 version   False   True   20h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03295 version   False   True   14h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03303 version   False   True   15h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03517 version   False   True   18h   Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
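As a follow-up check, the operator's own status conditions can be inspected on any one of the failed SNOs. This is a sketch only: vm00527 and the kubeconfig path follow the testbed layout used above, and the grep pattern simply filters out operators reporting Available=True / Progressing=False / Degraded=False.

# oc --kubeconfig /root/hv-vm/kc/vm00527/kubeconfig get clusteroperator cluster-autoscaler -o yaml
# oc --kubeconfig /root/hv-vm/kc/vm00527/kubeconfig get clusteroperators --no-headers | grep -v 'True *False *False'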
Version-Release number of selected component (if applicable):
Hub 4.13.11 Deployed SNOs 4.14.0-rc.0 ACM 2.9 - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
15 out of 20 install failures (75% of the failures)
15 out of 3618 total attempted SNO installs (~0.4% of all installs)
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
On some of the failed clusters, the cluster-autoscaler-operator logs show errors. Example:

I0912 19:54:39.962897 1 main.go:15] Go Version: go1.20.5 X:strictfipsruntime
I0912 19:54:39.962977 1 main.go:16] Go OS/Arch: linux/amd64
I0912 19:54:39.962982 1 main.go:17] Version: cluster-autoscaler-operator v4.14.0-202308301903.p0.gb57f5a9.assembly.stream-dirty
I0912 19:54:39.963137 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0912 19:54:39.975478 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="127.0.0.1:9191"
I0912 19:54:39.976939 1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-clusterautoscalers"
I0912 19:54:39.976984 1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-machineautoscalers"
I0912 19:54:39.977082 1 main.go:41] Starting cluster-autoscaler-operator
I0912 19:54:39.977216 1 server.go:216] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0912 19:54:39.977693 1 certwatcher.go:161] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0912 19:54:39.977813 1 server.go:273] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=8443
I0912 19:54:39.977938 1 certwatcher.go:115] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0912 19:54:39.978008 1 server.go:50] "msg"="starting server" "addr"={"IP":"127.0.0.1","Port":9191,"Zone":""} "kind"="metrics" "path"="/metrics"
I0912 19:54:39.978052 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
I0912 19:54:39.982052 1 leaderelection.go:255] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
I0912 19:54:39.983412 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ClusterAutoscaler"
I0912 19:54:39.983462 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Deployment"
I0912 19:54:39.983483 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Service"
I0912 19:54:39.983501 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ServiceMonitor"
I0912 19:54:39.983520 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.PrometheusRule"
I0912 19:54:39.983532 1 controller.go:185] "msg"="Starting Controller" "controller"="cluster_autoscaler_controller"
I0912 19:54:39.986041 1 controller.go:177] "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *v1beta1.MachineAutoscaler"
I0912 19:54:39.986065 1 controller.go:177] "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *unstructured.Unstructured"
I0912 19:54:39.986072 1 controller.go:185] "msg"="Starting Controller" "controller"="machine_autoscaler_controller"
I0912 19:54:40.095808 1 webhookconfig.go:72] Webhook configuration status: created
I0912 19:54:40.101613 1 controller.go:219] "msg"="Starting workers" "controller"="cluster_autoscaler_controller" "worker count"=1
I0912 19:54:40.102857 1 controller.go:219] "msg"="Starting workers" "controller"="machine_autoscaler_controller" "worker count"=1
E0912 19:58:48.113290 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
E0912 20:02:48.135610 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
E0913 13:49:02.118757 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
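For reference, a sketch of how to pull these logs from one of the failed clusters and filter for the leader-election errors. The namespace and lease name match what the log output above reports; the cluster name, kubeconfig path, and deployment name are assumptions based on this testbed and the standard operator layout.

# oc --kubeconfig /root/hv-vm/kc/vm00527/kubeconfig -n openshift-machine-api logs deployment/cluster-autoscaler-operator --all-containers | grep -E 'leaderelection|TLS handshake|connection refused'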
- blocks
-
OCPBUGS-20038 Many SNOs failed to complete install because "the cluster operator cluster-autoscaler is not available"
- Closed
- is cloned by
-
OCPBUGS-20038 Many SNOs failed to complete install because "the cluster operator cluster-autoscaler is not available"
- Closed
- links to
-
RHEA-2023:7198 rpm