Bug | Resolution: Cannot Reproduce | Normal | 4.11.z | Quality / Stability / Reliability | Important | Rejected | PM Sync
Description of problem:
When installing 1000+ SNOs via ACM/MCE using ZTP with GitOps, a small percentage of clusters never complete installation because the openshift-controller-manager cluster operator does not reconcile to Available.
Example:
oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get co openshift-controller-manager
NAME                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
openshift-controller-manager             False       True          False      17h     Available: no daemon pods available on any node.
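To find which clusters in a fleet are stuck in this state, a scan along the following lines can be used. This is only a sketch and not part of the original report; it assumes the per-cluster kubeconfigs live under /root/hv-vm/sno/manifests/<cluster>/kubeconfig, as in the commands shown here.

for kc in /root/hv-vm/sno/manifests/*/kubeconfig; do
  # Hypothetical fleet scan: the directory name is taken as the cluster name.
  cluster=$(basename "$(dirname "$kc")")
  # Read the Available condition of the openshift-controller-manager cluster operator.
  avail=$(oc --kubeconfig="$kc" get co openshift-controller-manager \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
  # Flag any cluster where the operator is not Available=True.
  [ "$avail" != "True" ] && echo "$cluster: openshift-controller-manager Available=$avail"
done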
Version-Release number of selected component (if applicable):
Hub OCP and SNO OCP - 4.11.0
ACM - 2.6.0-DOWNSTREAM-2022-08-11-23-41-09 (FC5)
How reproducible:
- 21 of the 23 failures observed across 1728 installs were caused by this issue
- i.e. ~90% of the install failures are due to this issue
- overall failure rate of ~1.2% of the total installs
Additional info:
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          18h     Unable to apply 4.11.0: the cluster operator openshift-controller-manager has not yet successfully rolled out

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0    True        False         False      10m
baremetal                                  4.11.0    True        False         False      17h
cloud-controller-manager                   4.11.0    True        False         False      17h
cloud-credential                           4.11.0    True        False         False      17h
cluster-autoscaler                         4.11.0    True        False         False      17h
config-operator                            4.11.0    True        False         False      18h
console                                    4.11.0    True        False         False      16h
csi-snapshot-controller                    4.11.0    True        False         False      17h
dns                                        4.11.0    True        False         False      17h
etcd                                       4.11.0    True        False         False      18h
image-registry                             4.11.0    True        False         False      17h
ingress                                    4.11.0    True        False         False      17h
insights                                   4.11.0    True        False         False      51s
kube-apiserver                             4.11.0    True        False         False      17h
kube-controller-manager                    4.11.0    True        False         False      17h
kube-scheduler                             4.11.0    True        False         False      17h
kube-storage-version-migrator              4.11.0    True        False         False      17h
machine-api                                4.11.0    True        False         False      17h
machine-approver                           4.11.0    True        False         False      17h
machine-config                             4.11.0    True        False         False      17h
marketplace                                4.11.0    True        False         False      17h
monitoring                                 4.11.0    True        False         False      17h
network                                    4.11.0    True        False         False      18h
node-tuning                                4.11.0    True        False         False      17h
openshift-apiserver                        4.11.0    True        False         False      11m
openshift-controller-manager                         False       True          False      18h     Available: no daemon pods available on any node.
openshift-samples                          4.11.0    True        False         False      17h
operator-lifecycle-manager                 4.11.0    True        False         False      17h
operator-lifecycle-manager-catalog         4.11.0    True        False         False      17h
operator-lifecycle-manager-packageserver   4.11.0    True        False         False      17h
service-ca                                 4.11.0    True        False         False      18h
storage                                    4.11.0    True        False         False      17h

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get po -n openshift-controller-manager
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-26stq   0/1     Terminating   0          17h

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig describe po -n openshift-controller-manager
Name:                      controller-manager-26stq
Namespace:                 openshift-controller-manager
Priority:                  2000001000
Priority Class Name:       system-node-critical
Node:                      sno00090/fc00:1001::441
Start Time:                Mon, 15 Aug 2022 20:40:49 +0000
Labels:                    app=openshift-controller-manager
                           controller-manager=true
                           controller-revision-hash=5cdd6cbbf4
                           pod-template-generation=3
Annotations:               k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["fd01:0:0:1::26/64"],"mac_address":"0a:58:4c:a7:02:14","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                           k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:1::26" ], "mac": "0a:58:4c:a7:02:14", "default": true, "dns": {} }]
                           k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:1::26" ], "mac": "0a:58:4c:a7:02:14", "default": true, "dns": {} }]
                           openshift.io/scc: restricted-v2
                           operator.openshift.io/force: a45bf93a-5310-494f-9b47-efe5e122e439
                           seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:                    Terminating (lasts 17h)
Termination Grace Period:  30s
IP:                        fd01:0:0:1::26
IPs:
  IP:  fd01:0:0:1::26
Controlled By:  DaemonSet/controller-manager
Containers:
  controller-manager:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fead5da1351abee9e77f5c46247c7947d6f9cf76b5bf7a8b4905222fe2564665
    Image ID:
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      openshift-controller-manager
      start
    Args:
      --config=/var/run/configmaps/config/config.yaml
      -v=2
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  100Mi
    Environment:  <none>
    Mounts:
      /etc/pki/ca-trust/extracted/pem from proxy-ca-bundles (rw)
      /var/run/configmaps/client-ca from client-ca (rw)
      /var/run/configmaps/config from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-d6rhr (ro)
      /var/run/secrets/serving-cert from serving-cert (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      config
    Optional:  false
  client-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      client-ca
    Optional:  false
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  serving-cert
    Optional:    false
  proxy-ca-bundles:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      openshift-global-ca
    Optional:  false
  kube-api-access-d6rhr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 op=Exists
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  81m (x493 over 17h)    kubelet  MountVolume.SetUp failed for volume "proxy-ca-bundles" : object "openshift-controller-manager"/"openshift-global-ca" not registered
  Warning  FailedMount  63m (x502 over 17h)    kubelet  MountVolume.SetUp failed for volume "kube-api-access-d6rhr" : [object "openshift-controller-manager"/"kube-root-ca.crt" not registered, object "openshift-controller-manager"/"openshift-service-ca.crt" not registered]
  Warning  FailedMount  16m (x525 over 17h)    kubelet  MountVolume.SetUp failed for volume "client-ca" : object "openshift-controller-manager"/"client-ca" not registered
  Warning  FailedMount  12m (x527 over 17h)    kubelet  MountVolume.SetUp failed for volume "serving-cert" : object "openshift-controller-manager"/"serving-cert" not registered
  Warning  FailedMount  2m34s (x532 over 17h)  kubelet  MountVolume.SetUp failed for volume "config" : object "openshift-controller-manager"/"config" not registered

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig logs -n openshift-controller-manager controller-manager-26stq
Error from server (BadRequest): container "controller-manager" in pod "controller-manager-26stq" is terminated
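The FailedMount events above show the kubelet's volume manager treating the pod's ConfigMaps and Secrets as "not registered". As a rough cross-check on the node (a suggested addition, not part of the original report), the kubelet journal can be searched for the same messages:

# ssh core@sno00090 sudo journalctl -u kubelet --since "1 hour ago" | grep "not registered" | tail -n 5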
Restarting the kubelet resolves the issue:
# ssh core@sno00090 sudo systemctl restart kubelet
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get clusterversion -w
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False True 18h Unable to apply 4.11.0: the cluster operator openshift-controller-manager has not yet successfully rolled out
version False True 18h Working towards 4.11.0: 801 of 802 done (99% complete)
version 4.11.0 True False 0s Cluster version is 4.11.0
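For scripted recovery, the kubelet restart can be followed by a blocking wait instead of watching the ClusterVersion. The oc wait commands below are a suggested addition (not from the original report) and the 30m timeout is arbitrary:

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig wait --timeout=30m --for=condition=Available clusteroperator/openshift-controller-manager
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig wait --timeout=30m --for=condition=Available clusterversion/version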