- Bug
- Resolution: Done
- Critical
- None
- 4.12
- Critical
- No
- Rejected
- False
Description of problem:
A customer installation of a 4.12.4 cluster failed with a possible race condition. It appears that bootstrapping failed but CPMS was still working.
Version-Release number of selected component (if applicable):
4.12.4
A CPD was identified with no immediate cause determined. I broke glass to the cluster and checked the Cluster Operators; all were up and running.
[dalong@dalong ~]$ oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.12.4    True        False         False      5h38m
baremetal                                  4.12.4    True        False         False      6h20m
cloud-controller-manager                   4.12.4    True        False         False      6h24m
cloud-credential                           4.12.4    True        False         False      6h24m
cluster-autoscaler                         4.12.4    True        False         False      6h20m
config-operator                            4.12.4    True        False         False      6h21m
console                                    4.12.4    True        False         False      5h39m
control-plane-machine-set                  4.12.4    True        False         False      6h18m
csi-snapshot-controller                    4.12.4    True        False         False      6h21m
dns                                        4.12.4    True        False         False      6h20m
etcd                                       4.12.4    True        False         False      5h52m
image-registry                             4.12.4    True        False         False      5h45m
ingress                                    4.12.4    True        False         False      5h45m
insights                                   4.12.4    True        False         False      6h14m
kube-apiserver                             4.12.4    True        False         False      5h41m
kube-controller-manager                    4.12.4    True        False         False      6h18m
kube-scheduler                             4.12.4    True        False         False      6h17m
kube-storage-version-migrator              4.12.4    True        False         False      6h21m
machine-api                                4.12.4    True        False         False      5h46m
machine-approver                           4.12.4    True        False         False      6h20m
machine-config                             4.12.4    True        False         False      6h10m
marketplace                                4.12.4    True        False         False      6h20m
monitoring                                 4.12.4    True        False         False      5h43m
network                                    4.12.4    True        False         False      6h23m
node-tuning                                4.12.4    True        False         False      6h20m
openshift-apiserver                        4.12.4    True        False         False      5h41m
openshift-controller-manager               4.12.4    True        False         False      6h16m
openshift-samples                          4.12.4    True        False         False      5h47m
operator-lifecycle-manager                 4.12.4    True        False         False      6h21m
operator-lifecycle-manager-catalog         4.12.4    True        False         False      6h21m
operator-lifecycle-manager-packageserver   4.12.4    True        False         False      5h49m
service-ca                                 4.12.4    True        False         False      6h21m
storage                                    4.12.4    True        False         False      6h20m
Network-Verifier was also run against the cluster and came back successful.
Checking the machines, everything looked good as well.
[dalong@dalong ~]$ oc get machines -A
NAMESPACE               NAME                                       PHASE     TYPE         REGION      ZONE         AGE
openshift-machine-api   rosa-t7zkl-gslmr-master-1                  Running   m5.2xlarge   eu-west-1   eu-west-1a   6h42m
openshift-machine-api   rosa-t7zkl-gslmr-master-2                  Running   m5.2xlarge   eu-west-1   eu-west-1a   6h42m
openshift-machine-api   rosa-t7zkl-gslmr-master-mqx8c-0            Running   m5.2xlarge   eu-west-1   eu-west-1a   6h36m
openshift-machine-api   rosa-t7zkl-gslmr-worker-eu-west-1a-cl7lf   Running   m5.xlarge    eu-west-1   eu-west-1a   6h37m
openshift-machine-api   rosa-t7zkl-gslmr-worker-eu-west-1a-hll6m   Running   m5.xlarge    eu-west-1   eu-west-1a   6h37m
At the moment bootstrapping failed, the control-plane-machine-set (CPMS) operator was still reconciling, which appears to have caused the failed install.
time="2023-02-28T09:21:13Z" level=info msg="Cluster operator control-plane-machine-set Progressing is True with ExcessReplicas: Waiting for 1 old replica(s) to be removed"
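As a diagnostic sketch (not part of the original report), the CPMS state behind that ExcessReplicas message can be inspected directly; the commands below assume the conventional singleton resource named "cluster" in the openshift-machine-api namespace and a live cluster session:

```shell
# Inspect the ControlPlaneMachineSet singleton (conventionally named "cluster").
oc get controlplanemachineset cluster -n openshift-machine-api

# Compare desired vs. observed replicas to spot an ExcessReplicas situation
# like "Waiting for 1 old replica(s) to be removed" in the log above.
oc get controlplanemachineset cluster -n openshift-machine-api \
  -o jsonpath='{.spec.replicas}{" desired / "}{.status.replicas}{" current"}'
```

These commands require cluster-admin access to a running cluster, so they are offered as a sketch of where to look rather than a reproduction of the customer environment.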
- causes OCPBUGS-10351 "Vertical Scaling: do not trigger inadvertent machine deletion during bootstrap" (Closed)