-
Bug
-
Resolution: Done-Errata
-
Undefined
-
4.13.z, 4.14.z
Description of problem:
Deploying 4.14.3 using ZTP workflow, managed cluster installation fails.

$ oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False False 19h Error while reconciling 4.13.23: the cluster operator etcd is degraded

$ oc get co
[...]
etcd 4.13.23 True True True 19h EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 which is not fault tolerant: [{Member:ID:6006254407852546352 name:"helix28.lab.eng.tlv2.redhat.com" peerURLs:"https://10.46.55.152:2380" clientURLs:"https://10.46.55.152:2379" Healthy:true Took:1.781203ms Error:<nil>} {Member:ID:15192601779139335333 name:"helix27.lab.eng.tlv2.redhat.com" peerURLs:"https://10.46.55.151:2380" clientURLs:"https://10.46.55.151:2379" Healthy:true Took:1.464199ms Error:<nil>}]

$ oc get pods -A|grep -i etcd
[...]
openshift-etcd etcd-helix26.lab.eng.tlv2.redhat.com 0/4 Init:CrashLoopBackOff 231 (4m36s ago) 19h

- containerID: cri-o://38a10ac84aa93ecc0979aed4d1086ca66e6f85dd48d0ae7e0f96c5a4e3af99e1
  image: registry.hlxcl11.lab.eng.tlv2.redhat.com:5000/openshift-release-dev@sha256:056f88ac19e6c50429fb559f532a74a145c29169f5d70a5e4d07154c6bad42d3
  imageID: registry.hlxcl11.lab.eng.tlv2.redhat.com:5000/openshift-release-dev@sha256:056f88ac19e6c50429fb559f532a74a145c29169f5d70a5e4d07154c6bad42d3
  lastState:
    terminated:
      containerID: cri-o://38a10ac84aa93ecc0979aed4d1086ca66e6f85dd48d0ae7e0f96c5a4e3af99e1
      exitCode: 1
      finishedAt: "2023-11-20T20:51:27Z"
      message: |
        /bin/sh: line 4: NODE_helix28_lab_eng_tlv2_redhat_com_ETCD_URL_HOST: not set
      reason: Error
      startedAt: "2023-11-20T20:51:27Z"

<Below is from managed cluster>
$ journalctl -f -u kubelet
Nov 21 20:05:15 helix26.lab.eng.tlv2.redhat.com bash[5436]: E1121 20:05:15.929674 5436 kuberuntime_container.go:784] failed to remove pod init container "etcd-ensure-env-vars": rpc error: code = Unknown desc = failed to delete container k8s_etcd-ensure-env-vars_etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd_7afd5fa8b405314da3d3ef42aaa794ce_250 in pod sandbox 8d160c9e18c6f2c2c4df7aaf499e2294d09d5300d6018007675cc4cb97f9d111 from index: no such id: '0d36b992c8286f3733a1448313d3ea6160cd041cd5867cbf5e7124c602e6aec4'; Skipping pod "etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd(7afd5fa8b405314da3d3ef42aaa794ce)"
Nov 21 20:05:15 helix26.lab.eng.tlv2.redhat.com bash[5436]: E1121 20:05:15.930164 5436 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd-ensure-env-vars\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd-ensure-env-vars pod=etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd(7afd5fa8b405314da3d3ef42aaa794ce)\"" pod="openshift-etcd/etcd-helix26.lab.eng.tlv2.redhat.com" podUID=7afd5fa8b405314da3d3ef42aaa794ce
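
Note on the missing variable: the terminated container's message names a variable that appears to be built from a node hostname (helix28) with the dots replaced by underscores. The sketch below only reproduces that naming so the expected variable can be searched for in the etcd pod spec or environment. The helper name etcd_url_host_var is hypothetical, and the substitution rule (every non-alphanumeric character becomes an underscore) is an assumption inferred from the single message above, not taken from the etcd operator source.

```shell
# Hypothetical helper: builds the variable name seen in the error message.
# Assumption: each non-alphanumeric character in the node name maps to "_".
etcd_url_host_var() {
  # printf '%s' avoids a trailing newline, which tr would otherwise turn into "_"
  printf 'NODE_%s_ETCD_URL_HOST\n' "$(printf '%s' "$1" | tr -c 'A-Za-z0-9' '_')"
}

etcd_url_host_var helix28.lab.eng.tlv2.redhat.com
```

For helix28.lab.eng.tlv2.redhat.com this yields the same name that the init container reports as not set.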
Version-Release number of selected component (if applicable):
Hub running OCP 4.14.3 with TALM 4.14.1, ACM, and ArgoCD; deploying an OCP 4.14.3 multi-node managed cluster via the ZTP workflow.
How reproducible:
Always
Steps to Reproduce:
1. ZTP Git repo: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-26-4.14
2. Start managed cluster installation using the ZTP workflow, triggered by vran-far-edge-vran-deployment-job.
3. On hub cluster: AgentClusterInstall hangs on "finalizing" stage.
4. On managed cluster: errors reported in etcd pod on one master node.
5. Managed cluster fails to install.

See attached must-gather generated from managed cluster.
Actual results:
Expected results:
Additional info:
- duplicates
-
OCPBUGS-23941 Blocker: One of etcd is not running after installation
- Closed