Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: 4.13.z
Affects Version/s: 4.13.z, 4.14.z
Component/s: Etcd
Labels:
- telco
- telco-4.13.z

Severity:
Important
Regression:
No
Story Points:
2
Sprint:
ETCD Sprint 245
sprint_count:
1
Blocked:
False
Blocked Reason:

None
RH Private Keywords:
Target Version:

4.13.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Description of problem:

When deploying 4.14.3 using the ZTP workflow, the managed cluster installation fails.
$ oc get clusterversions.config.openshift.io 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         19h     Error while reconciling 4.13.23: the cluster operator etcd is degraded
 
$ oc get co
[...]
etcd                                       4.13.23   True        True          True       19h     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 which is not fault tolerant: [{Member:ID:6006254407852546352 name:"helix28.lab.eng.tlv2.redhat.com" peerURLs:"https://10.46.55.152:2380" clientURLs:"https://10.46.55.152:2379"  Healthy:true Took:1.781203ms Error:<nil>} {Member:ID:15192601779139335333 name:"helix27.lab.eng.tlv2.redhat.com" peerURLs:"https://10.46.55.151:2380" clientURLs:"https://10.46.55.151:2379"  Healthy:true Took:1.464199ms Error:<nil>}] 
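
The "quorum of 2 which is not fault tolerant" wording above can be illustrated with the usual majority-quorum arithmetic. This is a minimal sketch of one reading of that message, assuming the standard Raft rule that a cluster of n voting members needs floor(n/2) + 1 of them to agree; the function names are hypothetical and are not taken from the etcd operator.

```python
def quorum_size(members: int) -> int:
    """Majority needed among `members` voting members (standard Raft rule)."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Two healthy members (as in the status above) versus the expected three.
    for n in (2, 3):
        print(f"members={n} quorum={quorum_size(n)} "
              f"tolerated_failures={failures_tolerated(n)}")
```

With only two members, both are required for a majority, so losing either one would cost quorum; a third member is needed before a single failure can be tolerated.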

$ oc get pods -A|grep -i etcd
[...]
openshift-etcd                                     etcd-helix26.lab.eng.tlv2.redhat.com                             0/4     Init:CrashLoopBackOff   231 (4m36s ago)   19h


  - containerID: cri-o://38a10ac84aa93ecc0979aed4d1086ca66e6f85dd48d0ae7e0f96c5a4e3af99e1
    image: registry.hlxcl11.lab.eng.tlv2.redhat.com:5000/openshift-release-dev@sha256:056f88ac19e6c50429fb559f532a74a145c29169f5d70a5e4d07154c6bad42d3
    imageID: registry.hlxcl11.lab.eng.tlv2.redhat.com:5000/openshift-release-dev@sha256:056f88ac19e6c50429fb559f532a74a145c29169f5d70a5e4d07154c6bad42d3
    lastState:
      terminated:
        containerID: cri-o://38a10ac84aa93ecc0979aed4d1086ca66e6f85dd48d0ae7e0f96c5a4e3af99e1
        exitCode: 1
        finishedAt: "2023-11-20T20:51:27Z"
        message: |
          /bin/sh: line 4: NODE_helix28_lab_eng_tlv2_redhat_com_ETCD_URL_HOST: not set
        reason: Error
        startedAt: "2023-11-20T20:51:27Z"
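
The variable named in the init-container message appears to be built from a node's hostname. The following is a minimal sketch of that mapping, useful when checking which node's variable is missing. Replacing dots with underscores matches the name in the message above; replacing dashes as well is an assumption, and `etcd_url_host_var` is a hypothetical helper, not the operator's code.

```python
def etcd_url_host_var(node_name: str) -> str:
    """Return the per-node variable name, e.g. NODE_<sanitized>_ETCD_URL_HOST.

    Dots become underscores (as seen in the log message); dashes are also
    replaced here, which is an assumption rather than something the log shows.
    """
    sanitized = node_name.replace(".", "_").replace("-", "_")
    return f"NODE_{sanitized}_ETCD_URL_HOST"


if __name__ == "__main__":
    # Hostnames taken from the outputs in this report.
    for host in (
        "helix26.lab.eng.tlv2.redhat.com",
        "helix27.lab.eng.tlv2.redhat.com",
        "helix28.lab.eng.tlv2.redhat.com",
    ):
        print(etcd_url_host_var(host))
```

Comparing the generated names with the environment of the failing init container shows which node's entry is absent.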

<Below is from managed cluster>
$ journalctl -f -u kubelet
Nov 21 20:05:15 helix26.lab.eng.tlv2.redhat.com bash[5436]: E1121 20:05:15.929674    5436 kuberuntime_container.go:784] failed to remove pod init container "etcd-ensure-env-vars": rpc error: code = Unknown desc = failed to delete container k8s_etcd-ensure-env-vars_etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd_7afd5fa8b405314da3d3ef42aaa794ce_250 in pod sandbox 8d160c9e18c6f2c2c4df7aaf499e2294d09d5300d6018007675cc4cb97f9d111 from index: no such id: '0d36b992c8286f3733a1448313d3ea6160cd041cd5867cbf5e7124c602e6aec4'; Skipping pod "etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd(7afd5fa8b405314da3d3ef42aaa794ce)"

Nov 21 20:05:15 helix26.lab.eng.tlv2.redhat.com bash[5436]: E1121 20:05:15.930164    5436 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd-ensure-env-vars\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd-ensure-env-vars pod=etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd(7afd5fa8b405314da3d3ef42aaa794ce)\"" pod="openshift-etcd/etcd-helix26.lab.eng.tlv2.redhat.com" podUID=7afd5fa8b405314da3d3ef42aaa794ce

Version-Release number of selected component (if applicable):

Hub running OCP 4.14.3 with TALM 4.14.1, ACM, and ArgoCD; deploying an OCP 4.14.3 multi-node managed cluster via the ZTP workflow.

How reproducible:

Always

Steps to Reproduce:

1. ZTP Git repo: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-26-4.14
2. Start the managed cluster installation using the ZTP workflow triggered by vran-far-edge-vran-deployment-job.
3. On the hub cluster: AgentClusterInstall hangs at the "finalizing" stage.
4. On the managed cluster: errors are reported in the etcd pod on one master node.
5. The managed cluster fails to install.

See attached must-gather generated from managed cluster

Actual results:

Expected results:

Additional info:

duplicates: OCPBUGS-23941 Blocker: One of etcd is not running after installation (Closed)

Assignee: Thomas Jungblut

Reporter: Joshua Clark

QA Contact: Joshua Clark

Votes: 0

Watchers: 12

Created: 2023/11/21 7:42 PM

Updated: 2024/11/20 4:12 PM

Resolved: 2023/12/06 2:08 PM
