OpenShift Bugs / OCPBUGS-23522

Multi-Node Managed Cluster Installation Fails with 4.14.3


    • Bug
    • Resolution: Done-Errata
    • Undefined
    • 4.13.z
    • 4.13.z
    • Etcd
    • Important
    • No
    • 2
    • ETCD Sprint 245
    • 1
    • False

      Description of problem:

      When deploying 4.14.3 using the ZTP workflow, the managed cluster installation fails.

      $ oc get clusterversions.config.openshift.io 
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version             False       False         19h     Error while reconciling 4.13.23: the cluster operator etcd is degraded
       
      $ oc get co
      [...]
      etcd                                       4.13.23   True        True          True       19h     EtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 which is not fault tolerant: [{Member:ID:6006254407852546352 name:"helix28.lab.eng.tlv2.redhat.com" peerURLs:"https://10.46.55.152:2380" clientURLs:"https://10.46.55.152:2379"  Healthy:true Took:1.781203ms Error:<nil>} {Member:ID:15192601779139335333 name:"helix27.lab.eng.tlv2.redhat.com" peerURLs:"https://10.46.55.151:2380" clientURLs:"https://10.46.55.151:2379"  Healthy:true Took:1.464199ms Error:<nil>}] 
      
      $ oc get pods -A|grep -i etcd
      [...]
      openshift-etcd                                     etcd-helix26.lab.eng.tlv2.redhat.com                             0/4     Init:CrashLoopBackOff   231 (4m36s ago)   19h
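
The "quorum of 2 which is not fault tolerant" message above is consistent with standard majority-quorum arithmetic: with only two members listed, both must stay healthy. The sketch below illustrates that arithmetic only. The function names are hypothetical and this is not the operator's own check.

```shell
# Majority quorum for a cluster of N voting members: floor(N/2) + 1.
etcd_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while quorum is still reachable.
etcd_fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare one-, two- and three-member clusters.
for n in 1 2 3; do
  printf 'members=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(etcd_quorum "$n")" "$(etcd_fault_tolerance "$n")"
done
```

With two members the tolerated failure count is zero, which is why the cluster is reported as not fault tolerant until the third member (here the etcd pod on helix26) joins.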
      
      
        - containerID: cri-o://38a10ac84aa93ecc0979aed4d1086ca66e6f85dd48d0ae7e0f96c5a4e3af99e1
          image: registry.hlxcl11.lab.eng.tlv2.redhat.com:5000/openshift-release-dev@sha256:056f88ac19e6c50429fb559f532a74a145c29169f5d70a5e4d07154c6bad42d3
          imageID: registry.hlxcl11.lab.eng.tlv2.redhat.com:5000/openshift-release-dev@sha256:056f88ac19e6c50429fb559f532a74a145c29169f5d70a5e4d07154c6bad42d3
          lastState:
            terminated:
              containerID: cri-o://38a10ac84aa93ecc0979aed4d1086ca66e6f85dd48d0ae7e0f96c5a4e3af99e1
              exitCode: 1
              finishedAt: "2023-11-20T20:51:27Z"
              message: |
                /bin/sh: line 4: NODE_helix28_lab_eng_tlv2_redhat_com_ETCD_URL_HOST: not set
              reason: Error
              startedAt: "2023-11-20T20:51:27Z"
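
The variable named in the init container message appears to be derived from the node name. The following is a minimal sketch of that mapping, assuming dots and hyphens in the node name are both replaced with underscores. Only the dot case is confirmed by the message above, and the helper name is hypothetical.

```shell
# Hypothetical helper: builds the variable name the init container reports.
etcd_url_host_var() {
  # Assumption: '.' and '-' in the node name are both mapped to '_'.
  printf 'NODE_%s_ETCD_URL_HOST\n' "$(printf '%s' "$1" | tr '.-' '__')"
}

var_name=$(etcd_url_host_var helix28.lab.eng.tlv2.redhat.com)
echo "$var_name"

# Report whether that variable is present in the current environment.
if printenv "$var_name" >/dev/null; then
  echo "$var_name is set"
else
  echo "$var_name: not set"
fi
```

Running the presence check inside the affected pod's environment, rather than on a workstation, would show which node entries are missing.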
      
      The following is from the managed cluster:
      $ journalctl -f -u kubelet
      Nov 21 20:05:15 helix26.lab.eng.tlv2.redhat.com bash[5436]: E1121 20:05:15.929674    5436 kuberuntime_container.go:784] failed to remove pod init container "etcd-ensure-env-vars": rpc error: code = Unknown desc = failed to delete container k8s_etcd-ensure-env-vars_etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd_7afd5fa8b405314da3d3ef42aaa794ce_250 in pod sandbox 8d160c9e18c6f2c2c4df7aaf499e2294d09d5300d6018007675cc4cb97f9d111 from index: no such id: '0d36b992c8286f3733a1448313d3ea6160cd041cd5867cbf5e7124c602e6aec4'; Skipping pod "etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd(7afd5fa8b405314da3d3ef42aaa794ce)"
      
      Nov 21 20:05:15 helix26.lab.eng.tlv2.redhat.com bash[5436]: E1121 20:05:15.930164    5436 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd-ensure-env-vars\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=etcd-ensure-env-vars pod=etcd-helix26.lab.eng.tlv2.redhat.com_openshift-etcd(7afd5fa8b405314da3d3ef42aaa794ce)\"" pod="openshift-etcd/etcd-helix26.lab.eng.tlv2.redhat.com" podUID=7afd5fa8b405314da3d3ef42aaa794ce
      

      Version-Release number of selected component (if applicable):

      Hub running OCP 4.14.3 with TALM 4.14.1, ACM, and ArgoCD; deploying an OCP 4.14.3 multi-node managed cluster via the ZTP workflow.

      How reproducible:

      Always

      Steps to Reproduce:

      1. ZTP Git repo: http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/kni-qe-26-4.14
      2. Start the managed cluster installation using the ZTP workflow triggered by vran-far-edge-vran-deployment-job.
      3. On the hub cluster: AgentClusterInstall hangs on the "finalizing" stage.
      4. On the managed cluster: errors are reported in the etcd pod on one master node.
      5. The managed cluster fails to install.

      See the attached must-gather generated from the managed cluster.

      Actual results:

       

      Expected results:

       

      Additional info:

       

            Thomas Jungblut (tjungblu@redhat.com)
            Joshua Clark (josclark@redhat.com)
            Joshua Clark