  1. OpenShift Hosted Control Plane
  2. HOSTEDCP-181

incorrect condition status in etcdcluster resource for hosted cluster


    • Type: Bug
    • Resolution: Obsolete
    • Priority: Normal

      This bug was found while testing HOSTEDCP-112.

      Test case record: OCP-42855

      0) Use `hypershift create cluster` to create a hosted cluster, then check the etcd status in the control plane of the hosted cluster. The namespace is clusters-{cluster-name}.

      1) After the etcd pod is deleted, the condition status in the hostedcontrolplane still shows the etcd status as True. The expected value is False.

      2) Confirmed with the hypershift developers: the failure is caused by an incorrect status in the etcdcluster resource. The right place to fix this is the etcd operator.

       

      harry@liuhedeMacBook-Pro openshift % oc get pods -n clusters-example                               
      NAME                                              READY   STATUS             RESTARTS   AGE
      capa-controller-manager-7888cb46bd-s7lsd          1/1     Running            0          144m
      certified-operators-catalog-6d8b854bc4-lsslh      1/1     Running            0          141m
      cluster-api-5b84c5f55f-pntv2                      1/1     Running            0          144m
      cluster-autoscaler-dc987fbf9-vcptb                0/1     CrashLoopBackOff   25         144m
      cluster-policy-controller-bb5ccd9b4-dtddk         1/1     Running            1          144m
      cluster-version-operator-5bcfdf9dff-xjw26         1/1     Running            1          144m
      community-operators-catalog-8fb599ff8-7z2t5       1/1     Running            0          144m
      control-plane-operator-7d8995bb59-72xtz           1/1     Running            0          144m
      etcd-operator-77bb448cd6-hhtgn                    1/1     Running            0          144m
      hosted-cluster-config-operator-6bd9f8f6b5-6h2hv   0/1     CrashLoopBackOff   24         144m
      ignition-server-658d664cd4-tx28h                  1/1     Running            0          144m
      konnectivity-agent-64bf56499d-l9vvk               1/1     Running            0          144m
      konnectivity-server-846bf4785b-qgdmm              1/1     Running            0          144m
      kube-apiserver-f57b885b7-b9znp                    1/2     CrashLoopBackOff   23         104m
      kube-controller-manager-84784486fb-gjg6h          0/1     CrashLoopBackOff   22         133m
      kube-scheduler-7bcbd46b96-hh9wt                   1/1     Running            3          144m
      manifests-bootstrapper                            0/1     Completed          3          144m
      oauth-openshift-569bff6d59-9t44l                  1/1     Running            0          142m
      olm-operator-5c5fd8b476-k6844                     1/1     Running            4          144m
      openshift-apiserver-858db5866d-kvbjr              1/1     Running            0          142m
      openshift-controller-manager-dfd489d78-bbzj5      1/1     Running            1          144m
      openshift-oauth-apiserver-cf48fd997-dqtmr         0/1     CrashLoopBackOff   24         144m
      packageserver-fd6b48fb7-76fls                     1/1     Running            3          144m
      packageserver-fd6b48fb7-p9bhz                     1/1     Running            2          144m
      redhat-marketplace-catalog-6968fc9c6c-tlpcl       1/1     Running            0          144m
      redhat-operators-catalog-665cccdf4f-d6fgr         1/1     Running            0          144m
      

      The output shows that no etcd pod exists anymore, and kube-apiserver is in CrashLoopBackOff as well.
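The check above can be made mechanical. The following is a minimal Python sketch, not part of the original report, that parses the table printed by `oc get pods` and lists etcd member pods and pods in a failed state. It assumes etcd member pods are named with the prefix `etcd-` (the operator log reports cluster-name=etcd), which this ticket does not confirm. The helper names `parse_pods`, `etcd_member_pods`, and `unhealthy_pods` are hypothetical, and `SAMPLE` is a subset of the rows shown above.

```python
# Subset of the `oc get pods -n clusters-example` output from this ticket.
SAMPLE = """\
NAME                                              READY   STATUS             RESTARTS   AGE
control-plane-operator-7d8995bb59-72xtz           1/1     Running            0          144m
etcd-operator-77bb448cd6-hhtgn                    1/1     Running            0          144m
kube-apiserver-f57b885b7-b9znp                    1/2     CrashLoopBackOff   23         104m
manifests-bootstrapper                            0/1     Completed          3          144m
"""


def parse_pods(table):
    """Turn whitespace-separated table output into a list of dicts keyed by header."""
    rows = [line.split() for line in table.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


def etcd_member_pods(pods):
    """Names of etcd member pods. ASSUMPTION: members are named 'etcd-<suffix>'."""
    return [
        p["NAME"]
        for p in pods
        if p["NAME"].startswith("etcd-") and not p["NAME"].startswith("etcd-operator-")
    ]


def unhealthy_pods(pods):
    """Names of pods whose STATUS is neither Running nor Completed."""
    return [p["NAME"] for p in pods if p["STATUS"] not in ("Running", "Completed")]


if __name__ == "__main__":
    pods = parse_pods(SAMPLE)
    print("etcd member pods:", etcd_member_pods(pods))
    print("unhealthy pods:", unhealthy_pods(pods))
```

The parser relies on every column being a single token, which holds for the output in this ticket; output formats that put spaces inside a column would need a different approach.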

       

      In the hostedcontrolplane, the EtcdAvailable condition is True (not expected) and the KubeAPIServerAvailable condition is False (expected):

       

      harry@liuhedeMacBook-Pro openshift % oc describe hostedcontrolplane -n clusters-example            
      Name:         example
      Namespace:    clusters-example
      Labels:       cluster.x-k8s.io/cluster-name=example-x674v
      Annotations:  hypershift.openshift.io/cluster: clusters/example
      API Version:  hypershift.openshift.io/v1alpha1
      Kind:         HostedControlPlane
      Metadata:
        Creation Timestamp:  2021-07-19T01:44:41Z
        Finalizers:
          hypershift.openshift.io/finalizer
        Generation:  1
      
      ...
      
      Spec:
        Dns:
          Base Domain:      qe.devcluster.openshift.com
          Private Zone ID:  Z00373243GZL3D8KJVFEE
          Public Zone ID:   Z3B3KOVA3TRCWP
        Etcd:
          Management Type:  Managed
        Fips:               false
        Infra ID:           example-x674v
        Issuer URL:         https://oidc-example-x674v.apps.heli-0719.qe.devcluster.openshift.com
        Machine CIDR:       10.0.0.0/16
        Network Type:       OpenShiftSDN
        Platform:
          Aws:
            Cloud Provider Config:
              Subnet:
                Id:  subnet-01366704f33296772
              Vpc:   vpc-05c067a42ae9c71dc
              Zone:  us-east-2a
            Kube Cloud Controller Creds:
              Name:  provider-creds
            Node Pool Management Creds:
              Name:  node-provider-creds
            Region:  us-east-2
            Roles:
              Arn:        arn:aws:iam::301721915996:role/example-x674v-openshift-ingress
              Name:       cloud-credentials
              Namespace:  openshift-ingress-operator
              Arn:        arn:aws:iam::301721915996:role/example-x674v-openshift-image-registry
              Name:       installer-cloud-credentials
              Namespace:  openshift-image-registry
              Arn:        arn:aws:iam::301721915996:role/example-x674v-aws-ebs-csi-driver-operator
              Name:       ebs-cloud-credentials
              Namespace:  openshift-cluster-csi-drivers
          Type:           AWS
        Pod CIDR:         10.132.0.0/14
        Pull Secret:
          Name:         pull-secret
        Release Image:  quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64
        Service CIDR:   172.31.0.0/16
        Services:
          Service:  APIServer
          Service Publishing Strategy:
            Type:   LoadBalancer
          Service:  OAuthServer
          Service Publishing Strategy:
            Type:   Route
          Service:  OIDC
          Service Publishing Strategy:
            Type:   Route
          Service:  Konnectivity
          Service Publishing Strategy:
            Type:  LoadBalancer
        Signing Key:
          Name:  signing-key
        Ssh Key:
      Status:
        Conditions:
          Last Transition Time:  2021-07-19T01:44:58Z
          Message:               Configuration passes validation
          Observed Generation:   1
          Reason:                HostedClusterAsExpected
          Status:                True
          Type:                  ValidConfiguration
          Last Transition Time:  2021-07-19T01:45:59Z
          Message:               Etcd cluster is running and available
          Observed Generation:   1
          Reason:                EtcdRunning
          Status:                True
          Type:                  EtcdAvailable
          Last Transition Time:  2021-07-19T02:31:27Z
          Message:               
          Observed Generation:   1
          Reason:                DeploymentStatusUnknown
          Status:                False
          Type:                  KubeAPIServerAvailable
          Last Transition Time:  2021-07-19T02:31:27Z
          Message:               Not all dependent components are available yet
          Observed Generation:   1
          Reason:                ComponentsUnavailable
          Status:                False
          Type:                  Available
          Last Transition Time:  2021-07-19T01:45:03Z
          Message:               
          Observed Generation:   1
          Reason:                AsExpected
          Status:                True
          Type:                  InfrastructureReady
        Control Plane Endpoint:
          Host:                          a4f7b9b9b5c9a4002ba0281167dd0083-110256206.us-east-2.elb.amazonaws.com
          Port:                          6443
        External Managed Control Plane:  true
        Initialized:                     true
        Kube Config:
          Key:                               kubeconfig
          Name:                              admin-kubeconfig
        Last Release Image Transition Time:  2021-07-19T01:44:58Z
        Ready:                               false
        Release Image:                       quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64
        Version:                             4.8.0
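The inconsistency in the conditions above can be expressed as a small check. This is a hedged Python sketch, not the hypershift or etcd operator implementation: it assumes the status has been obtained as JSON (for example from `oc get hostedcontrolplane -n clusters-example -o json`) and that each condition uses the usual `type` and `status` fields. The function names and `SAMPLE_STATUS` are hypothetical; the sample values mirror the describe output above.

```python
import json

# Condition types, statuses, and reasons copied from the describe output in this ticket.
SAMPLE_STATUS = json.loads("""
{"conditions": [
  {"type": "EtcdAvailable", "status": "True", "reason": "EtcdRunning"},
  {"type": "KubeAPIServerAvailable", "status": "False", "reason": "DeploymentStatusUnknown"},
  {"type": "Available", "status": "False", "reason": "ComponentsUnavailable"}
]}
""")


def condition_status(status, cond_type):
    """Return the status string of the named condition, or None if it is absent."""
    for cond in status.get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def etcd_condition_is_stale(status, etcd_pod_count):
    """True when EtcdAvailable reports True although no etcd pod exists."""
    return condition_status(status, "EtcdAvailable") == "True" and etcd_pod_count == 0


if __name__ == "__main__":
    # In the scenario of this ticket the etcd pod was deleted, so the count is 0.
    print("stale EtcdAvailable condition:", etcd_condition_is_stale(SAMPLE_STATUS, 0))
```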

       

      3) In the above test, there was only one etcd pod in the namespace clusters-example. After the etcd pod is deleted manually, why can't the etcd operator recover it automatically?

      Logs of the etcd operator:

       

      time="2021-07-19T04:16:11Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
      time="2021-07-19T04:16:19Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
      time="2021-07-19T04:16:27Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
      time="2021-07-19T04:16:35Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
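To quantify how often the operator repeats this warning, the log lines can be parsed as key=value pairs. The following Python sketch is an illustration only; it assumes the log format shown above (quoted values for fields containing spaces, UTC timestamps ending in `Z`), and the helper names are hypothetical.

```python
import shlex
from datetime import datetime

# The four log lines quoted in this ticket.
SAMPLE_LOG = '''\
time="2021-07-19T04:16:11Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
time="2021-07-19T04:16:19Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
time="2021-07-19T04:16:27Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
time="2021-07-19T04:16:35Z" level=warning msg="all etcd pods are dead." cluster-name=etcd cluster-namespace=clusters-example pkg=cluster
'''


def parse_line(line):
    """Split one log line into a dict; shlex keeps quoted values together."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def dead_pod_warnings(log_text):
    """Return the parsed entries whose message is 'all etcd pods are dead.'"""
    entries = [parse_line(line) for line in log_text.splitlines() if line.strip()]
    return [e for e in entries if e.get("msg") == "all etcd pods are dead."]


def intervals_seconds(entries):
    """Seconds elapsed between consecutive entries, in log order."""
    times = [datetime.strptime(e["time"], "%Y-%m-%dT%H:%M:%SZ") for e in entries]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    warnings = dead_pod_warnings(SAMPLE_LOG)
    print(len(warnings), "warnings, intervals:", intervals_seconds(warnings))
```

In this excerpt the warning repeats at a fixed interval while no etcd pod is recreated, which matches the observation that the operator does not recover the cluster on its own.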
      

       

              Unassigned
              He Liu (rhn-support-heli)
              He Liu