OpenShift Bugs / OCPBUGS-20222

EtcdCertSignerController reconciliation failed when upgrading from stable-4.13 to 4.14

    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • None
    • 4.14
    • Node / Kubelet
    • No
    • Rejected
    • False

      Description of problem:

      As the title says: when updating from 4.13 to 4.14 on a Nutanix IPI disconnected cluster, the upgrade failed. The log of the openshift-etcd-operator pod shows the following messages:

      2023-09-02T21:58:56.826959297Z E0902 21:58:56.826925 1 base_controller.go:268] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [

      {Member:ID:5365111498206899750 name:"ci-op-j1vt7ci9-6d143-z86rp-master-2" peerURLs:"https://10.0.133.232:2380" clientURLs:"https://10.0.133.232:2379" Healthy:true Took:1.048481ms Error:<nil>}

      {Member:ID:7970579734833654707 name:"ci-op-j1vt7ci9-6d143-z86rp-master-1" peerURLs:"https://10.0.133.237:2380" clientURLs:"https://10.0.133.237:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.133.237:2379]: context deadline exceeded}

      {Member:ID:16371899399695993943 name:"ci-op-j1vt7ci9-6d143-z86rp-master-0" peerURLs:"https://10.0.133.203:2380" clientURLs:"https://10.0.133.203:2379" Healthy:true Took:810.982µs Error:<nil>}

      ]
      2023-09-02T21:58:56.828429586Z I0902 21:58:56.828400 1 status_controller.go:213] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2023-09-02T21:54:10Z","message":"NodeControllerDegraded: The master nodes not ready: node \"ci-op-j1vt7ci9-6d143-z86rp-master-1\" not ready since 2023-09-02 21:58:15 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [

      {Member:ID:5365111498206899750 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-2\" peerURLs:\"https://10.0.133.232:2380\" clientURLs:\"https://10.0.133.232:2379\" Healthy:true Took:1.048481ms Error:\u003cnil\u003e}

      {Member:ID:7970579734833654707 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-1\" peerURLs:\"https://10.0.133.237:2380\" clientURLs:\"https://10.0.133.237:2379\" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.133.237:2379]: context deadline exceeded}

      {Member:ID:16371899399695993943 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-0\" peerURLs:\"https://10.0.133.203:2380\" clientURLs:\"https://10.0.133.203:2379\" Healthy:true Took:810.982µs Error:\u003cnil\u003e}

      ]\nEtcdMembersDegraded: 2 of 3 members are available, ci-op-j1vt7ci9-6d143-z86rp-master-1 is unhealthy","reason":"AsExpected","status":"False","type":"Degraded"},

      {"lastTransitionTime":"2023-09-02T21:02:02Z","message":"NodeInstallerProgressing: 3 nodes are at revision 9\nEtcdMembersProgressing: No unstarted etcd members found","reason":"AsExpected","status":"False","type":"Progressing"}

      ,

      {"lastTransitionTime":"2023-09-02T19:36:12Z","message":"StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 9\nEtcdMembersAvailable: 2 of 3 members are available, ci-op-j1vt7ci9-6d143-z86rp-master-1 is unhealthy","reason":"AsExpected","status":"True","type":"Available"}

      ,

      {"lastTransitionTime":"2023-09-02T19:34:31Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}

      ,

      {"lastTransitionTime":"2023-09-02T20:56:05Z","message":"UpgradeBackup for 4.13.11 is located at path /etc/kubernetes/cluster-backup/upgrade-backup-4.13.11-2023-09-02_205558 on node \"ci-op-j1vt7ci9-6d143-z86rp-master-0\"","reason":"UpgradeBackupSuccessful","status":"True","type":"RecentBackup"}

      ]}}
      2023-09-02T21:58:56.839027235Z I0902 21:58:56.838968 1 event.go:298] Event(v1.ObjectReference

      {Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"dfd94b8b-31ca-4e4e-bc57-236812f46b94", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}

      ): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: The master nodes not ready: node \"ci-op-j1vt7ci9-6d143-z86rp-master-1\" not ready since 2023-09-02 21:58:15 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [

      {Member:ID:5365111498206899750 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-2\" peerURLs:\"https://10.0.133.232:2380\" clientURLs:\"https://10.0.133.232:2379\" Healthy:true Took:1.065364ms Error:<nil>}

      {Member:ID:7970579734833654707 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-1\" peerURLs:\"https://10.0.133.237:2380\" clientURLs:\"https://10.0.133.237:2379\" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.133.237:2379]: context deadline exceeded}

      {Member:ID:16371899399695993943 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-0\" peerURLs:\"https://10.0.133.203:2380\" clientURLs:\"https://10.0.133.203:2379\" Healthy:true Took:831.307µs Error:<nil>}

      ]\nEtcdMembersDegraded: 2 of 3 members are available, ci-op-j1vt7ci9-6d143-z86rp-master-1 is unhealthy" to "NodeControllerDegraded: The master nodes not ready: node \"ci-op-j1vt7ci9-6d143-z86rp-master-1\" not ready since 2023-09-02 21:58:15 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)\nEtcdEndpointsDegraded: EtcdEndpointsController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [

      {Member:ID:5365111498206899750 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-2\" peerURLs:\"https://10.0.133.232:2380\" clientURLs:\"https://10.0.133.232:2379\" Healthy:true Took:1.048481ms Error:<nil>}

      {Member:ID:7970579734833654707 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-1\" peerURLs:\"https://10.0.133.237:2380\" clientURLs:\"https://10.0.133.237:2379\" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.133.237:2379]: context deadline exceeded}

      {Member:ID:16371899399695993943 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-0\" peerURLs:\"https://10.0.133.203:2380\" clientURLs:\"https://10.0.133.203:2379\" Healthy:true Took:810.982µs Error:<nil>}

      ]\nEtcdMembersDegraded: 2 of 3 members are available, ci-op-j1vt7ci9-6d143-z86rp-master-1 is unhealthy"
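
The "not fault tolerant" wording in these messages follows from majority-quorum arithmetic: with 3 members the quorum is 2, so 2 healthy members leave no room to lose another one. Below is a minimal Python sketch of that arithmetic, assuming a simple majority quorum. The function names are hypothetical, and this is not the operator's own code.

```python
def quorum_size(member_count: int) -> int:
    """Majority quorum: more than half of all configured members."""
    return member_count // 2 + 1


def is_fault_tolerant(member_count: int, healthy_count: int) -> bool:
    """True if quorum would still hold after losing one more healthy member."""
    return healthy_count - 1 >= quorum_size(member_count)


# Values taken from the log above: 3 members, 2 of them healthy.
members, healthy = 3, 2
print(quorum_size(members))                 # quorum needed for 3 members
print(is_fault_tolerant(members, healthy))  # can one more member be lost?
```

With the values from this report, the healthy count equals the quorum exactly, which is the condition the controllers report as unsafe.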

      How reproducible:

      Steps to Reproduce:

      Upgrade from 4.13 to 4.14.

      etcd operator log: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-ec-4.14-upgrade-from-stable-4.13-nutanix-ipi-disconnected-f14/1698049925784276992/artifacts/nutanix-ipi-disconnected-f14/gather-must-gather/artifacts/must-gather/inspect.local.1769751688981941609/namespaces/openshift-etcd-operator/pods/etcd-operator-7c64676ff7-59g89/etcd-operator/etcd-operator/logs/current.log

      and must-gather log: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-ec-4.14-upgrade-from-stable-4.13-nutanix-ipi-disconnected-f14/1698049925784276992/artifacts/nutanix-ipi-disconnected-f14/gather-must-gather/artifacts/must-gather/inspect.local.1769751688981941609/

      Actual results:
      The upgrade failed with the following error:
      2023-09-02T21:57:56.613937866Z E0902 21:57:56.613900 1 base_controller.go:268] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: etcd cluster has quorum of 2 and 2 healthy members which is not fault tolerant: [

      {Member:ID:5365111498206899750 name:"ci-op-j1vt7ci9-6d143-z86rp-master-2" peerURLs:"https://10.0.133.232:2380" clientURLs:"https://10.0.133.232:2379" Healthy:true Took:4.311948ms Error:<nil>}

      {Member:ID:7970579734833654707 name:"ci-op-j1vt7ci9-6d143-z86rp-master-1" peerURLs:"https://10.0.133.237:2380" clientURLs:"https://10.0.133.237:2379" Healthy:false Took: Error:create client failure: failed to make etcd client for endpoints [https://10.0.133.237:2379]: context deadline exceeded}

      {Member:ID:16371899399695993943 name:"ci-op-j1vt7ci9-6d143-z86rp-master-0" peerURLs:"https://10.0.133.203:2380" clientURLs:"https://10.0.133.203:2379" Healthy:true Took:746.64µs Error:<nil>}

      ]
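
When triaging a log like this, it can help to pull out the names of the members reported as unhealthy. The following is a sketch only: the regular expression is an assumption based on the member entries quoted in this report (including the backslash-escaped quotes seen in the status diff), the sample entries are abbreviated (peerURLs and clientURLs omitted), and `unhealthy_members` is a hypothetical helper, not part of any OpenShift tooling.

```python
import re

# Assumed entry shape: name:"..." (or name:\"...\") followed later on the
# same line by Healthy:true or Healthy:false.
MEMBER_RE = re.compile(
    r'name:\\?"(?P<name>[^"\\]+)\\?".*?Healthy:(?P<healthy>true|false)'
)


def unhealthy_members(log_text: str) -> list:
    """Return names of members reported Healthy:false, in order, without repeats."""
    names = []
    for match in MEMBER_RE.finditer(log_text):
        name = match.group("name")
        if match.group("healthy") == "false" and name not in names:
            names.append(name)
    return names


# Abbreviated entries modeled on the log above; the last one uses escaped quotes.
SAMPLE = "\n".join([
    r'{Member:ID:5365111498206899750 name:"ci-op-j1vt7ci9-6d143-z86rp-master-2" Healthy:true Took:4.311948ms Error:<nil>}',
    r'{Member:ID:7970579734833654707 name:"ci-op-j1vt7ci9-6d143-z86rp-master-1" Healthy:false Took: Error:create client failure}',
    r'{Member:ID:7970579734833654707 name:\"ci-op-j1vt7ci9-6d143-z86rp-master-1\" Healthy:false Took: Error:create client failure}',
])

print(unhealthy_members(SAMPLE))
```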

      Expected results:
      The upgrade completes successfully.

            aos-node@redhat.com Node Team Bot Account
            geliu ge liu
            ge liu ge liu