OpenShift Etcd / ETCD-673

CEO routing etcd client calls to learner members


    • Type: Story
    • Resolution: Unresolved
    • BU Product Work
    • OCPSTRAT-539 - Enhance recovery procedure for full control plane failure
    • ETCD Sprint 260, ETCD Sprint 261, ETCD Sprint 262

      After quorum-restore, the cluster-etcd-operator (CEO) logs the following failure:

      I0917 13:33:55.393852 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"45b360b1-58a8-42a4-8afc-52ee79970c8c", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MemberAddAsLearner' failed to add new member https://10.0.0.3:2380: etcdserver: rpc not supported for learner

      This happened after the first static pod rollout completed on the non-recovery host (master-0); master-2 was the recovery host. Here is the member list at that time:

      sh-5.1# etcdctl member list -wtable 
      {"level":"warn","ts":"2024-09-17T13:45:25.115043Z","logger":"etcd-client","caller":"v3@v3.5.14/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017c1e0/10.0.0.3:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"}
      +------------------+---------+------------------------------------+-----------------------+-----------------------+------------+
      |        ID        | STATUS  |                NAME                |      PEER ADDRS       |     CLIENT ADDRS      | IS LEARNER |
      +------------------+---------+------------------------------------+-----------------------+-----------------------+------------+
      | 4c63c6fccd98edc2 | started | ci-ln-c5kxl2b-72292-s2pfb-master-0 | https://10.0.0.5:2380 | https://10.0.0.5:2379 |       true |
      | c8c42975e5a5a301 | started | ci-ln-c5kxl2b-72292-s2pfb-master-2 | https://10.0.0.4:2380 | https://10.0.0.4:2379 |      false |
      +------------------+---------+------------------------------------+-----------------------+-----------------------+------------+
      
      
      endpoint status:
      +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      | https://10.0.0.4:2379 | c8c42975e5a5a301 |  3.5.14 |  129 MB |      true |      false |        15 |      70754 |              70754 |        |
      | https://10.0.0.5:2379 | 4c63c6fccd98edc2 |  3.5.14 |  129 MB |     false |       true |        15 |      70754 |              70754 |        |
      +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

      The learner has already caught up (both endpoints report raft index 70754 in term 15), yet the cluster is stuck on the promotion call.

      In general, no calls should ever go to learner members of the cluster. 

      Acceptance criteria (AC):

      • the client should not issue calls to learner members
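
      One way to meet this criterion is to filter learners out of the member list before building the client's endpoint list. The following is only a minimal sketch under assumptions: the `Member` struct is a hypothetical, trimmed-down stand-in for the record returned by an etcd member list call, and `VotingClientEndpoints` is an illustrative name, not an existing CEO function. The sample data is taken from the member list above.

```go
package main

import "fmt"

// Member is a hypothetical, trimmed-down stand-in for an etcd member
// record; it carries only the fields this sketch needs.
type Member struct {
	Name       string
	ClientURLs []string
	IsLearner  bool
}

// VotingClientEndpoints returns the client URLs of all non-learner
// members, so a client built from the result never targets a learner.
func VotingClientEndpoints(members []Member) []string {
	endpoints := []string{}
	for _, m := range members {
		if m.IsLearner {
			// Skip learners: in the report above they answer with
			// "rpc not supported for learner".
			continue
		}
		endpoints = append(endpoints, m.ClientURLs...)
	}
	return endpoints
}

func main() {
	// Member list as reported in this issue.
	members := []Member{
		{Name: "ci-ln-c5kxl2b-72292-s2pfb-master-0", ClientURLs: []string{"https://10.0.0.5:2379"}, IsLearner: true},
		{Name: "ci-ln-c5kxl2b-72292-s2pfb-master-2", ClientURLs: []string{"https://10.0.0.4:2379"}, IsLearner: false},
	}
	fmt.Println(VotingClientEndpoints(members)) // prints [https://10.0.0.4:2379]
}
```

      With the member list from this issue, only master-2's client URL remains, so the promotion call would go to the voting member rather than to the learner. A real fix would also need to refresh the endpoint list whenever membership changes, which this sketch does not cover.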

              alray@redhat.com Allen Ray
              tjungblu@redhat.com Thomas Jungblut