-
Story
-
Resolution: Unresolved
-
BU Product Work
-
3
-
OCPSTRAT-539 - Enhance recovery procedure for full control plane failure
-
ETCD Sprint 260, ETCD Sprint 261, ETCD Sprint 262
After quorum-restore, the cluster-etcd-operator (CEO) reports the following failure:
I0917 13:33:55.393852 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"45b360b1-58a8-42a4-8afc-52ee79970c8c", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MemberAddAsLearner' failed to add new member https://10.0.0.3:2380: etcdserver: rpc not supported for learner
This happened after the first static pod rollout completed on the non-recovery host (master-0); master-2 was the recovery host. Here is the member list at that time:
sh-5.1# etcdctl member list -wtable
{"level":"warn","ts":"2024-09-17T13:45:25.115043Z","logger":"etcd-client","caller":"v3@v3.5.14/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017c1e0/10.0.0.3:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"}
+------------------+---------+------------------------------------+-----------------------+-----------------------+------------+
|        ID        | STATUS  |                NAME                |      PEER ADDRS       |     CLIENT ADDRS      | IS LEARNER |
+------------------+---------+------------------------------------+-----------------------+-----------------------+------------+
| 4c63c6fccd98edc2 | started | ci-ln-c5kxl2b-72292-s2pfb-master-0 | https://10.0.0.5:2380 | https://10.0.0.5:2379 |       true |
| c8c42975e5a5a301 | started | ci-ln-c5kxl2b-72292-s2pfb-master-2 | https://10.0.0.4:2380 | https://10.0.0.4:2379 |      false |
+------------------+---------+------------------------------------+-----------------------+-----------------------+------------+

endpoint status:
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.4:2379 | c8c42975e5a5a301 |  3.5.14 |  129 MB |      true |      false |        15 |      70754 |              70754 |        |
| https://10.0.0.5:2379 | 4c63c6fccd98edc2 |  3.5.14 |  129 MB |     false |       true |        15 |      70754 |              70754 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
As the output shows, the learner (master-0) has already caught up: both members report Raft index 70754. The cluster is nevertheless stuck on the promotion call, so master-0 remains a learner.
In general, no calls should ever go to learner members of the cluster.
AC (acceptance criteria):
- the client should not issue calls to learner members
- relates to
-
OCPBUGS-42808 installation bootstrap might cause etcdserver: rpc not supported for learner
- New