Type: Story
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Labels:
None

Work Type:
BU Product Work
Story Points:
3
Blocked:
False
Blocked Reason:
None
Ready:
False
Epic Link:
Disaster Recovery Automation
Feature Link:
OCPSTRAT-539 - Enhance recovery procedure for full control plane failure
Sprint:
ETCD Sprint 260, ETCD Sprint 261, ETCD Sprint 262

After quorum-restore, the cluster-etcd-operator (CEO) reports the following failure:

I0917 13:33:55.393852 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"45b360b1-58a8-42a4-8afc-52ee79970c8c", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MemberAddAsLearner' failed to add new member https://10.0.0.3:2380: etcdserver: rpc not supported for learner

This happened after the first static pod rollout completed on the non-recovery host (master-0); master-2 was the recovery host. Here is the member list at that time:

sh-5.1# etcdctl member list -wtable 
{"level":"warn","ts":"2024-09-17T13:45:25.115043Z","logger":"etcd-client","caller":"v3@v3.5.14/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00017c1e0/10.0.0.3:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: rpc not supported for learner"}
+------------------+---------+------------------------------------+-----------------------+-----------------------+------------+
|        ID        | STATUS  |                NAME                |      PEER ADDRS       |     CLIENT ADDRS      | IS LEARNER |
+------------------+---------+------------------------------------+-----------------------+-----------------------+------------+
| 4c63c6fccd98edc2 | started | ci-ln-c5kxl2b-72292-s2pfb-master-0 | https://10.0.0.5:2380 | https://10.0.0.5:2379 |       true |
| c8c42975e5a5a301 | started | ci-ln-c5kxl2b-72292-s2pfb-master-2 | https://10.0.0.4:2380 | https://10.0.0.4:2379 |      false |
+------------------+---------+------------------------------------+-----------------------+-----------------------+------------+


endpoint status:
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.0.4:2379 | c8c42975e5a5a301 |  3.5.14 |  129 MB |      true |      false |        15 |      70754 |              70754 |        |
| https://10.0.0.5:2379 | 4c63c6fccd98edc2 |  3.5.14 |  129 MB |     false |       true |        15 |      70754 |              70754 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

As the endpoint status shows, the learner has already caught up (both members are at Raft index 70754), yet the cluster is stuck on the promotion call.
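
The "caught up" observation can be checked mechanically by comparing Raft indexes from the endpoint status output. The following is a minimal sketch, not the operator's code: the `EndpointStatus` type, the `caught_up_learners` function, and the `max_lag` tolerance are hypothetical names introduced for illustration, and the sketch does not reproduce etcd's own promotion readiness rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStatus:
    """One row of `etcdctl endpoint status` (only the fields used here)."""
    endpoint: str
    is_leader: bool
    is_learner: bool
    raft_index: int


def caught_up_learners(statuses, max_lag=0):
    """Return endpoints of learners whose Raft index is within max_lag of the leader's.

    max_lag is a hypothetical tolerance for this sketch, not an etcd setting.
    """
    leaders = [s for s in statuses if s.is_leader]
    if len(leaders) != 1:
        raise ValueError("expected exactly one leader in the status list")
    leader_index = leaders[0].raft_index
    return [
        s.endpoint
        for s in statuses
        if s.is_learner and leader_index - s.raft_index <= max_lag
    ]


# Values copied from the endpoint status table above.
statuses = [
    EndpointStatus("https://10.0.0.4:2379", True, False, 70754),
    EndpointStatus("https://10.0.0.5:2379", False, True, 70754),
]
print(caught_up_learners(statuses))  # ['https://10.0.0.5:2379']
```

With the values from this ticket, the learner on master-0 is reported as caught up, which matches the expectation that it should be promotable.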

In general, no calls should ever go to learner members of the cluster.

AC:

The client should not issue calls to learner members.
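
One way to picture the acceptance criterion is an endpoint filter applied to the member list before any client call is made. This is only an illustrative sketch under stated assumptions: the `Member` type and `voting_client_endpoints` function are hypothetical, and the real fix would live in the operator's Go client code rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One row of `etcdctl member list` (only the fields used here)."""
    name: str
    client_url: str
    is_learner: bool


def voting_client_endpoints(members):
    """Return client URLs of voting members only, skipping learners."""
    return [m.client_url for m in members if not m.is_learner]


# Values copied from the member list in this ticket.
members = [
    Member("ci-ln-c5kxl2b-72292-s2pfb-master-0", "https://10.0.0.5:2379", True),
    Member("ci-ln-c5kxl2b-72292-s2pfb-master-2", "https://10.0.0.4:2379", False),
]
print(voting_client_endpoints(members))  # ['https://10.0.0.4:2379']
```

Applied to the member list above, only the recovery host (master-2) remains as a valid target, so no request would reach the learner on master-0.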

log.txt
2.86 MB
2024/09/17 2:46 PM

relates to

OCPBUGS-42808 installation bootstrap might cause etcdserver: rpc not supported for learner

Assignee: Allen Ray

Reporter: Thomas Jungblut

Votes: 0

Watchers: 1

Created: 2024/09/17 1:41 PM

Updated: 2024/11/12 2:55 PM
