Bug
Resolution: Unresolved
Priority: Major
Severity: Important
Affects Versions: 4.19, 4.20, 4.21
Quality / Stability / Reliability
Description of problem:
When running the existing quorum-restore.sh on an etcd member that was removed prior to running the script, the recovery etcd crash-loops with a panic like:

Sep 04 15:06:56 master-0 etcd[14061]: panic: removed all voters
Sep 04 15:06:56 master-0 etcd[14061]:
Sep 04 15:06:56 master-0 etcd[14061]: goroutine 88 [running]:
Sep 04 15:06:56 master-0 etcd[14061]: go.etcd.io/etcd/raft/v3.(*raft).applyConfChange(0xc0003f3080, {0x0, {0xc001737880, 0x1, 0x1}, {0x0, 0x0, 0x0}})
Sep 04 15:06:56 master-0 etcd[14061]: go.etcd.io/etcd/raft/v3@v3.5.21/raft.go:1633 +0x1cd
Sep 04 15:06:56 master-0 etcd[14061]: go.etcd.io/etcd/raft/v3.(*node).run(0xc0005b8240)
Sep 04 15:06:56 master-0 etcd[14061]: go.etcd.io/etcd/raft/v3@v3.5.21/node.go:360 +0xafa
Sep 04 15:06:56 master-0 etcd[14061]: created by go.etcd.io/etcd/raft/v3.RestartNode in goroutine 1
Sep 04 15:06:56 master-0 etcd[14061]: go.etcd.io/etcd/raft/v3@v3.5.21/node.go:244 +0x239
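For illustration only, here is a minimal Go sketch (not the regression test linked under "Additional info") that trips the same raft invariant by applying a conf change that removes the only voter of a single-member node. This is roughly what the restored member ends up replaying after it had already been removed from the cluster; the single-node bootstrap below is an assumption for brevity, not a copy of what quorum-restore.sh does.

package main

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
	// Bootstrap a single-voter, in-memory raft node with ID 1.
	storage := raft.NewMemoryStorage()
	cfg := &raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	n := raft.StartNode(cfg, []raft.Peer{{ID: 1}})
	defer n.Stop()

	// Applying a conf change that removes the last remaining voter
	// violates raft's config invariants; the node's run goroutine
	// panics with "removed all voters", the same message as in the
	// journal output above.
	n.ApplyConfChange(raftpb.ConfChange{
		Type:   raftpb.ConfChangeRemoveNode,
		NodeID: 1,
	})
}

Run with go run against an etcd 3.5.x raft module (the trace above is from raft/v3@v3.5.21); it should exit immediately with the same panic message.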
Version-Release number of selected component (if applicable):
Since 4.19, where we introduced quorum-restore.sh.
How reproducible:
Always.
Steps to Reproduce:
1. Create a cluster.
2. Remove one etcd member (see the client sketch after this list).
3. Shut down all other members.
4. Run quorum-restore.sh on the removed member.
5. Observe the panic in the restore container that comes up.
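Step 2 is typically done with etcdctl member remove; as a hedged illustration, the Go client equivalent looks roughly like the sketch below. The endpoint and member name are placeholders, not values taken from this report.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; a real cluster will also need TLS credentials.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the member by name and remove it, like
	// "etcdctl member remove <ID>". "master-2" is a placeholder.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		if m.Name == "master-2" {
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
			fmt.Printf("removed member %s (%x)\n", m.Name, m.ID)
		}
	}
}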
Actual results:
Crash-looping etcd restore container.
Expected results:
No panic, just a working etcd.
Additional info:
This has been reported upstream before: https://github.com/etcd-io/etcd/issues/13848
I have a regression test for etcd 3.6 that reproduces this: https://github.com/tjungblu/etcd/commit/ffd784ae1c862cdd01675a14ff652927776b9ca5
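Possibly useful for triage: etcd also persists membership in its bolt backend, in buckets named "members" and "members_removed", so a member that applied its own removal before shutting down typically keeps a record of that removal in its data dir. The sketch below is a diagnostic assumption, not part of quorum-restore.sh; it reads a copy of the backend (the placeholder path "db" stands for <data-dir>/member/snap/db) and lists any member IDs recorded as removed.

package main

import (
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open a *copy* of the backend read-only; never point this at a live db file.
	db, err := bolt.Open("db", 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("members_removed"))
		if b == nil {
			fmt.Println("no members_removed bucket found")
			return nil
		}
		// Keys are member IDs (hex) recorded as removed from the cluster.
		return b.ForEach(func(k, _ []byte) error {
			fmt.Printf("removed member id: %s\n", k)
			return nil
		})
	})
	if err != nil {
		log.Fatal(err)
	}
}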