OpenShift Bugs / OCPBUGS-61366

Panic when running quorum-restore on removed etcd member


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.19, 4.20, 4.21
    • Etcd
    • None
    • Quality / Stability / Reliability
    • False
    • None
    • Important
    • No
    • None
    • None
    • Rejected

      Description of problem:

      When quorum-restore.sh is run on an etcd member that was removed from the cluster before the script ran, the recovery etcd crash-loops with a panic like:
      
      Sep 04 15:06:56 master-0 etcd[14061]: panic: removed all voters
      Sep 04 15:06:56 master-0 etcd[14061]: 
      Sep 04 15:06:56 master-0 etcd[14061]: goroutine 88 [running]:
      Sep 04 15:06:56 master-0 etcd[14061]: go.etcd.io/etcd/raft/v3.(*raft).applyConfChange(0xc0003f3080, {0x0, {0xc001737880, 0x1, 0x1}, {0x0, 0x0, 0x0}})
      Sep 04 15:06:56 master-0 etcd[14061]:         go.etcd.io/etcd/raft/v3@v3.5.21/raft.go:1633 +0x1cd
      Sep 04 15:06:56 master-0 etcd[14061]: go.etcd.io/etcd/raft/v3.(*node).run(0xc0005b8240)
      Sep 04 15:06:56 master-0 etcd[14061]:         go.etcd.io/etcd/raft/v3@v3.5.21/node.go:360 +0xafa
      Sep 04 15:06:56 master-0 etcd[14061]: created by go.etcd.io/etcd/raft/v3.RestartNode in goroutine 1
      Sep 04 15:06:56 master-0 etcd[14061]:         go.etcd.io/etcd/raft/v3@v3.5.21/node.go:244 +0x239

      Version-Release number of selected component (if applicable):

      Since 4.19, where quorum-restore.sh was introduced.

      How reproducible:

      always    

      Steps to Reproduce:

          1. Create a cluster.
          2. Remove an etcd member.
          3. Shut down all other members.
          4. Run quorum-restore.sh on the removed member.
          5. Observe the panic in the restore container as it starts.

      Actual results:

      The etcd restore container crash-loops.

      Expected results:

      no panic, just a working etcd     

      Additional info:

      This has been reported upstream before:
      https://github.com/etcd-io/etcd/issues/13848
      
      I have a regression test for etcd 3.6 that reproduces this:
      https://github.com/tjungblu/etcd/commit/ffd784ae1c862cdd01675a14ff652927776b9ca5
      
      
      

       

              tjungblu@redhat.com Thomas Jungblut
              Ge Liu