[OCPBUGS-52181] Cluster ID is not updated when --force-new-cluster is supplied

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.19
    • Etcd
    • None
    • None
    • Rejected
    • False
    • Release Note Not Required
    • In Progress

      Description of problem:

          When etcd is started with --force-new-cluster (in OCP, this happens through quorum-restore), the cluster ID *should* change to reflect the new membership, because the cluster ID is generated from the etcd membership. Instead, the cluster ID is read from the WAL and reused, so the "new" cluster has the same ID as the old one.
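
To illustrate why a membership change is expected to produce a new cluster ID, here is a minimal sketch. It assumes the ID is the first 8 bytes of a SHA-1 hash over the sorted 64-bit member IDs, which is my understanding of upstream etcd and is not stated in this report. The function name `cluster_id` and the member IDs are hypothetical.

```python
import hashlib
import struct


def cluster_id(member_ids):
    """Derive a 64-bit cluster ID from 64-bit member IDs.

    Assumption: SHA-1 over the sorted, big-endian member IDs, truncated
    to the first 8 bytes. This is an illustrative sketch only.
    """
    packed = b"".join(struct.pack(">Q", mid) for mid in sorted(member_ids))
    digest = hashlib.sha1(packed).digest()
    return struct.unpack(">Q", digest[:8])[0]


# Hypothetical member IDs: a three-member cluster before the restore,
# and the single surviving member after --force-new-cluster.
old_members = [0x1111111111111111, 0x2222222222222222, 0x3333333333333333]
new_members = [0x1111111111111111]

old_id = cluster_id(old_members)
new_id = cluster_id(new_members)
print(f"old: {old_id:016x}  new: {new_id:016x}  changed: {old_id != new_id}")
```

Under that assumption, shrinking the membership to one member yields a different ID. The behavior described in this bug is that the ID stored in the WAL is reused instead of being recomputed.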
      
          For quorum-restore, the non-restored nodes rely on the cluster ID having changed: that is their signal to automatically move their data directory so they can rejoin the cluster (https://github.com/openshift/etcd/pull/284/files#diff-3d1414e0d47047e4fdcae958fb7654d082c1eb1b71cf4eb38aeb3190db678208R182). Quorum-restore still ends up working. The non-restored members cannot join the restored cluster because their member information does not match it, so they crashloop for a while. Eventually the CEO (cluster-etcd-operator) notices that they are trying to join and helps them rejoin the cluster. The end result is correct, but the path taken is unintended.
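
The decision a non-restored node faces on startup can be sketched as follows. This is an illustration only, not the code from the linked PR: the function name `rejoin_action`, its parameters, the returned action names, and the order of the checks are all assumptions made for this example.

```python
def rejoin_action(live_cluster_id, local_cluster_id, own_peer_url,
                  live_peer_urls, data_dir_exists):
    """Pick a startup action for an etcd member (hypothetical sketch)."""
    if not data_dir_exists:
        # Nothing on disk, so the node can join as a fresh member.
        return "join"
    if live_cluster_id != local_cluster_id:
        # The live cluster is a different cluster: set the stale data
        # directory aside and rejoin. This is the intended path.
        return "archive-data-dir-and-join"
    if own_peer_url not in live_peer_urls:
        # Same cluster ID, but this node is not a member: nothing tells
        # the node that its data is stale, so it keeps retrying.
        return "retry"
    return "start"


# Values taken from the log in "Actual results": the IDs are identical.
observed = rejoin_action(
    live_cluster_id="18c4c43dc346df9a",
    local_cluster_id="18c4c43dc346df9a",
    own_peer_url="https://10.0.0.4:2380",
    live_peer_urls=["https://10.0.0.3:2380", "https://10.0.0.5:2380"],
    data_dir_exists=True,
)

# Same situation with a hypothetical regenerated live cluster ID.
expected = rejoin_action(
    live_cluster_id="0000000000000001",  # hypothetical value
    local_cluster_id="18c4c43dc346df9a",
    own_peer_url="https://10.0.0.4:2380",
    live_peer_urls=["https://10.0.0.3:2380", "https://10.0.0.5:2380"],
    data_dir_exists=True,
)
print(observed, expected)
```

With identical IDs the sketch falls into the retry branch, which corresponds to the crashloop described above. With a regenerated ID it would take the data-directory move path instead.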

      Version-Release number of selected component (if applicable):

          

      How reproducible:

          always

      Steps to Reproduce:

          1. Start with a standard OpenShift deployment.
          2. Unmanage the CEO (https://github.com/openshift/cluster-etcd-operator/blob/main/hack/unmanage.sh).
          3. oc debug into a node.
          4. Run /usr/local/bin/quorum-restore.sh.
          5. Exit the debug session.
          6. Check the logs (oc logs) of a non-recovery etcd pod.
          7. See that the pods are attempting to reconnect and cannot, due to a membership mismatch. Also note that the live and local cluster IDs are the same.
          

      Actual results:

          #### attempt 7
      Live Cluster ID: [18c4c43dc346df9a], local: [18c4c43dc346df9a] 
            member={name="", peerURLs=[https://10.0.0.3:2380}, clientURLs=[]
            member={name="ci-ln-wj3h05t-72292-tsc2b-master-0", peerURLs=[https://10.0.0.5:2380}, clientURLs=[https://10.0.0.5:2379]
            member "https://10.0.0.4:2380" not found in member list but dataDir exists, check operator logs for possible scaling problems
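
To check a longer log for the same symptom, a small parser along these lines may help. It is a sketch that assumes the log lines keep exactly the format shown above. The function name `summarize` and the regular expressions are mine, not part of any OpenShift tooling.

```python
import re

# Patterns matched against the log format shown in "Actual results".
CLUSTER_ID_RE = re.compile(
    r"Live Cluster ID: \[([0-9a-f]+)\], local: \[([0-9a-f]+)\]"
)
MISSING_MEMBER_RE = re.compile(r'member "([^"]+)" not found in member list')


def summarize(log_text):
    """Extract the cluster IDs and any missing member from log text."""
    summary = {
        "live_id": None,
        "local_id": None,
        "ids_match": None,
        "missing_member": None,
    }
    for line in log_text.splitlines():
        ids = CLUSTER_ID_RE.search(line)
        if ids:
            summary["live_id"], summary["local_id"] = ids.groups()
            summary["ids_match"] = ids.group(1) == ids.group(2)
        missing = MISSING_MEMBER_RE.search(line)
        if missing:
            summary["missing_member"] = missing.group(1)
    return summary


# Two lines copied from the log above.
SAMPLE = "\n".join([
    'Live Cluster ID: [18c4c43dc346df9a], local: [18c4c43dc346df9a]',
    'member "https://10.0.0.4:2380" not found in member list but dataDir exists, '
    'check operator logs for possible scaling problems',
])

result = summarize(SAMPLE)
print(result)
```

A result with `ids_match` set to true together with a non-empty `missing_member` is the combination reported in this bug.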

      Expected results:

          Automatic reconciliation: the non-restored etcd members move their data directories automatically and join the new cluster.

      Additional info:

          


              alray@redhat.com Allen Ray
              alray@redhat.com Allen Ray
              Ge Liu Ge Liu