[OCPBUGS-52181] Cluster ID is not updated when --force-new-cluster is supplied

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.19
Component/s: Etcd
Labels:
None

Regression:
None
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Release Note Type:
Release Note Not Required
Release Note Status:
In Progress
Target Version:

4.19.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

    When --force-new-cluster is used (in OCP, through quorum-restore), the cluster ID *should* change to reflect the new membership, because the cluster ID is generated from the etcd membership. Instead, the cluster ID is read from the WAL and reused, so the "new" cluster has the same ID as the old one.
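To illustrate why a membership change is expected to produce a different cluster ID, here is a minimal Go sketch. It assumes the ID is a hash over the sorted member IDs (sort, pack as big-endian bytes, SHA-1, keep the first 8 bytes), which is how upstream etcd derives it to the best of my knowledge. The function name `clusterID` and the member IDs are hypothetical, and this is not the actual etcd code.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

// clusterID derives an ID from a set of member IDs: sort them, pack each as
// 8 big-endian bytes, SHA-1 the buffer, and keep the first 8 bytes.
// Assumption: this mirrors the upstream approach; it is an illustration only.
func clusterID(memberIDs []uint64) uint64 {
	ids := append([]uint64(nil), memberIDs...) // copy so the caller's slice is not reordered
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	buf := make([]byte, 8*len(ids))
	for i, id := range ids {
		binary.BigEndian.PutUint64(buf[8*i:], id)
	}
	sum := sha1.Sum(buf)
	return binary.BigEndian.Uint64(sum[:8])
}

func main() {
	before := clusterID([]uint64{0x1, 0x2, 0x3}) // hypothetical three-member cluster
	after := clusterID([]uint64{0x1})            // single member left after a forced restore
	fmt.Printf("before=%x after=%x changed=%t\n", before, after, before != after)
}
```

Under this scheme, shrinking the membership to a single member yields a different ID. The bug is that this recomputation does not happen, because the old ID is taken from the WAL.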

For quorum-restore, the non-restored nodes expect the cluster ID to have changed so that they can automatically move their data directory and rejoin the cluster (https://github.com/openshift/etcd/pull/284/files#diff-3d1414e0d47047e4fdcae958fb7654d082c1eb1b71cf4eb38aeb3190db678208R182). Quorum-restore still works, but by an unintended path: the non-restored members cannot join the restored cluster because their member information does not match it, so they crashloop. After a time, the CEO notices that they are trying to join and helps them rejoin the cluster. The end result is correct, but the path taken is not the intended one.
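The startup decision described above can be sketched roughly as follows. This is a hedged illustration, not the code from the linked PR: the `decide` helper, the `action` type, and the second cluster ID are all hypothetical, and the real check may take more inputs into account.

```go
package main

import "fmt"

// action describes what a non-restored member does at startup.
type action string

const (
	startNormally  action = "start with existing data directory"
	archiveAndJoin action = "move data directory aside and rejoin as a new member"
)

// decide is a hypothetical helper: if a data directory exists but the live
// cluster reports a different cluster ID than the one stored locally, the
// local data belongs to the old cluster and should be moved aside.
func decide(liveClusterID, localClusterID string, dataDirExists bool) action {
	if dataDirExists && liveClusterID != localClusterID {
		return archiveAndJoin
	}
	return startNormally
}

func main() {
	// Observed behavior in this bug: the IDs match, so nothing is moved.
	fmt.Println(decide("18c4c43dc346df9a", "18c4c43dc346df9a", true))
	// Intended behavior: the restored cluster reports a new (here made-up) ID.
	fmt.Println(decide("aaaaaaaaaaaaaaaa", "18c4c43dc346df9a", true))
}
```

With identical IDs the helper never chooses the archive path, which matches the crashloop-until-the-CEO-intervenes behavior reported here.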

Version-Release number of selected component (if applicable):

How reproducible:

    always

Steps to Reproduce:

    1. Start with a standard OpenShift deployment
    2. Unmanage the CEO (https://github.com/openshift/cluster-etcd-operator/blob/main/hack/unmanage.sh)
    3. oc debug into a node
    4. Run /usr/local/bin/quorum-restore.sh
    5. Exit the debug session
    6. Run oc logs on a non-recovery etcd pod
    7. See that the pods are attempting to reconnect and cannot because of a membership mismatch; also note that the live and local cluster IDs are the same

Actual results:

    #### attempt 7
Live Cluster ID: [18c4c43dc346df9a], local: [18c4c43dc346df9a] 
      member={name="", peerURLs=[https://10.0.0.3:2380}, clientURLs=[]
      member={name="ci-ln-wj3h05t-72292-tsc2b-master-0", peerURLs=[https://10.0.0.5:2380}, clientURLs=[https://10.0.0.5:2379]
      member "https://10.0.0.4:2380" not found in member list but dataDir exists, check operator logs for possible scaling problems
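To check for this symptom across many log lines, a small Go helper along these lines could extract the two IDs and compare them. The `parseClusterIDs` function and the regular expression are assumptions based only on the log line shown above; the format may differ between versions.

```go
package main

import (
	"fmt"
	"regexp"
)

// idLine matches the "Live Cluster ID: [..], local: [..]" line shown above.
// Assumption: both IDs are lowercase hex strings.
var idLine = regexp.MustCompile(`Live Cluster ID: \[([0-9a-f]+)\], local: \[([0-9a-f]+)\]`)

// parseClusterIDs returns the live and local cluster IDs found in a log line.
// ok is false when the line does not match the expected format.
func parseClusterIDs(line string) (live, local string, ok bool) {
	m := idLine.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := "Live Cluster ID: [18c4c43dc346df9a], local: [18c4c43dc346df9a] "
	live, local, ok := parseClusterIDs(line)
	fmt.Printf("parsed=%t live=%s local=%s same=%t\n", ok, live, local, live == local)
}
```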

Expected results:

    Automatic reconciliation: the non-restored etcd members move their data directories automatically and join the new cluster.

Additional info:

links to

openshift/etcd#313: OCPBUGS-52181: Ensure cluster id changes during force-new-cluster

There are no comments yet on this issue.

Assignee: Allen Ray

Reporter: Allen Ray

QA Contact: Ge Liu

Votes: 0

Watchers: 4

Created: 2025/02/28 4:29 PM

Updated: 2025/03/28 12:31 PM
