Spike
Resolution: Unresolved
Major
Product / Portfolio Work
Time Box: 1 sprint
Adaptable topology is a cluster-topology mode that adjusts cluster control-plane and infrastructure behavior based on the current number of control-plane and worker nodes. The goal of this spike is to update the enhancement proposal with the answers to the following questions.
Research Questions
- If a learner runs on a second control-plane node and the voter fails, can quorum restore promote the learner, or can it only restore members that were previously voters?
- Can cluster-etcd-operator (CEO) enforce etcd safety for 1↔3 node transitions? DualReplica offloads this to Pacemaker. Do we need something similar?
- Can we guarantee both learner promotions happen together (or neither happens)?
- What other mechanisms are needed to prevent unsafe etcd transitions?
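The atomicity question above can be made concrete with majority arithmetic. The following is a minimal sketch, not part of the enhancement proposal: it assumes learners are promoted one at a time, so a 1→3 transition passes through a 2-voter state, and it uses the standard Raft majority rule (quorum = floor(n/2) + 1). The function names are hypothetical and exist only for illustration.

```python
def quorum(voters: int) -> int:
    """Raft majority: the number of voters that must agree."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def failure_tolerance(voters: int) -> int:
    """How many voters can be lost while a majority remains."""
    return voters - quorum(voters)


def promotion_path(start: int, target: int) -> list[tuple[int, int, int]]:
    """Return (voters, quorum, tolerated failures) for each voter count
    visited when learners are promoted one at a time (assumption)."""
    return [(n, quorum(n), failure_tolerance(n)) for n in range(start, target + 1)]


if __name__ == "__main__":
    for voters, q, tol in promotion_path(1, 3):
        print(f"voters={voters} quorum={q} tolerated_failures={tol}")
```

Under these assumptions the intermediate 2-voter state tolerates no failures, and losing either of the two voters stalls writes. That is the window the spike needs to either close (both promotions or neither) or guard with an additional safety mechanism.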
Involved Teams
- Control Plane (cluster-etcd-operator)
Acceptance Criteria
- Determine learner promotion capabilities during voter failures
- Document etcd safety enforcement requirements for 1↔3 node transitions
- Design atomic learner promotion mechanism
- Identify additional safety mechanisms needed
- Update enhancement proposal with findings and recommendations
Additional Context
- Enhancement Proposal: https://github.com/openshift/enhancements/pull/1905
- Critical for preventing etcd data loss during topology transitions
- Related to risk mitigation: "Risk: etcd Data Loss If Transitions Are Not Atomic"
- May require coordination with Control Plane team