  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-2939

Automate and simplify Day-2 control plane node replacement operation


      Feature Overview

      This feature aims to automate and simplify the day-2 replacement of control plane (master) nodes for clusters deployed via the Agent-Based Installer (ABI) and Agent Installer for OVE on Bare Metal. Currently, replacing a failed control plane node in these environments is a manual, multi-step process. This enhancement will provide a streamlined, automated workflow within the Agent service to restore cluster quorum and etcd health without requiring manual intervention in the underlying hardware management layer.

      Goals

      • Automated Quorum Recovery: Provide an automated mechanism to remove a failed etcd member and join a new control plane node using the Agent-Based Installer and Agent Installer for OVE workflows.
      • Persona: Cluster Administrators who need to maintain high availability (HA) in disconnected or edge environments with minimal operational overhead.

      Requirements

      • Functional:
        • The Agent service must be able to detect or be notified of a failed control plane node and initiate a replacement sequence.
        • Automated removal of the unhealthy etcd member from the cluster.
        • Generation of a new discovery ISO or ignition configuration specifically tailored for the replacement node to join the existing control plane.
        • Support for "In-place" replacement (using the same IP/hostname) and "New-node" replacement (different IP/hostname).
      • Non-Functional:
        • Reliability: The process must ensure that etcd quorum is never lost during the replacement of a single node.
        • Usability: The workflow should be integrated into existing oc CLI or Assisted Console patterns to ensure consistency across installation types.
        • Security: Ensure that new nodes joining the control plane are authenticated via standard CSR (Certificate Signing Request) workflows.
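
      The reliability requirement above can be checked with simple majority arithmetic. The following is a minimal sketch, not the Agent service's implementation: it assumes etcd's standard majority rule (quorum = floor(n/2) + 1) and a hypothetical two-step sequence in which the failed member is removed first and the replacement is added afterwards, not yet healthy. The function names are illustrative only.

```python
def quorum_size(members: int) -> int:
    """Majority needed for an etcd cluster with `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members form a majority."""
    return healthy >= quorum_size(members)


def replacement_keeps_quorum(members: int, healthy: int) -> bool:
    """Check quorum before, during, and after replacing ONE failed member.

    Assumed (hypothetical) sequence: remove the failed member, then add
    the replacement, which does not count as healthy until it has synced.
    """
    if healthy >= members:
        raise ValueError("no failed member to replace")
    if not has_quorum(members, healthy):
        return False  # quorum already lost; out of scope for this feature
    after_remove = members - 1  # failed member removed
    if not has_quorum(after_remove, healthy):
        return False
    after_add = after_remove + 1  # replacement joined, still unhealthy
    return has_quorum(after_add, healthy)


if __name__ == "__main__":
    print(replacement_keeps_quorum(3, 2))  # True: one of three nodes failed
    print(replacement_keeps_quorum(3, 1))  # False: quorum is already lost
```

      Under these assumptions, a 3-node cluster with one failed member keeps a majority at every step, which is the single-node case the requirement covers.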

      Use Case: Problem Statement

      Scenario: An organization operates a 3-node bare metal cluster at a remote edge site deployed via ABI or Agent Installer for OVE. One physical master node suffers a hardware failure.

      • The Problem: The administrator does not have BMC/IPMI access to the server to trigger a remote reinstall. Current documentation is tailored for IPI/automated-provisioning or requires complex manual etcd commands that are prone to human error.
      • The Impact: The cluster remains in a degraded state with no clear, supported path to restore the third control plane node using the Agent-based toolset, risking total cluster loss if a second node fails.
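
      The risk described in the impact statement can be made concrete. As a hedged illustration, assuming etcd's majority rule, the sketch below computes how many further failures a cluster can absorb; `remaining_failure_budget` is a hypothetical helper name, not part of any OpenShift tool.

```python
def remaining_failure_budget(members: int, healthy: int) -> int:
    """Further member failures the cluster can absorb before losing quorum.

    A negative result means quorum is already lost.
    """
    quorum = members // 2 + 1  # majority of the configured members
    return healthy - quorum


if __name__ == "__main__":
    print(remaining_failure_budget(3, 3))  # 1: a healthy 3-node cluster
    print(remaining_failure_budget(3, 2))  # 0: the degraded state in the scenario
    print(remaining_failure_budget(3, 1))  # -1: a second failure loses quorum
```

      With one of three control plane nodes down, the budget is zero, so any additional failure stops etcd from committing writes until a member is restored.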

      User Scenario:

      "As a Cluster Administrator, I want to automate the replacement of a failed master node via the Agent-Based Installer and Agent Installer for OVE so that I can restore my cluster's health and etcd quorum."

      Out of Scope

      • Automated hardware provisioning (e.g., PXE server configuration) for the replacement physical server.
      • Replacement of multiple control plane nodes simultaneously (recovery from total quorum loss).
      • Automated replacement for Installer-Provisioned Infrastructure (IPI) clusters, which already use the Machine API for this purpose.

      Questions to Answer

      •  

      Links

      • OCPSTRAT-2514 - Control Plane Node Replacement Procedure
      • OpenShift 4.20 Documentation: Replacing an unhealthy etcd member (https://docs.openshift.com/container-platform/4.17/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html)

              mzasepa Michal Zasepa
              Avani Bhatt Avani Bhatt