-
Epic
-
Resolution: Done
-
Critical
-
openshift-4.16
-
Automatic recovery from expired server and peer certs
-
Strategic Product Work
-
OCPSTRAT-1103 - [etcd] recovery from expired etcd server and peer certs
-
0% To Do, 0% In Progress, 100% Done
Epic Goal*
Provide a way to automatically recover a cluster with expired etcd server and peer certs
Why is this important? (mandatory)
Currently, the EtcdCertSigner controller, which is part of the cluster-etcd-operator (CEO), renews the aforementioned certificates roughly every 3 years. However, if the cluster is offline for longer than the certificates' validity period, the controller cannot renew them when the cluster restarts, because the operator will not be running at all.
We have scenarios where the customer, partner, or service delivery needs to recover a cluster that is offline, suspended, or shut down and, as part of that process, requires a supported way to force certificate and key rotation or replacement.
See the following doc for more use cases of when such clusters need to be recovered:
https://docs.google.com/document/d/198C4xwi5td_V-yS6w-VtwJtudHONq0tbEmjknfccyR0/edit
The following issues are required to enable emergency certificate rotation:
https://issues.redhat.com/browse/API-1613
https://issues.redhat.com/browse/API-1603
Scenarios (mandatory)
A cluster has etcd serving, peer and serving-metrics certificates that are expired. There should be a way to either trigger certificate rotation or have a process that automatically does the rotation.
This does not cover the expiration of etcd-signer certificates at this time.
That will be covered under https://issues.redhat.com/browse/ETCD-445
Dependencies (internal and external) (mandatory)
While the etcd team will implement the automatic recovery for the etcd certificates, other control-plane operators will be handling their own certificate recovery.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
- Development - etcd team
- Documentation - etcd docs team
- QE - etcd QE
- PX -
- Others -
Acceptance Criteria (optional)
When an OpenShift cluster that has expired etcd server and peer certs is restarted, it is able to regenerate those certs.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
- CI Testing - An e2e test puts a cluster into the expired-certs failure mode and forces it to recover.
- Documentation - Docs explain the cert recovery procedure.
- QE - Test scenarios are written and executed successfully.
- Technical Enablement - Slides are complete (if requested by PLM)
- Engineering Stories Merged
- All associated work items with the Epic are closed
- Epic status should be “Release Pending”
- is related to: OCPSTRAT-1103 [etcd] recovery from expired etcd server and peer certs (Closed)
- relates to: API-1613 ETCD: Automatic renewal of peer, serving and serving-metrics certificates in case of their expiration (New)