Overview:
A high level summary that describes the Epic in a clear, concise way. Complete during New status.
Implement certificate hot reload for all ACS services. See ROX-29432 for details regarding why this is needed.
Requirements:
A list of specific needs or objectives that an epic must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Whenever a new network connection is established, ACS services should use the latest TLS certificates available in their respective k8s secrets, instead of caching them on first read as they currently do.
Both the leaf certificates and the CA certificates should be reloaded.
Technical Scope:
High-level list of items that are in scope; usually completed by a staff engineer or a lead from the Feature Delivery Team. Initial completion during Refinement status.
- certificate monitoring: watch for TLS certificate changes (reuse the certwatch package in Go code, possibly reuse the configuration hot-reloading code in Collector)
- modify server components to reload server certificates, private keys, and client CA pools without a restart
- modify client components to reload their client certificates, private keys, and trusted server CA pools without a restart
- connection handling: new connections must use the new TLS certificates (both client and server-side), existing connections should continue uninterrupted
- sidecar pods for watching certificates where needed (secured cluster services that use the init-tls-certs init-container, Postgres pods)
Out of Scope:
High-level list of items that are out of scope. Initial completion during Refinement status.
- certificate rotation / refresh logic
- implementation of short-lived certificates
Outstanding Questions (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
Given the potential complexity of implementing full in-process hot reloading across all ACS services, should we evaluate a simpler alternative or interim solution? A sidecar that detects certificate changes and restarts the main container would most likely take much less effort to implement, but it causes a short downtime whenever certificates are refreshed. This matters little today (leaf certificates are refreshed every 6 months, and internal CAs are rotated every 3 years), but it could become a problem if we move to short-lived mTLS certificates.