- Epic
- Resolution: Unresolved
- Normal
- Expected outage window for API server responses
- To Do
Enable graceful API server responses during planned maintenance by writing expected downtime information to disk. When the cluster enters a critical section with a known API server outage, it writes an outage record to each control-plane host. The API server reads this record and returns HTTP 503 (Service Unavailable) with a Retry-After header, telling clients when to retry.
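The read path described above can be sketched as follows. This is a minimal illustration of the decision logic only, written in Python for brevity; it is not the API server's actual code. The JSON layout and the field names `start_unix` and `expected_end_unix` are assumptions made for this example, since the record format is still to be designed under this epic.

```python
import json
import math
import time
from pathlib import Path
from typing import Dict, Optional, Tuple


def retry_after_seconds(record_path: Path, now: Optional[float] = None) -> Optional[int]:
    """Return whole seconds until the recorded outage ends, or None if no outage is active."""
    now = time.time() if now is None else now
    try:
        record = json.loads(record_path.read_text())
        # Hypothetical field names; the real record format is not yet defined.
        start = float(record["start_unix"])
        end = float(record["expected_end_unix"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing or malformed record: behave as if no outage is tracked.
        return None
    if not (start <= now < end):
        return None
    # Retry-After takes a non-negative integer number of seconds; round up
    # so clients do not retry before the window closes.
    return max(1, math.ceil(end - now))


def outage_response(record_path: Path, now: Optional[float] = None) -> Optional[Tuple[int, Dict[str, str]]]:
    """Return (status, headers) for a tracked outage, or None to serve the request normally."""
    seconds = retry_after_seconds(record_path, now)
    if seconds is None:
        return None
    return 503, {"Retry-After": str(seconds)}
```

In this sketch an expired or unreadable record is ignored, so a stale file left on disk cannot cause 503 responses indefinitely. Whether the real implementation should fail open in this way is a design decision for the epic.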
Acceptance Criteria
- API server returns 503 with a Retry-After header during tracked outage windows
- Outage records are successfully written during SNO upgrades, SNO node reboots, and TNF fencing-based quorum recovery
- Tests verify that all three critical scenarios work correctly
Scope
In Scope
- Design and implementation of the on-disk outage record format
- Syncing mechanism to write outage records to control-plane hosts
- API server changes to read outage records and return 503 with a Retry-After header
- Tracking for SNO upgrades
- Tracking for SNO node reboots
- Tracking for TNF fencing-based quorum recovery (written by pacemaker)
- Testing to verify all three scenarios
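As a starting point for the record format and the write path listed above, the following sketch writes a record to disk atomically. Everything named here is hypothetical: the `OutageRecord` fields, the JSON encoding, and the temp-file-plus-rename approach are assumptions for illustration, not the agreed design.

```python
import contextlib
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class OutageRecord:
    # Hypothetical fields; the real on-disk format is to be designed in this epic.
    reason: str               # e.g. "sno-upgrade", "sno-reboot", "tnf-quorum-recovery"
    start_unix: float         # when the outage window begins (seconds since epoch)
    expected_end_unix: float  # when the API server is expected to be back


def write_outage_record(path: str, record: OutageRecord) -> None:
    """Write the record so that readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".outage-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(asdict(record), tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush file contents to disk before the rename
        os.replace(tmp_path, path)  # atomic rename over any existing record
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

Writing to a temporary file and renaming it means a reader never observes a partially written record, which matters if the host reboots partway through the write. A writer such as pacemaker would need an equivalent guarantee in whatever mechanism it uses.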
Out of Scope
- Client-side retry logic implementation
- Monitoring/alerting for outage windows
- Support for other cluster topologies beyond SNO/TNF
- Support for other critical sections not listed above
Timeline
- Target: 2026