OpenShift Container Platform ships a finely tuned set of alerts that inform the cluster's owner and/or operator of events and bad conditions in the cluster.
Runbooks are associated with alerts and help SREs take action to resolve an alert. They are also a critical way to share engineering best practices following an incident.
Goal 1: Current alerts/runbooks for HyperShift need to be evaluated to ensure we have sufficient coverage before HyperShift hits GA.
Goal 2: Actionable runbooks need to be provided for all alerts; therefore, we should attempt to cover as many as possible in this epic.
Goal 3: Continue adding alerts/runbooks to cover existing OVN-K functionality.
This epic will NOT cover refactors of alerts/runbooks that are needed because of the new architecture (OVN IC).
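The coverage evaluation described in the goals above could be approximated with a small script. The following is a minimal sketch, not an existing tool: it assumes alerting rules are available as dictionaries shaped like the entries of a Prometheus rule group, and that a runbook is linked through a `runbook_url` annotation. The alert names, URL, and function names are hypothetical placeholders.

```python
from typing import Any, Dict, List

# Hypothetical sample shaped like the `rules` entries of a Prometheus rule
# group. Alert names and the URL are placeholders, not real alerts or runbooks.
SAMPLE_RULES: List[Dict[str, Any]] = [
    {
        "alert": "ExampleAlertA",
        "annotations": {
            "summary": "placeholder",
            "runbook_url": "https://example.com/runbooks/ExampleAlertA.md",
        },
    },
    {"alert": "ExampleAlertB", "annotations": {"summary": "placeholder"}},
    # Recording rules have no `alert` key and are skipped.
    {"record": "example:recording_rule", "expr": "placeholder"},
    # A blank URL is treated the same as a missing one.
    {"alert": "ExampleAlertC", "annotations": {"runbook_url": "  "}},
]


def alerts_missing_runbook(rules: List[Dict[str, Any]]) -> List[str]:
    """Return names of alerting rules without a non-empty runbook_url.

    Assumption: the runbook link lives in the `runbook_url` annotation.
    """
    missing = []
    for rule in rules:
        name = rule.get("alert")
        if name is None:
            continue  # not an alerting rule
        url = (rule.get("annotations") or {}).get("runbook_url", "")
        if not url.strip():
            missing.append(name)
    return missing


def runbook_coverage(rules: List[Dict[str, Any]]) -> float:
    """Return the fraction of alerting rules that link a runbook."""
    alert_count = sum(1 for rule in rules if rule.get("alert") is not None)
    if alert_count == 0:
        return 1.0  # nothing to cover
    return 1 - len(alerts_missing_runbook(rules)) / alert_count


if __name__ == "__main__":
    print("Alerts without a runbook:", alerts_missing_runbook(SAMPLE_RULES))
    print("Coverage: {:.0%}".format(runbook_coverage(SAMPLE_RULES)))
```

A check like this only confirms that a link exists. Whether the linked runbook is actually actionable still needs human review.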
In order to scale, we (engineering) must share our institutional knowledge.
In order for SREs to respond to alerts, they must have the knowledge to do so.
SD needs to have actionable runbooks to respond to alerts; otherwise, they will require engineering to engage more frequently.
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>