An Elasticsearch cluster can be difficult to administer alongside an OCP cluster. We therefore want to provide comprehensive alerts, each paired with easy steps for further troubleshooting and resolution, so that a user's cluster can be kept in a healthy state.
- Improved alerting
- Appropriate severity set for different alerts
- Quick copy/paste commands to run
- Clear explanations and ways to resolve errors based on other copy/paste commands
- Automatically taking steps based on alerting
Improving the means to troubleshoot a cluster should reduce the number of bugs and requests that stem from an unhealthy cluster. This would free up our time for actual bugs and new features, rather than troubleshooting clusters.
Dedicate resources to train for hosted clusters and produce documentation for customers to troubleshoot.
- Verify EO provides alerts with severity based on how critical the action is for the cluster to stay healthy (think: would this warrant getting paged in the middle of the night?)
- Verify that the alerts properly link to playbooks/guides using the runbook_url
- Verify that the steps outlined in the playbooks are clear and concise (provide copy/paste commands), and that they work.
- This will require continuous iteration as we find alerts that are either too relaxed or too strict
- We should prioritize Critical alerts being easy to resolve
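The severity and `runbook_url` checks above could be partly automated. The following is a minimal sketch, not EO's actual tooling: it assumes alert rules have been loaded into Prometheus-style dictionaries with `labels` and `annotations`, and that `runbook_url` is stored as an annotation. The function names, the allowed severity set, and the example rules and URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Assumed severity vocabulary; the proposal itself only names critical and warning.
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def lint_alert_rule(rule):
    """Return a list of problems found in one Prometheus-style alerting rule dict."""
    problems = []
    name = rule.get("alert", "<unnamed>")

    # Every alert should carry a recognized severity label.
    severity = (rule.get("labels") or {}).get("severity")
    if severity not in ALLOWED_SEVERITIES:
        problems.append(f"{name}: missing or unknown severity {severity!r}")

    # Every alert should link to a playbook through an http(s) runbook_url.
    runbook = (rule.get("annotations") or {}).get("runbook_url", "")
    parsed = urlparse(runbook)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"{name}: runbook_url is missing or not an http(s) URL")

    return problems


def lint_rules(rules):
    """Lint every rule and return all problems in input order."""
    return [problem for rule in rules for problem in lint_alert_rule(rule)]


# Hypothetical rules for illustration only; the names and URLs are made up.
EXAMPLE_RULES = [
    {
        "alert": "ExampleClusterRed",
        "labels": {"severity": "critical"},
        "annotations": {"runbook_url": "https://example.com/runbooks/cluster-red"},
    },
    {
        "alert": "ExampleDiskLow",
        "labels": {},
        "annotations": {},
    },
]

if __name__ == "__main__":
    for problem in lint_rules(EXAMPLE_RULES):
        print(problem)
```

A check like this only confirms that the fields are present and well formed. Whether the linked playbook steps are clear and actually work still needs manual review.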
Currently we have a location for troubleshooting in the EO repo, but we need an official docs location as well. (We may need different playbook URL locations for upstream vs. downstream.)
Do our dashboards provide a good sense of cluster health, and can they supplement the alerts? (Are customers using them?)
Guiding questions to determine Operator reaching Level 4
- Does your Operator expose a health metrics endpoint?
- Does your Operator expose Operand alerts?
- Do you have Standard Operating Procedures (SOPs) for each alert?
- Does your Operator create critical alerts when the service is down and warning alerts for all other alerts?
- Does your Operator watch the Operand to create alerts?
- Does your Operator emit custom Kubernetes events?
- Does your Operator expose Operand performance metrics?