- Task
- Resolution: Done
- Normal
- Logging 5.2
- devex docs #207 Sep 9-Sep 30, devex docs #208 Sep 30-Oct 21
- 3
- Documentation (Ref Guide, User Guide, etc.)
- Engineering project marker - not a doc task.
Goals
An Elasticsearch cluster can be difficult to administer in conjunction with an OCP cluster. As such, we want to provide comprehensive alerts paired with easy steps to troubleshoot and resolve the underlying problems so that a user's cluster can be kept in a healthy state (see the example rule sketched after the list below).
- Improved alerting
- Appropriate severity set for different alerts
- Quick copy/paste commands to run
- Clear explanations and resolution steps backed by additional copy/paste commands
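As a rough illustration of these goals, here is a minimal sketch of an alert rule built with the prometheus-operator Go API, carrying a severity label and a runbook_url annotation that points at a playbook of copy/paste commands. The rule name, metric, threshold, and URL below are hypothetical placeholders, not the operator's actual rules, and exact field types can differ between prometheus-operator versions.

```go
package alerts

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newClusterHealthRule sketches a PrometheusRule whose alert carries a
// severity label and a runbook_url annotation, so a firing alert links
// directly to a playbook. Expression, threshold, and URL are placeholders.
func newClusterHealthRule(namespace string) *monitoringv1.PrometheusRule {
	return &monitoringv1.PrometheusRule{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "elasticsearch-cluster-health", // hypothetical name
			Namespace: namespace,
		},
		Spec: monitoringv1.PrometheusRuleSpec{
			Groups: []monitoringv1.RuleGroup{{
				Name: "elasticsearch.cluster.health",
				Rules: []monitoringv1.Rule{{
					Alert: "ElasticsearchClusterNotHealthy", // hypothetical alert name
					// Placeholder expression: fires when the cluster reports red status.
					Expr: intstr.FromString(`es_cluster_status == 2`),
					Labels: map[string]string{
						// "critical" means "worth being paged for in the middle of the night".
						"severity": "critical",
					},
					Annotations: map[string]string{
						"summary": "Elasticsearch cluster health is red",
						// runbook_url points the on-call engineer at the playbook
						// containing the copy/paste troubleshooting commands.
						"runbook_url": "https://example.com/playbooks/cluster-health.md",
					},
				}},
			}},
		},
	}
}
```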
Non-Goals
- Automatically taking steps based on alerting
Motivation
Improving the means to troubleshoot a cluster should reduce the number of bugs/requests that come in as a result of a cluster that is not healthy. This would free up our time for actual bugs and new features, rather than troubleshooting clusters.
Alternatives
Dedicate resources to training for hosted clusters and produce troubleshooting documentation for customers.
Acceptance Criteria
- Verify that the EO provides alerts whose severity reflects how critical the action is for keeping the cluster healthy (think: would you want to be paged in the middle of the night for this?)
- Verify that the alerts properly link to playbooks/guides using the runbook_url (see the check sketched after this list)
- Verify that the steps outlined in the playbooks are clear, concise (providing copy/paste commands), and that they actually work
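One way to make the first two criteria mechanically checkable is a small helper that walks a PrometheusRule and flags alerting rules missing a severity label or a runbook_url annotation. This is only a sketch under the assumption that the operator ships its alerts as prometheus-operator PrometheusRule objects; the function name and messages are made up for illustration.

```go
package alerts

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
)

// lintAlertRules returns one problem per alerting rule that is missing either
// a severity label or a runbook_url annotation. Recording rules (no Alert
// name) are skipped. Names here are illustrative, not the operator's own API.
func lintAlertRules(pr *monitoringv1.PrometheusRule) []string {
	var problems []string
	for _, group := range pr.Spec.Groups {
		for _, rule := range group.Rules {
			if rule.Alert == "" {
				continue // recording rule, not an alert
			}
			if rule.Labels["severity"] == "" {
				problems = append(problems, fmt.Sprintf("%s: missing severity label", rule.Alert))
			}
			if rule.Annotations["runbook_url"] == "" {
				problems = append(problems, fmt.Sprintf("%s: missing runbook_url annotation", rule.Alert))
			}
		}
	}
	return problems
}
```

A check like this could run in CI so that a new alert cannot merge without a severity and a playbook link.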
Risk and Assumptions
- This will require continuous iteration as we find that alerts are either too relaxed or too strict
- We should prioritize making Critical alerts easy to resolve
Documentation Considerations
Currently we have a location for troubleshooting in the EO repo; we need an official docs location as well. (We may need different playbook URL locations for upstream vs. downstream.)
Open Questions
Do our dashboards provide a good sense of cluster health, and can they supplement the alerts? (Are customers actually using them?)
Additional Notes
Guiding questions to determine whether the Operator has reached Level 4 (a brief code sketch follows the list):
- Does your Operator expose a health metrics endpoint?
- Does your Operator expose Operand alerts?
- Do you have Standard Operating Procedures (SOPs) for each alert?
- Does your Operator create critical alerts when the service is down and warning alerts for everything else?
- Does your Operator watch the Operand to create alerts?
- Does your Operator emit custom Kubernetes events?
- Does your Operator expose Operand performance metrics?
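To make a couple of these questions concrete, below is a hedged sketch of how an operator built with controller-runtime might expose a health metric and emit a custom Kubernetes event. The metric name, event reason, and function are invented for illustration; they are not the Elasticsearch Operator's actual identifiers.

```go
package health

import (
	"github.com/prometheus/client_golang/prometheus"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// clusterHealth is an example health metric (0=green, 1=yellow, 2=red) keyed
// by cluster name. The metric name is a placeholder.
var clusterHealth = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "example_es_cluster_health_status",
		Help: "Elasticsearch cluster health: 0=green, 1=yellow, 2=red.",
	},
	[]string{"cluster"},
)

func init() {
	// Register with controller-runtime's registry so the metric is served on
	// the manager's /metrics endpoint.
	metrics.Registry.MustRegister(clusterHealth)
}

// reportHealth updates the gauge and, when the cluster is red, emits a warning
// event on the owning object so `oc get events` surfaces the degradation.
func reportHealth(recorder record.EventRecorder, owner *corev1.ObjectReference, cluster string, status float64) {
	clusterHealth.WithLabelValues(cluster).Set(status)
	if status == 2 {
		recorder.Event(owner, corev1.EventTypeWarning, "ClusterUnhealthy",
			"Elasticsearch cluster health is red; see the runbook for recovery steps")
	}
}
```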