Docs for Red Hat Developers
RHDEVDOCS-2679

Document "Move Elasticsearch Operator from Operator maturity level 3 to 4"


    • devex docs #207 Sep 9-Sep 30, devex docs #208 Sep 30-Oct 21
    • 3
    • Documentation (Ref Guide, User Guide, etc.)
    • Engineering project marker - not a doc task.

      Goals

      An Elasticsearch cluster can be difficult to administer in conjunction with an OCP cluster. As such, we want to provide comprehensive alerts paired with clear steps to troubleshoot and resolve them, so that a user's cluster can be kept in a healthy state.

      • Improved alerting (see the sketch after this list)
      • Appropriate severities set for different alerts
      • Quick copy/paste commands to run
      • Clear explanations of errors, with additional copy/paste commands to resolve them
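
      A minimal sketch of what such an alert could look like, assuming the alerts ship as a PrometheusRule object (monitoring.coreos.com/v1); the names, namespace, metric expression, and runbook URL below are illustrative placeholders, not the Operator's actual rules:

        apiVersion: monitoring.coreos.com/v1
        kind: PrometheusRule
        metadata:
          name: elasticsearch-operator-alerts        # hypothetical name
          namespace: openshift-logging               # hypothetical namespace
        spec:
          groups:
          - name: elasticsearch.cluster.health       # hypothetical group name
            rules:
            - alert: ElasticsearchClusterNotHealthy  # illustrative alert name
              expr: es_cluster_status == 2           # hypothetical expression for "cluster status is red"
              for: 5m
              labels:
                severity: critical                   # critical = worth paging on in the middle of the night
              annotations:
                summary: Elasticsearch cluster health is red
                runbook_url: https://example.com/runbooks/elasticsearch-cluster-not-healthy  # placeholder URL

      The severity label is what routing and paging decisions key on, and the runbook_url annotation is the conventional place to link the playbook that holds the copy/paste troubleshooting steps.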

      Non-Goals

      • Automatically taking steps based on alerting

      Motivation

      Improving the means to troubleshoot a cluster should reduce the number of bugs/requests that come in as a result of an unhealthy cluster. This would free up our time for actual bugs and new features, rather than for troubleshooting clusters.

      Alternatives

      Dedicate resources to training for hosted clusters and to producing documentation that customers can use to troubleshoot.

      Acceptance Criteria

      • Verify that the EO provides alerts whose severity reflects how critical the required action is for the cluster to stay healthy (think: is this worth getting paged for in the middle of the night?)
      • Verify that the alerts properly link to playbooks/guides using the runbook_url annotation
      • Verify that the steps outlined in the playbooks are clear, concise (providing copy/paste commands), and working.

      Risks and Assumptions

      • This will be an ongoing effort to iterate on as we find that alerts are either too relaxed or too strict
      • We should prioritize making critical alerts easy to resolve

      Documentation Considerations

      We currently have a location for troubleshooting content in the EO repo; we need an official docs location as well. (We may need different playbook URL locations for upstream vs. downstream.)

      Open Questions

      Do our dashboards provide a good sense of cluster health, and can they supplement the alerts? (Are customers using them?)

      Additional Notes

      Guiding questions to determine whether the Operator has reached Level 4

      • Does your Operator expose a health metrics endpoint?
      • Does your Operator expose Operand alerts?
      • Do you have Standard Operating Procedures (SOPs) for each alert?
      • Does your Operator create critical alerts when the service is down and warning alerts for everything else?
      • Does your Operator watch the Operand to create alerts?
      • Does your Operator emit custom Kubernetes events?
      • Does your Operator expose Operand performance metrics?
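
      For reference on the first question, a health metrics endpoint on OCP is typically scraped through a ServiceMonitor. The sketch below assumes the Operator exposes an HTTPS /metrics endpoint through a Service labeled name=elasticsearch-operator; the object names, namespace, and port are illustrative, not the Operator's actual resources:

        apiVersion: monitoring.coreos.com/v1
        kind: ServiceMonitor
        metadata:
          name: monitor-elasticsearch-operator     # hypothetical name
          namespace: openshift-operators-redhat    # hypothetical namespace
        spec:
          endpoints:
          - port: metrics                          # assumes the Service names its metrics port "metrics"
            path: /metrics
            scheme: https
            tlsConfig:
              insecureSkipVerify: true             # illustrative only; a real config should verify the serving certificate
          selector:
            matchLabels:
              name: elasticsearch-operator         # hypothetical Service label

      Alert rules such as the one sketched under Goals are then evaluated against the metrics scraped here, and each SOP can reference the same metrics in its diagnostics.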
