- Task
- Resolution: Done
- Normal
- Logging 5.2
- devex docs #207 Sep 9-Sep 30, devex docs #208 Sep 30-Oct 21
- 3
- Documentation (Ref Guide, User Guide, etc.)
- Engineering project marker - not a doc task.
Goals
An Elasticsearch cluster can be difficult to administer in conjunction with an OCP cluster. As such, we want to provide comprehensive alerts paired with easy steps to troubleshoot and resolve the underlying problems so that a user's cluster can be kept in a healthy state (see the example rule sketched after the list below).
- Improved alerting
- Appropriate severity set for different alerts
- Quick copy/paste commands to run
- Clear explanations and resolution steps backed by additional copy/paste commands
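As a rough illustration of these goals, here is a minimal sketch of an alert rule built with the prometheus-operator Go API, carrying a severity label and a runbook_url annotation that points at a playbook of copy/paste commands. The rule name, metric, threshold, and URL below are hypothetical placeholders, not the operator's actual rules, and exact field types can differ between prometheus-operator versions.

```go
package alerts

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newClusterHealthRule sketches a PrometheusRule whose alert carries a
// severity label and a runbook_url annotation, so a firing alert links
// directly to a playbook. Expression, threshold, and URL are placeholders.
func newClusterHealthRule(namespace string) *monitoringv1.PrometheusRule {
	return &monitoringv1.PrometheusRule{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "elasticsearch-cluster-health", // hypothetical name
			Namespace: namespace,
		},
		Spec: monitoringv1.PrometheusRuleSpec{
			Groups: []monitoringv1.RuleGroup{{
				Name: "elasticsearch.cluster.health",
				Rules: []monitoringv1.Rule{{
					Alert: "ElasticsearchClusterNotHealthy", // hypothetical alert name
					// Placeholder expression: fires when the cluster reports red status.
					Expr: intstr.FromString(`es_cluster_status == 2`),
					Labels: map[string]string{
						// "critical" means "worth being paged for in the middle of the night".
						"severity": "critical",
					},
					Annotations: map[string]string{
						"summary": "Elasticsearch cluster health is red",
						// runbook_url points the on-call engineer at the playbook
						// containing the copy/paste troubleshooting commands.
						"runbook_url": "https://example.com/playbooks/cluster-health.md",
					},
				}},
			}},
		},
	}
}
```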
Non-Goals
- Automatically taking steps based on alerting
Motivation
Improving the means to troubleshoot a cluster should reduce the number of bugs/requests that come in as a result of a cluster that is not healthy. This would free up our time for actual bugs and new features, rather than troubleshooting clusters.
Alternatives
Dedicate resources to training for hosted clusters and produce troubleshooting documentation for customers.
Acceptance Criteria
- Verify that the EO provides alerts whose severity reflects how critical the action is for keeping the cluster healthy (think: would you want to be paged in the middle of the night for this?)
- Verify that the alerts properly link to playbooks/guides using the runbook_url (see the check sketched after this list)
- Verify that the steps outlined in the playbooks are clear, concise (providing copy/paste commands), and that they actually work
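One way to make the first two criteria mechanically checkable is a small helper that walks a PrometheusRule and flags alerting rules missing a severity label or a runbook_url annotation. This is only a sketch under the assumption that the operator ships its alerts as prometheus-operator PrometheusRule objects; the function name and messages are made up for illustration.

```go
package alerts

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
)

// lintAlertRules returns one problem per alerting rule that is missing either
// a severity label or a runbook_url annotation. Recording rules (no Alert
// name) are skipped. Names here are illustrative, not the operator's own API.
func lintAlertRules(pr *monitoringv1.PrometheusRule) []string {
	var problems []string
	for _, group := range pr.Spec.Groups {
		for _, rule := range group.Rules {
			if rule.Alert == "" {
				continue // recording rule, not an alert
			}
			if rule.Labels["severity"] == "" {
				problems = append(problems, fmt.Sprintf("%s: missing severity label", rule.Alert))
			}
			if rule.Annotations["runbook_url"] == "" {
				problems = append(problems, fmt.Sprintf("%s: missing runbook_url annotation", rule.Alert))
			}
		}
	}
	return problems
}
```

A check like this could run in CI so that a new alert cannot merge without a severity and a playbook link.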
Risk and Assumptions
- This will require continuous iteration as we find that alerts are either too relaxed or too strict
- We should prioritize making Critical alerts easy to resolve
Documentation Considerations
Currently we have a location for troubleshooting in the EO repo; we need an official docs location as well. (We may need different playbook URL locations for upstream vs. downstream.)
Open Questions
Do our dashboards provide a good sense of cluster health, and can they supplement the alerts? (Are customers actually using them?)
Additional Notes
Guiding questions to determine whether the Operator has reached Level 4 (a brief code sketch follows the list):
- Does your Operator expose a health metrics endpoint?
- Does your Operator expose Operand alerts?
- Do you have Standard Operating Procedures (SOPs) for each alert?
- Does your Operator create critical alerts when the service is down and warning alerts for everything else?
- Does your Operator watch the Operand to create alerts?
- Does your Operator emit custom Kubernetes events?
- Does your Operator expose Operand performance metrics?
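To make a couple of these questions concrete, below is a hedged sketch of how an operator built with controller-runtime might expose a health metric and emit a custom Kubernetes event. The metric name, event reason, and function are invented for illustration; they are not the Elasticsearch Operator's actual identifiers.

```go
package health

import (
	"github.com/prometheus/client_golang/prometheus"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// clusterHealth is an example health metric (0=green, 1=yellow, 2=red) keyed
// by cluster name. The metric name is a placeholder.
var clusterHealth = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "example_es_cluster_health_status",
		Help: "Elasticsearch cluster health: 0=green, 1=yellow, 2=red.",
	},
	[]string{"cluster"},
)

func init() {
	// Register with controller-runtime's registry so the metric is served on
	// the manager's /metrics endpoint.
	metrics.Registry.MustRegister(clusterHealth)
}

// reportHealth updates the gauge and, when the cluster is red, emits a warning
// event on the owning object so `oc get events` surfaces the degradation.
func reportHealth(recorder record.EventRecorder, owner *corev1.ObjectReference, cluster string, status float64) {
	clusterHealth.WithLabelValues(cluster).Set(status)
	if status == 2 {
		recorder.Event(owner, corev1.EventTypeWarning, "ClusterUnhealthy",
			"Elasticsearch cluster health is red; see the runbook for recovery steps")
	}
}
```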