Type: Epic
Resolution: Done
Priority: Normal
Fix Version/s: Logging 5.2
Affects Version/s: None
Component/s: Log Storage
Labels:

Epic Name:
[ES] Deep Insights
Epic Status:
Done
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done


Goals

An Elasticsearch cluster can be difficult to administer alongside an OCP cluster. We therefore want to provide comprehensive alerts, each paired with easy steps for further troubleshooting and resolution, so that a user's cluster can be kept in a healthy state.

Improved alerting
Appropriate severity set for different alerts
Quick copy/paste commands to run
Clear explanations and ways to resolve errors based on other copy/paste commands

Non-Goals

Automatically taking steps based on alerting

Motivation

As we improve the means to troubleshoot a cluster, it should reduce the number of bugs/requests that come in that are the result of a cluster that is not healthy. This would free up our time for actual bugs and new features, rather than troubleshooting clusters.

Alternatives

Dedicate resources to train for hosted clusters and produce documentation for customers to troubleshoot.

Acceptance Criteria

Verify EO provides alerts with severity based on how critical the action is for the cluster to stay healthy (think, getting paged in the middle of the night if this happens)
Verify that the alerts properly link to playbooks/guides using the runbook_url
Verify that the steps outlined in the playbooks are clear and concise (provide copy/paste commands) and that they work.
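
The first two criteria above could be partly checked by a script. The following is a minimal sketch, not the EO's actual tooling: it assumes alert rules are already loaded as dictionaries in the Prometheus alerting-rule shape (`alert`, `labels`, `annotations`), that severity lives in `labels.severity`, and that the runbook link lives in `annotations.runbook_url`. The allowed severity set, the sample rule names, and the example.com URLs are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: the severities we accept; adjust to the project's own policy.
ALLOWED_SEVERITIES = {"critical", "warning", "info"}


def check_alert_rule(rule):
    """Return a list of problems found in one alerting rule (empty if none)."""
    problems = []
    name = rule.get("alert", "<unnamed>")

    # Every alert should carry a recognized severity label.
    severity = rule.get("labels", {}).get("severity")
    if severity not in ALLOWED_SEVERITIES:
        problems.append(f"{name}: missing or unknown severity {severity!r}")

    # Every alert should link to a runbook through an http(s) URL.
    url = rule.get("annotations", {}).get("runbook_url", "")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append(f"{name}: missing or malformed runbook_url")

    return problems


def check_rules(rules):
    """Collect the problems for every rule in a list of rules."""
    problems = []
    for rule in rules:
        problems.extend(check_alert_rule(rule))
    return problems


# Hypothetical rules, for illustration only.
SAMPLE_RULES = [
    {
        "alert": "ExampleClusterRed",
        "labels": {"severity": "critical"},
        "annotations": {"runbook_url": "https://example.com/runbooks/cluster-red"},
    },
    {
        "alert": "ExampleDiskLow",
        "labels": {"severity": "warning"},
        "annotations": {},
    },
]

if __name__ == "__main__":
    for problem in check_rules(SAMPLE_RULES):
        print(problem)
```

A check like this only confirms that the fields are present and well-formed. Whether a severity is appropriate, and whether the playbook steps actually work, still needs manual verification.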

Risk and Assumptions

This will require ongoing iteration as we find that alerts are either too relaxed or too strict.
We should prioritize making Critical alerts easy to resolve.

Documentation Considerations

We currently have a location for troubleshooting in the EO repo; we also need an official docs location. (We may need different playbook URL locations for upstream vs. downstream.)

Open Questions

Do our dashboards provide a good sense of cluster health, and can they supplement the alerts? (Are customers using them?)

Additional Notes

Guiding questions to determine Operator reaching Level 4

Does your Operator expose a health metrics endpoint?
Does your Operator expose Operand alerts?
Do you have Standard Operating Procedures (SOPs) for each alert?
Does your Operator create critical alerts when the service is down and warning alerts for all other alerts?
Does your Operator watch the Operand to create alerts?
Does your Operator emit custom Kubernetes events?
Does your Operator expose Operand performance metrics?

is documented by

RHDEVDOCS-2679 Document "Move Elasticsearch Operator from Operator maturity level 3 to 4"

Closed

relates to

RFE-1646 Runbooks for Resolving the Default Alerts Configured in OCP 4.x

Rejected

OCPPLAN-6068 Increase the overall quality for OpenShift's OOTB alerting rules

Closed

links to

Operators Maturity Model

Assignee: Igor Karpukhin (Inactive)

Reporter: Eric Wolinetz (Inactive)

QA Contact: Qiaoling Tang

Created: 2020/07/27 11:04 AM

Updated: 2022/12/02 5:52 AM

Resolved: 2021/09/09 1:30 PM
