Goals

Write runbooks (aka SOPs) for all our existing alerting rules, with clear explanations and copy/paste commands for resolving errors.
Provide a metric that reports the number of bytes produced by each container.
Add a new chart to the Core dashboard that shows a list of the top containers producing logs, based on the new metric.
Add a new chart to the Core dashboard that shows how many logs are lost at collection time (this does not cover the "not stored" side).
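
As an illustration of the "top containers" ranking described above, here is a minimal sketch. It assumes per-container byte counts are already available as samples; the `Sample` type, its field names, and the `top_containers` helper are hypothetical and are not the actual metric or dashboard query.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

# Hypothetical sample shape: one observation of bytes logged by a container.
class Sample(NamedTuple):
    namespace: str
    pod: str
    container: str
    bytes_logged: int

ContainerKey = Tuple[str, str, str]

def top_containers(samples: Iterable[Sample], n: int = 10) -> List[Tuple[ContainerKey, int]]:
    """Sum bytes per (namespace, pod, container) and return the n largest producers."""
    totals: Dict[ContainerKey, int] = defaultdict(int)
    for s in samples:
        totals[(s.namespace, s.pod, s.container)] += s.bytes_logged
    # Sort by total bytes descending; break ties by key so the order is stable.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

In a real dashboard this ranking would more likely be expressed as a query against the monitoring backend; the sketch only shows the aggregation the chart is expected to reflect.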

Non-Goals

Automatic remediation.
New alerting rules since we do not have anything actionable at the moment.

Motivation

Improving the means to troubleshoot a cluster should reduce the number of bugs and requests that result from an unhealthy cluster. It would also extend our ability, and that of the CEE team, to track down exactly where things fail. This would free up our time for actual bugs and new features, rather than troubleshooting clusters.

This information will also help greatly in understanding "noisy neighbours". As soon as our new Flow Control mechanism is in place, we can add alerting rules with adjustable thresholds, so that an admin can use that mechanism to "block" log collection for a particularly noisy container and avoid breaking the log pipeline for other containers.

Alternatives

Customers running scripts to collect information on how many logs every container in their environment produces.

Acceptance Criteria

Verify that the number of bytes produced by a single container matches the value of the collected metric, and that all labels attached to the metric match the respective container's k8s metadata.
Verify that all runbooks are functioning as intended.
Verify that the "Top Containers" chart shows the correct list.
Verify that the "Not processed logs" chart shows the correct value, i.e. the difference between the number of logs produced (conmon) and the number of logs collected (fluentd).
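
The checks in these criteria can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the label keys (`namespace`, `pod`, `container`), and the choice to clamp negative differences to zero are hypothetical and not taken from the implementation.

```python
from typing import Mapping, Sequence

def not_processed(produced: int, collected: int) -> int:
    """Logs produced (conmon side) minus logs collected (fluentd side).

    Clamped at zero (an assumption): counters sampled at slightly different
    times can make the collected count briefly exceed the produced count.
    """
    return max(produced - collected, 0)

def labels_match(
    metric_labels: Mapping[str, str],
    k8s_metadata: Mapping[str, str],
    keys: Sequence[str] = ("namespace", "pod", "container"),  # hypothetical label keys
) -> bool:
    """True if every expected key is present on the metric and equals the k8s value."""
    return all(
        k in metric_labels and metric_labels[k] == k8s_metadata.get(k)
        for k in keys
    )
```

For example, `not_processed(1000, 950)` yields 50, which is the value the "Not processed logs" chart would be expected to show for that interval.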

Risk and Assumptions

Documentation Considerations

Add links to the new runbooks for every alert we document.
Document new dashboards and explain what they illustrate.
Add new metrics we collect and explain what they illustrate.
Add a note explaining that the values of the exposed metrics might not be exact.

Open Questions

Additional Notes

Guiding questions to determine Operator reaching Level 4

Does your Operator expose a health metrics endpoint?
Does your Operator expose Operand alerts?
Do you have Standard Operating Procedures (SOPs) for each alert?
Does your Operator create critical alerts when the service is down and warning alerts for all other alerts?
Does your Operator watch the Operand to create alerts?
Does your Operator emit custom Kubernetes events?
Does your Operator expose Operand performance metrics?

Issue Links

is documented by

OBSDOCS-151 Document "Move Logging Operator from Operator maturity level 3 to 4"

Closed

relates to

RFE-1646 Runbooks for Resolving the Default Alerts Configured in OCP 4.x

Rejected

OCPPLAN-6068 Increase the overall quality for OpenShift's OOTB alerting rules

Closed

People

Assignee:: Alan Conway

Reporter:: Christian Heidenreich (Inactive)

QA Contact:: Ishwar Kanse

Votes:: 0

Watchers:: 5

Dates

Created:: 2020/08/31 4:38 AM

Updated:: 2022/11/30 3:23 PM

Resolved:: 2022/11/30 3:22 PM