- Write runbooks (aka SOPs) for all our existing alerting rules, with clear explanations and copy/paste commands to resolve errors.
- Provide a metric that illustrates the number of bytes produced by a container.
- Add a new chart to the Core dashboard that shows a list of the top log-producing containers based on the new metric.
- Add a new chart to the Core dashboard that illustrates how many logs are lost during collection (not on the "not stored" side).
- Automatic remediation.
- New alerting rules, since we do not have anything actionable at the moment.
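As a sketch of how the new byte metric could feed the "Top Containers" chart, the recording rule below assumes a hypothetical counter named `container_produced_bytes_total` (the metric name, label set, and rule group name are placeholders, not the final design):

```yaml
# Sketch only: metric and rule names are assumptions, not the final design.
groups:
  - name: logging.top-containers
    rules:
      # Per-container log production rate over the last 5 minutes.
      - record: container:produced_bytes:rate5m
        expr: rate(container_produced_bytes_total[5m])
```

The dashboard chart could then query something like `topk(10, container:produced_bytes:rate5m)` to list the ten containers currently producing the most log bytes.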
As we improve the means to troubleshoot a cluster, the number of incoming bugs/requests that are really the result of an unhealthy cluster should drop. Furthermore, it would extend our and the CEE team's ability to track down exactly where things fail. This would free up our time for actual bugs and new features rather than for troubleshooting clusters.
This information will also greatly help us better understand "noisy neighbours". As soon as our new Flow Control mechanism is in place, we can add alerting rules with adjustable thresholds, so that an admin can use that mechanism to "block" log collection for a particularly noisy container and avoid breaking the log pipeline for the other containers.
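Once such a per-container metric exists, a threshold-based "noisy neighbour" alert could look roughly like the following. The metric name, label, severity, and the 5 MiB/s threshold are all assumptions an admin would tune, not part of this proposal:

```yaml
# Sketch only: metric name, labels, and threshold are assumptions.
groups:
  - name: logging.noisy-neighbour
    rules:
      - alert: ContainerLogProductionTooHigh
        # Fire when a single container produces more than 5 MiB/s of logs
        # sustained for 10 minutes.
        expr: rate(container_produced_bytes_total[5m]) > 5 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.container }} is producing more than 5 MiB/s of logs"
```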
Customers running scripts to collect information on how many logs every container in their environment produces.
- Verify that the number of bytes produced by a single container matches the value of the collected metric, and that all labels attached to the metric match the respective container's Kubernetes metadata.
- Verify that all runbooks are functioning as intended.
- Verify that the "Top Containers" chart shows the correct list.
- Verify that the "Not processed logs" chart shows the correct value, i.e. the difference between the number of logs produced (conmon) and the number of logs ingested (fluentd).
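The last check above can be expressed as a small script. The function below is only a sketch of the arithmetic, with the two counter values passed in directly rather than scraped from Prometheus:

```python
def lost_logs(produced_conmon: int, ingested_fluentd: int) -> int:
    """Number of logs lost during collection: produced by conmon but
    never seen by fluentd. Clamped at zero so counter resets or scrape
    skew between the two sources do not report a negative loss."""
    return max(0, produced_conmon - ingested_fluentd)


if __name__ == "__main__":
    # Example: 10_000 log lines produced, 9_750 ingested -> 250 lost.
    print(lost_logs(10_000, 9_750))
```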
- Add links to the new runbooks for every alert we document.
- Document the new dashboards and explain what they illustrate.
- Document the new metrics we collect and explain what they illustrate.
- Add a note explaining that the values of the exposed metrics might not be exact.
Guiding questions to determine whether an Operator has reached Level 4
- Does your Operator expose a health metrics endpoint?
- Does your Operator expose Operand alerts?
- Do you have Standard Operating Procedures (SOPs) for each alert?
- Does your Operator create critical alerts when the service is down and warning alerts for everything else?
- Does your Operator watch the Operand to create alerts?
- Does your Operator emit custom Kubernetes events?
- Does your Operator expose Operand performance metrics?