-
Story
-
Resolution: Done
CNV-8094 - CNV Observability
CNV I/U Operators Sprint 229, CNV I/U Operators Sprint 230
The KubeVirt metrics code is currently embedded in the heart of the operator code.
This hurts code readability and maintainability and adds complexity.
There are two suggestions for improving the monitoring code in KubeVirt:
1. Move KubeVirt monitoring to an external component
This story is about redesigning our monitoring components so that they are developed and deployed externally to KubeVirt. In other words, monitoring components would be developed in a separate repository and deployed separately (similarly to CDI, for example).
This has many advantages, for example:
- Enhanced development speed
- Decoupling monitoring code from operator code
- Enhanced security: monitor publicly available data only
- Becoming more modular and generic - resilient to future changes
Note that this approach is generic: it could be applied to any operator, not only to KubeVirt. In addition, it can be used to export data not only to Prometheus, but to other tools as well.
For more info, motivation, goals and architecture design, please look at:
https://github.com/kubevirt/community/pull/189
(At the time of writing, the design proposal is still a draft. Many changes are expected. Feedback is much appreciated.)
2. Create a monitoring directory in each operator repository that holds all the monitoring logic (metrics, alerts, runbooks).
More details can be found here: https://docs.google.com/document/d/1L2lcri3SogFhjaIutbVdvnSkFNMrXeBQ7L0Zs_izJzM/edit?usp=sharing
An example is here: https://github.com/operator-framework/operator-sdk/pull/5996
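To make option 2 more concrete, here is a minimal sketch of what a self-contained monitoring package could look like, so that operator code only touches metrics through a narrow API. It uses the Go standard library only. The names `MetricOpts`, `Counter`, `Registry`, and the metric `example_operator_reconcile_total` are hypothetical and are not taken from KubeVirt or from the linked operator-sdk PR. A real operator would most likely wrap the Prometheus client library instead of this stand-in.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// MetricOpts describes one metric. In a real repository this type would
// live in a hypothetical monitoring/metrics package.
type MetricOpts struct {
	Name string
	Help string
}

// Counter is a minimal thread-safe counter standing in for a real metric type.
type Counter struct {
	opts  MetricOpts
	mu    sync.Mutex
	value float64
}

// Inc adds one to the counter.
func (c *Counter) Inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value++
}

// Value returns the current count.
func (c *Counter) Value() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}

// Registry is the single place where all metrics are declared.
type Registry struct {
	mu       sync.Mutex
	counters map[string]*Counter
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{counters: make(map[string]*Counter)}
}

// NewCounter registers a counter and rejects duplicate names.
func (r *Registry) NewCounter(opts MetricOpts) (*Counter, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, exists := r.counters[opts.Name]; exists {
		return nil, fmt.Errorf("metric %q already registered", opts.Name)
	}
	c := &Counter{opts: opts}
	r.counters[opts.Name] = c
	return c, nil
}

// Names lists all registered metric names in sorted order.
func (r *Registry) Names() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.counters))
	for n := range r.counters {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	reg := NewRegistry()
	// Hypothetical metric name, for illustration only.
	reconciles, err := reg.NewCounter(MetricOpts{
		Name: "example_operator_reconcile_total",
		Help: "Number of reconcile loops executed.",
	})
	if err != nil {
		panic(err)
	}
	reconciles.Inc() // operator code calls this instead of touching metric internals
	fmt.Println(reg.Names(), reconciles.Value())
}
```

Keeping every declaration behind one registry is a design choice that also makes it straightforward to generate metric documentation or lint metric names from a single list.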
In this spike we need to determine which of the two implementations is better for KubeVirt, and whether it can also be considered a best practice for other operators that want to have monitoring.
Things to consider during the evaluation of the different alternatives:
1. Be able to have all the labels that are taken from the environment, such as namespace, pod, container, instance, job, endpoint, etc.
2. Be able to collect and report both metrics that are driven by changes to resources in the environment and metrics that should be collected periodically, such as CPU, memory, etc.
3. Have a way to catch breaking changes in the operator's core code so that they do not cause bugs on the monitoring side.
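The three considerations above can be sketched together in a small, standard-library-only Go example. It is an illustration under stated assumptions, not a proposed design: `Labels`, `Store`, `OnPhaseChange`, `Poll`, `MissingMetrics`, and the `example_vm_*` metric names are all hypothetical. `Labels` carries environment-derived labels, `OnPhaseChange` is the event-driven path, `Poll` is the path a periodic ticker would call, and `MissingMetrics` is the kind of check a unit test could run against an expected list to catch breaking changes.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Labels holds environment-derived labels (only two are shown here).
type Labels struct {
	Namespace string
	Pod       string
}

// Store keeps the latest sample per metric name and label set.
type Store struct {
	mu      sync.Mutex
	samples map[string]map[Labels]float64
}

// NewStore returns an empty store.
func NewStore() *Store {
	return &Store{samples: make(map[string]map[Labels]float64)}
}

// Set records the latest value for a metric and label set.
func (s *Store) Set(name string, l Labels, v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.samples[name] == nil {
		s.samples[name] = make(map[Labels]float64)
	}
	s.samples[name][l] = v
}

// Get returns the latest value and whether it exists.
func (s *Store) Get(name string, l Labels) (float64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.samples[name][l]
	return v, ok
}

// Names lists the metric names that currently have samples, sorted.
func (s *Store) Names() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	names := make([]string, 0, len(s.samples))
	for n := range s.samples {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// Hypothetical metric names, for illustration only.
const (
	metricRunning = "example_vm_running"
	metricMemory  = "example_vm_memory_bytes"
)

// OnPhaseChange is the event-driven path: call it when a resource changes.
func (s *Store) OnPhaseChange(l Labels, running bool) {
	v := 0.0
	if running {
		v = 1
	}
	s.Set(metricRunning, l, v)
}

// Poll is the periodic path: a caller would invoke it from a time.Ticker loop.
func (s *Store) Poll(l Labels, readMemoryBytes func() float64) {
	s.Set(metricMemory, l, readMemoryBytes())
}

// MissingMetrics returns the expected names that the store does not expose.
func MissingMetrics(s *Store, expected []string) []string {
	present := make(map[string]bool)
	for _, n := range s.Names() {
		present[n] = true
	}
	var missing []string
	for _, n := range expected {
		if !present[n] {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	store := NewStore()
	l := Labels{Namespace: "demo", Pod: "virt-launcher-example"}
	store.OnPhaseChange(l, true)
	store.Poll(l, func() float64 { return 512 * 1024 * 1024 })
	fmt.Println("exposed:", store.Names())
	fmt.Println("missing:", MissingMetrics(store,
		[]string{metricRunning, metricMemory, "example_vm_cpu_seconds_total"}))
}
```

Whichever alternative is chosen, a check like `MissingMetrics` run in CI against a committed list of expected metrics is one possible way to surface a core-code change that silently drops or renames a metric.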
- split to
-
CNV-24647 [contd] [spike] Plan the monitoring code refactoring
- Closed
- links to