Bug
Resolution: Done-Errata
Normal
OSSM 2.5.2
The operator's metrics are organized in a confusing way and are partly broken.
1. The operator serves two different sets of metrics on two different ports (8383 and 8686).
Port 8383 serves the usual metrics:
controller_runtime_reconcile_time_seconds_bucket{controller="servicemeshcontrolplane-controller",le="0.005"} 12
process_cpu_seconds_total 2.11
rest_client_request_latency_seconds_bucket{url="https://10.217.4.1:443/%7Bprefix%7D",verb="GET",le="0.016"} 311
workqueue_adds_total{name="servicemeshcontrolplane-controller"} 20
...
Port 8686 only serves some CRD-related metrics:
# HELP servicemeshcontrolplane_info Information about the ServiceMeshControlPlane custom resource.
# TYPE servicemeshcontrolplane_info gauge
# HELP servicemeshmember_info Information about the ServiceMeshMember custom resource.
# TYPE servicemeshmember_info gauge
# HELP servicemeshmemberroll_info Information about the ServiceMeshMemberRoll custom resource.
# TYPE servicemeshmemberroll_info gauge
# HELP servicemeshcontrolplane_info Information about the ServiceMeshControlPlane custom resource.
# TYPE servicemeshcontrolplane_info gauge
It's not clear why the metrics are split across two different endpoints.
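For illustration, serving both sets from one endpoint could look roughly like the sketch below. It is a standard-library-only sketch, not the operator's actual code: `metricsSource`, `combinedHandler` and `scrape` are hypothetical names, and each source is assumed to return text in the Prometheus exposition format with no metric families shared between sources.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// metricsSource is a hypothetical stand-in for whatever produces one set of
// metrics as Prometheus text (e.g. the controller metrics or the CRD metrics).
type metricsSource func() string

// combinedHandler writes the output of every source into a single response,
// so that one endpoint exposes all metrics.
// Assumption: the sources do not emit overlapping metric families.
func combinedHandler(sources ...metricsSource) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		for _, src := range sources {
			out := src()
			if out == "" {
				continue // nothing to report from this source
			}
			if !strings.HasSuffix(out, "\n") {
				out += "\n" // keep each source's output on its own lines
			}
			io.WriteString(w, out)
		}
	})
}

// scrape performs an in-memory GET against the handler and returns the body.
func scrape(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	controller := func() string { return "process_cpu_seconds_total 2.11" }
	crd := func() string { return "# TYPE servicemeshcontrolplane_info gauge" }
	fmt.Print(scrape(combinedHandler(controller, crd)))
}
```

A real implementation would more likely register all collectors with one metrics registry instead of concatenating text; the sketch only shows the single-endpoint shape.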
2. The CRD-related metrics are always empty, because the component producing these metrics watches for CRs only in the operator namespace, where there are never any CRs (SMCP, SMMR, SMM). It should instead watch all namespaces.
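The effect of the watch scope can be shown with a small, self-contained sketch. The `resource` type, the `countByKind` helper and the namespace names (`openshift-operators`, `istio-system`, `bookinfo`) are assumptions made for illustration; the only Kubernetes convention relied on is that an empty namespace string means "all namespaces".

```go
package main

import "fmt"

// namespaceAll mirrors the Kubernetes convention that an empty namespace
// selects every namespace (metav1.NamespaceAll is the empty string).
const namespaceAll = ""

// resource is a hypothetical, simplified view of a custom resource.
type resource struct {
	Kind      string
	Namespace string
}

// countByKind counts the resources visible to a watcher that is scoped to
// watchNamespace. With namespaceAll, every resource is visible.
func countByKind(all []resource, watchNamespace string) map[string]int {
	counts := map[string]int{}
	for _, r := range all {
		if watchNamespace != namespaceAll && r.Namespace != watchNamespace {
			continue // outside the watched namespace, so never observed
		}
		counts[r.Kind]++
	}
	return counts
}

func main() {
	cluster := []resource{
		{Kind: "ServiceMeshControlPlane", Namespace: "istio-system"},
		{Kind: "ServiceMeshMemberRoll", Namespace: "istio-system"},
		{Kind: "ServiceMeshMember", Namespace: "bookinfo"},
	}
	// Scoped to the (assumed) operator namespace: no CRs are seen.
	fmt.Println(len(countByKind(cluster, "openshift-operators")))
	// Scoped to all namespaces: every kind is counted.
	fmt.Println(len(countByKind(cluster, namespaceAll)))
}
```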
3. For each of the two sets of metrics, we create a separate kube client instead of reusing the global kube client that's used by the controllers. Because of this:
- The operator's memory usage and the number of API watches are higher than they need to be (each client has its own local cache; the operator thus keeps multiple copies of the same resources)
- On startup, the operator performs API discovery multiple times. API discovery is expensive because it requires a large number of API requests. That number depends heavily on how many APIs are registered in the API server (big production clusters with many different operators installed have many APIs). This not only slows down operator startup, but also puts unnecessary load on the API server.
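One way to picture the single-client approach is a factory that builds the client once and hands the same instance to every consumer. The sketch below uses only the standard library; `client`, `clientFactory` and `discover` are hypothetical stand-ins, with `discover` representing the expensive API discovery and cache setup, and do not reflect the operator's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical stand-in for a kube client with its own local cache.
type client struct{ id int }

// clientFactory hands out one shared client and counts how often the
// expensive setup (standing in for API discovery) actually runs.
type clientFactory struct {
	once           sync.Once
	shared         *client
	discoveryCalls int
}

// discover simulates the costly client construction.
func (f *clientFactory) discover() *client {
	f.discoveryCalls++
	return &client{id: f.discoveryCalls}
}

// Get returns the shared client, creating it on first use only.
func (f *clientFactory) Get() *client {
	f.once.Do(func() { f.shared = f.discover() })
	return f.shared
}

func main() {
	f := &clientFactory{}
	controllers := f.Get()       // used by the controllers
	controllerMetrics := f.Get() // used by the first set of metrics
	crdMetrics := f.Get()        // used by the CRD metrics
	same := controllers == controllerMetrics && controllerMetrics == crdMetrics
	fmt.Println(same, f.discoveryCalls)
}
```

With a shared instance there is one cache and one discovery pass, regardless of how many components ask for a client.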
—
Ideally, we should:
- serve all metrics on a single endpoint
- fix the empty CRD metrics
- use a single kube client
- causes: OSSM-6658 Operator sends many unnecessary API requests on startup (Closed)
- links to: RHSA-2024:135884 Red Hat OpenShift Service Mesh Containers for 2.6.0