OpenShift Service Mesh / OSSM-6703

Remove metrics port 8686 because it always serves empty metrics


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Normal
    • Versions: OSSM 2.6.0, OSSM 2.5.2
    • Customer Impact: None

      The operator's metrics setup is confusing, and parts of it are broken.

      1. The operator serves two different sets of metrics on two different ports (8383 and 8686).
      Port 8383 serves the usual metrics:

      controller_runtime_reconcile_time_seconds_bucket{controller="servicemeshcontrolplane-controller",le="0.005"} 12
      process_cpu_seconds_total 2.11
      rest_client_request_latency_seconds_bucket{url="https://10.217.4.1:443/%7Bprefix%7D",verb="GET",le="0.016"} 311
      workqueue_adds_total{name="servicemeshcontrolplane-controller"} 20
      ...
      

      Port 8686 only serves some CRD-related metrics:

      # HELP servicemeshcontrolplane_info Information about the ServiceMeshControlPlane custom resource.
      # TYPE servicemeshcontrolplane_info gauge
      # HELP servicemeshmember_info Information about the ServiceMeshMember custom resource.
      # TYPE servicemeshmember_info gauge
      # HELP servicemeshmemberroll_info Information about the ServiceMeshMemberRoll custom resource.
      # TYPE servicemeshmemberroll_info gauge
      # HELP servicemeshcontrolplane_info Information about the ServiceMeshControlPlane custom resource.
      # TYPE servicemeshcontrolplane_info gauge
      

      It's not clear why the metrics are split across two different endpoints.

      2. The CRD-related metrics are always empty, because the component producing these metrics watches for CRs only in the operator's own namespace, where no CRs (SMCP, SMMR, SMM) ever exist. It should instead watch all namespaces.

      3. For each of the two sets of metrics, we create a separate kube client instead of reusing the global kube client that's used by the controllers. Because of this:

      • The operator's memory usage and the number of API watches are higher than necessary (each client maintains its own local cache, so the operator keeps multiple copies of the same resources in memory)
      • On startup, the operator performs API discovery multiple times. API discovery is expensive because it requires many API requests, and the number grows with the number of APIs registered in the API server (large production clusters with many operators installed register many APIs). This both slows down operator startup and puts unnecessary load on the API server.

      Ideally, we should:

      • serve all metrics on a single endpoint
      • fix the empty CRD metrics
      • use a single kube client

            Assignee: Marko Luksa (mluksa@redhat.com)
            Reporter: Marko Luksa (mluksa@redhat.com)
