Knative Serving / SRVKS-720

Too many metrics in webhook_request_latencies_bucket{job="webhook-sm-service"}


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • 1.15.0
    • 1.14.0
    • None
    • False
    • False
    • Undefined

      On a cluster with ~50 namespaces containing a ksvc, webhook_request_latencies_bucket{job="webhook-sm-service"} is the single largest metric on the cluster, with 29801 time series (which is more than the API servers'

      Examples:

      webhook_request_latencies_bucket{admission_allowed="false", endpoint="https", instance="10.128.2.33:8444", job="webhook-sm-service", kind_group="networking.internal.knative.dev", kind_kind="Ingress", kind_version="v1alpha1", le="+Inf", namespace="knative-serving", pod="webhook-576b57b4d6-glxhp", request_operation="UPDATE", resource_group="networking.internal.knative.dev", resource_namespace="helloworld-re-sn-0", resource_resource="ingresses", resource_version="v1alpha1", service="webhook-sm-service"}   436
      
      webhook_request_latencies_bucket{admission_allowed="false", endpoint="https", instance="10.128.2.33:8444", job="webhook-sm-service", kind_group="networking.internal.knative.dev", kind_kind="Ingress", kind_version="v1alpha1", le="+Inf", namespace="knative-serving", pod="webhook-576b57b4d6-glxhp", request_operation="UPDATE", resource_group="networking.internal.knative.dev", resource_namespace="helloworld-re-sn-1", resource_resource="ingresses", resource_version="v1alpha1", service="webhook-sm-service"}   424
      

      It seems there is one time series for each combination of

      bucket * namespace * kind * operation

      for every namespace with a ksvc, and for every Knative kind.

      (There are 17 buckets, and (kind * operation) seems to be around 32, so there are about 544 time series per namespace.)
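
The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and the inputs are the approximate figures quoted in this report (17 buckets, roughly 32 kind/operation combinations, about 50 namespaces with a ksvc).

```python
# Rough cardinality estimate using the approximate figures from this report.
BUCKETS = 17          # histogram buckets (distinct "le" values), as observed
KIND_OPERATION = 32   # approximate number of (kind * operation) combinations
NAMESPACES = 50       # approximate number of namespaces containing a ksvc

series_per_namespace = BUCKETS * KIND_OPERATION
estimated_total = series_per_namespace * NAMESPACES

print(series_per_namespace)  # 544
print(estimated_total)       # 27200
```

The result is of the same order as the 29801 series observed. Both inputs are approximate, so an exact match is not expected.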

      Notice the `resource_namespace` label, which means there is a separate set of bucket time series for each namespace. Note also that even the OpenShift apiserver_request_duration_seconds_bucket metric doesn't distinguish namespaces, so IMHO it seems excessive to distinguish them for the webhook_request_latencies_bucket metric.
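
To illustrate the effect of the `resource_namespace` label, here is a minimal sketch that parses the two example series from this report, drops that label, and merges whatever becomes identical. The label sets are abbreviated for readability, and the helper names (`parse_labels`, `aggregate_without`) are hypothetical. This is not the fix applied in Knative, only a model of what removing the label would do to cardinality.

```python
import re
from collections import defaultdict

# The two example samples from this report, abbreviated to a few labels.
SAMPLES = [
    ('webhook_request_latencies_bucket{kind_kind="Ingress", le="+Inf", '
     'request_operation="UPDATE", resource_namespace="helloworld-re-sn-0"}', 436),
    ('webhook_request_latencies_bucket{kind_kind="Ingress", le="+Inf", '
     'request_operation="UPDATE", resource_namespace="helloworld-re-sn-1"}', 424),
]

LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_labels(series: str) -> dict:
    """Extract the label name/value pairs between the braces."""
    body = series[series.index("{") + 1 : series.rindex("}")]
    return dict(LABEL_RE.findall(body))


def aggregate_without(samples, dropped_label):
    """Sum sample values over series that differ only in dropped_label."""
    totals = defaultdict(int)
    for series, value in samples:
        labels = parse_labels(series)
        labels.pop(dropped_label, None)
        totals[tuple(sorted(labels.items()))] += value
    return dict(totals)


merged = aggregate_without(SAMPLES, "resource_namespace")
print(len(SAMPLES), "->", len(merged))  # 2 -> 1
print(list(merged.values()))            # [860]
```

On the query side, the same aggregation corresponds to PromQL's `sum without (resource_namespace) (...)`. That reduces what a query returns, but not the number of series that are scraped and stored.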

      See also the discussion https://coreos.slack.com/archives/CD87JDUB0/p1619024836262000

        1. image-2021-04-23-15-40-10-548.png
          111 kB
          Stavros Kontopoulos
        2. image-2021-04-23-15-41-09-467.png
          97 kB
          Stavros Kontopoulos
        3. image-2021-04-23-15-45-15-976.png
          196 kB
          Stavros Kontopoulos
        4. image-2021-04-23-15-45-29-618.png
          194 kB
          Stavros Kontopoulos
        5. prometheus_too_many_metrics.png
          178 kB
          Marek Schmidt

              Assignee: Unassigned
              Reporter: Marek Schmidt (maschmid@redhat.com)