OpenShift Monitoring / MON-2348

Label non-master label_node_role_kubernetes_io for cluster:node_instance_type_count:sum


    • Type: Spike
    • Resolution: Duplicate
    • Priority: Blocker

      Today, the cluster:node_instance_type_count:sum recording rule uses cluster:master_nodes for role labels. cluster:master_nodes sets label_node_role_kubernetes_io="master" but, by design, covers only control-plane nodes. As a result, non-control-plane nodes carry no role label, and cluster:node_instance_type_count:sum results look like:

      cluster:node_instance_type_count:sum{label_beta_kubernetes_io_instance_type="m5.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos"}  3
      cluster:node_instance_type_count:sum{label_beta_kubernetes_io_instance_type="m6i.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="master"}  3

      I propose replacing the cluster:node_instance_type_count:sum recording rule with:

      count by (label_beta_kubernetes_io_instance_type, label_node_role_kubernetes_io, label_kubernetes_io_arch, label_node_openshift_io_os_id) (
        group by (node, label_beta_kubernetes_io_instance_type, label_node_role_kubernetes_io, label_kubernetes_io_arch, label_node_openshift_io_os_id) (
          + on (node) group_left (label_node_role_kubernetes_io)
          label_replace(kube_node_role, "label_node_role_kubernetes_io", "$1", "role", "(.*)")
        )
      )

      to give results like:

      {label_beta_kubernetes_io_instance_type="m5.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="worker"}  3
      {label_beta_kubernetes_io_instance_type="m6i.xlarge", label_kubernetes_io_arch="amd64", label_node_openshift_io_os_id="rhcos", label_node_role_kubernetes_io="master"}  3
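
      To make the intent of the proposed rule concrete, here is a minimal Python sketch of the same join-then-count logic. It is not how Prometheus evaluates the rule. The node names and sample data are invented for illustration, and it assumes each node has exactly one role series and one per-node series carrying the instance-type, arch, and OS labels.

```python
from collections import Counter

# Labels the proposed rule counts by, in the order of its "count by" clause.
GROUP_BY = (
    "label_beta_kubernetes_io_instance_type",
    "label_node_role_kubernetes_io",
    "label_kubernetes_io_arch",
    "label_node_openshift_io_os_id",
)

# Hypothetical sample cluster: node names are invented for illustration.
COMMON = {"label_kubernetes_io_arch": "amd64", "label_node_openshift_io_os_id": "rhcos"}
NODE_LABELS = {}  # node -> labels from a per-node label series (assumed left-hand side)
NODE_ROLES = {}   # node -> role, standing in for kube_node_role (one role per node assumed)
for i in range(3):
    NODE_LABELS[f"master-{i}"] = dict(COMMON, label_beta_kubernetes_io_instance_type="m6i.xlarge")
    NODE_ROLES[f"master-{i}"] = "master"
    NODE_LABELS[f"worker-{i}"] = dict(COMMON, label_beta_kubernetes_io_instance_type="m5.xlarge")
    NODE_ROLES[f"worker-{i}"] = "worker"


def count_by_role(node_labels, node_roles):
    """Copy each node's role onto its label set, then count nodes per label combination."""
    counts = Counter()
    for node, labels in node_labels.items():
        role = node_roles.get(node)
        if role is None:
            continue  # no matching role series: the join drops this node
        merged = dict(labels, label_node_role_kubernetes_io=role)
        counts[tuple(merged.get(name, "") for name in GROUP_BY)] += 1
    return counts


for key, value in sorted(count_by_role(NODE_LABELS, NODE_ROLES).items()):
    print(dict(zip(GROUP_BY, key)), value)
```

      With this sample data the sketch yields two label combinations, each counting three nodes, matching the worker and master series shown above.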

      This would make it possible to determine from Telemetry how instance types, etc. are distributed among non-control-plane roles, and also make it easy to see which roles are being used in clusters. This information is already available via Insights, but having it in Telemetry makes aggregated statistics more accessible.

      The concern with this change is that it might break consumers that rely on label_node_role_kubernetes_io being empty for all non-control-plane nodes. This ticket is about checking with known consumers to see whether they have concerns about the change. A blanket GitHub search turns up no external consumers, although some may exist whose code is not published on GitHub.
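
      The kind of breakage in question can be illustrated with a small Python sketch. The consumer functions below are hypothetical; no known consumer is confirmed to work this way. The sample series mirror the current and proposed results quoted in this ticket.

```python
ROLE = "label_node_role_kubernetes_io"


def series(instance_type, value, role=None):
    """Build one (labels, value) pair shaped like a cluster:node_instance_type_count:sum sample."""
    labels = {
        "label_beta_kubernetes_io_instance_type": instance_type,
        "label_kubernetes_io_arch": "amd64",
        "label_node_openshift_io_os_id": "rhcos",
    }
    if role is not None:
        labels[ROLE] = role
    return labels, value


# Results as quoted in the ticket: today's rule versus the proposed rule.
CURRENT = [series("m5.xlarge", 3), series("m6i.xlarge", 3, role="master")]
PROPOSED = [series("m5.xlarge", 3, role="worker"), series("m6i.xlarge", 3, role="master")]


def count_empty_role(result):
    """Hypothetical fragile consumer: treats an empty or absent role label as 'not control plane'."""
    return sum(value for labels, value in result if labels.get(ROLE, "") == "")


def count_non_master(result):
    """Hypothetical robust consumer: treats anything other than 'master' as 'not control plane'."""
    return sum(value for labels, value in result if labels.get(ROLE, "") != "master")


print(count_empty_role(CURRENT))   # 3
print(count_empty_role(PROPOSED))  # 0
print(count_non_master(CURRENT), count_non_master(PROPOSED))  # 3 3
```

      A consumer written like count_empty_role would silently drop to zero after the change, while one written like count_non_master would be unaffected.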

            hasun@redhat.com Haoyu Sun
            trking W. Trevor King