[OCPBUGS-12525] node role is calculated twice in thanos-querier API - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.14.0
Affects Version/s: premerge
Component/s: Monitoring
Labels:
None

Severity:
Moderate
Regression:
No
Sprint:
MON Sprint 235
sprint_count:
1
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:
None
Release Note Text:
* Before this update, Thanos Querier failed to de-duplicate metrics by node roles. This update fixes the issue so that Thanos Querier now properly de-duplicates metrics by node roles. link:https://issues.redhat.com/browse/OCPBUGS-12525[OCPBUGS-12525]
Release Note Type:
Bug Fix
Release Note Status:
Done
Target Version:

4.14.0

Description of problem:

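Found while testing https://issues.redhat.com/browse/OCPBUGS-10387 with openshift/cluster-monitoring-operator#1926. The cluster was launched with:

launch 4.14-ci,openshift/cluster-monitoring-operator#1926 no-spot

The cluster has 3 masters and 3 workers, each node has 4 CPUs, and there are no infra nodes.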

$ oc get node
NAME                                         STATUS   ROLES                  AGE   VERSION
ip-10-0-132-193.us-east-2.compute.internal   Ready    control-plane,master   23m   v1.26.2+d2e245f
ip-10-0-135-65.us-east-2.compute.internal    Ready    control-plane,master   23m   v1.26.2+d2e245f
ip-10-0-149-72.us-east-2.compute.internal    Ready    worker                 14m   v1.26.2+d2e245f
ip-10-0-158-0.us-east-2.compute.internal     Ready    worker                 14m   v1.26.2+d2e245f
ip-10-0-229-135.us-east-2.compute.internal   Ready    worker                 17m   v1.26.2+d2e245f
ip-10-0-234-36.us-east-2.compute.internal    Ready    control-plane,master   23m   v1.26.2+d2e245f

The node role labels are:

control-plane: node-role.kubernetes.io/control-plane: ""
master: node-role.kubernetes.io/master: ""
worker: node-role.kubernetes.io/worker: ""

Searching for "cluster:capacity_cpu_cores:sum" in the admin console under "Observe -> Metrics" returns the series for label_node_role_kubernetes_io=master and for label_node_role_kubernetes_io="" twice each:

Name                            label_beta_kubernetes_io_instance_type  label_kubernetes_io_arch  label_node_openshift_io_os_id  label_node_role_kubernetes_io  prometheus                Value
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                                                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                          master                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                                                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                          master                         openshift-monitoring/k8s  12

Querying the thanos-querier API directly returns the same result as the console UI (the console UI uses the thanos-querier API):

$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=cluster:capacity_cpu_cores:sum' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      },
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "label_node_role_kubernetes_io": "master",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      },
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      },
      {
        "metric": {
          "__name__": "cluster:capacity_cpu_cores:sum",
          "label_beta_kubernetes_io_instance_type": "m6a.xlarge",
          "label_kubernetes_io_arch": "amd64",
          "label_node_openshift_io_os_id": "rhcos",
          "label_node_role_kubernetes_io": "master",
          "prometheus": "openshift-monitoring/k8s"
        },
        "value": [
          1682394655.248,
          "12"
        ]
      }
    ]
  }
}
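
One way to confirm the duplication programmatically is to group the returned series by their full label set and count how often each set appears. The following is a minimal Python sketch and not part of the original report: the names `find_duplicate_series`, `make_series`, and `SAMPLE_RESPONSE` are hypothetical, and the sample payload is a trimmed stand-in that only mirrors the shape of the response above.

```python
from collections import Counter


def find_duplicate_series(response):
    """Map each label set that occurs more than once to its occurrence count."""
    counts = Counter(
        frozenset(series["metric"].items())
        for series in response["data"]["result"]
    )
    return {labels: n for labels, n in counts.items() if n > 1}


# Hypothetical, trimmed stand-in for the response above (fewer labels per series).
BASE = {
    "__name__": "cluster:capacity_cpu_cores:sum",
    "prometheus": "openshift-monitoring/k8s",
}
MASTER = {**BASE, "label_node_role_kubernetes_io": "master"}


def make_series(metric):
    """Wrap a label dict in the instant-vector element shape shown above."""
    return {"metric": metric, "value": [1682394655.248, "12"]}


SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [make_series(m) for m in (BASE, MASTER, BASE, MASTER)],
    },
}

duplicates = find_duplicate_series(SAMPLE_RESPONSE)
for labels, count in duplicates.items():
    role = dict(labels).get("label_node_role_kubernetes_io", "")
    print(f"role={role!r} returned {count} times")
```

An empty result from `find_duplicate_series` would indicate that every label set is returned exactly once, which is the expected behavior described in this report.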

The issue does not occur when the expr behind "cluster:capacity_cpu_cores:sum" is queried directly:

Name                            label_beta_kubernetes_io_instance_type  label_kubernetes_io_arch  label_node_openshift_io_os_id  label_node_role_kubernetes_io  prometheus                Value
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                                                         openshift-monitoring/k8s  12
cluster:capacity_cpu_cores:sum  m6a.xlarge                              amd64                     rhcos                          master                         openshift-monitoring/k8s  12

The thanos-querier API should deduplicate these series.
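
To illustrate why the duplicates matter, the sketch below collapses series with identical label sets and compares the summed CPU capacity before and after. It is a client-side Python illustration only and makes no claim about how Thanos Querier deduplicates internally; `dedupe_series`, `total_cores`, and `make_series` are hypothetical names, and the sample series carry fewer labels than the real output in this report.

```python
def dedupe_series(result):
    """Keep only the first series seen for each distinct label set."""
    seen = set()
    unique = []
    for series in result:
        key = frozenset(series["metric"].items())
        if key not in seen:
            seen.add(key)
            unique.append(series)
    return unique


def total_cores(result):
    """Sum the sample values, which the API returns as strings."""
    return sum(float(series["value"][1]) for series in result)


def make_series(role=None):
    """Build a hypothetical, trimmed series; role=None omits the role label."""
    metric = {
        "__name__": "cluster:capacity_cpu_cores:sum",
        "prometheus": "openshift-monitoring/k8s",
    }
    if role is not None:
        metric["label_node_role_kubernetes_io"] = role
    return {"metric": metric, "value": [1682394655.248, "12"]}


# Same pattern as the reported output: each label set appears twice.
raw = [make_series(), make_series("master"), make_series(), make_series("master")]
deduped = dedupe_series(raw)

print(total_cores(raw))      # 48.0, double the real capacity
print(total_cores(deduped))  # 24.0
```

The deduplicated total of 24 matches the cluster described above (6 nodes with 4 CPUs each), whereas summing the duplicated result would report 48.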

Version-Release number of selected component (if applicable):

4.14-ci build with openshift/cluster-monitoring-operator#1926, while testing https://issues.redhat.com/browse/OCPBUGS-10387

How reproducible:

Always

Steps to Reproduce:

1. See the description.

Actual results:

Each node role series is returned, and therefore counted, twice by the thanos-querier API.

Expected results:

Each node role series should be returned, and therefore counted, only once by the thanos-querier API.

links to

Issue on Thanos upstream

openshift/thanos#112: OCPBUGS-12525: fallback Thanos to 0.30.2.

RHEA-2023:5006 rpm

Assignee: Brian Burt

Reporter: Junqi Zhao

QA Contact: Junqi Zhao

Doc Contact: Brian Burt

Votes: 0

Watchers: 8

Created: 2023/04/25 4:08 AM

Updated: 2023/10/31 1:30 PM

Resolved: 2023/10/31 12:56 PM
