  1. OpenShift Virtualization
  2. CNV-15429

[2033077] KubeVirtComponentExceedsRequestedMemory Prometheus Rule is Failing to Evaluate


    • Medium
    • No

      Description of problem:

      Received alerts from the two prometheus pods:
      openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.

      openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m.

      Version-Release number of selected component (if applicable):
      OpenShift 4.9.10
      CNV 4.9.1

      How reproducible:
      Unsure, but the error occurs continually. This is on an upgraded cluster (4.8 -> 4.9). Not sure if it can be reproduced on a fresh cluster.

      Steps to Reproduce:
      1. Have cluster running latest CNV, and OpenShift v4.8.22
      2. Upgrade cluster to 4.9.10

      Actual results:
      The cluster begins firing alerts about failing to evaluate a Prometheus rule.

      Expected results:
      Prometheus happily evaluates all the CNV alerting rules

      Additional info:
      The alert that is specifically failing is KubeVirtComponentExceedsRequestedMemory.

      The error is:
      found duplicate series for the match group {pod="bridge-marker-dv592"}
      on the right hand-side of the operation: [{__name__="container_memory_usage_bytes", container="bridge-marker", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-bca86b7bfe14679147f29a0a806f04ee9f8ceb6008f5b6bd58e9be4b2f5e35e8.scope", image="registry.redhat.io/container-native-virtualization/bridge-marker@sha256:83d6f2fbf4118162aed2d2b0153b4ad39cfe3b97a3ef06e9c4fbb5e2a3aae915", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_bridge-marker_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}, {__name__="container_memory_usage_bytes", container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-e346225c7c5270220cb6b2cce4de9f528c63603b2ba2c87be1e5642f0ac57b0f.scope", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side

      The contents of the rule:
      Expression

      ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) container_memory_usage_bytes{namespace="openshift-cnv"}) < 0

      Testing that rule in the alerting dashboard also returns the error.
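      The two series quoted in the error share the same pod label and differ in container ("bridge-marker" vs. "POD"), among other labels, so matching with on(pod) alone cannot tell them apart. As far as I can tell, Prometheus rejects duplicates on the right-hand side even for pods that the left-hand selector never picks, which would explain why a bridge-marker pod shows up in a rule about virt-* containers. A quick diagnostic sketch, assuming the same namespace and label set as in the error above, lists every pod that has more than one container_memory_usage_bytes series and would therefore trip the same check:

      ```
      # Pods in openshift-cnv with more than one memory-usage series.
      # Any result here means on(pod) matching against this metric is ambiguous.
      count by (pod) (container_memory_usage_bytes{namespace="openshift-cnv"}) > 1
      ```

      If this returns any rows, the original expression is expected to fail with the many-to-many error regardless of which pod is named in the message.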

      NOTE: the similarly named KubeVirtComponentExceedsRequestedCPU does not appear to be failing, and is slightly different:

      ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="cpu"}) - on(pod) group_left(node) node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="openshift-cnv"}) < 0

      Noting the difference after 'group_left(node)...', I tried replacing `container_memory_usage_bytes{namespace="openshift-cnv"}` with `node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}` in the rule. Testing the modified rule in the alerting console returns no error. So

      ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}) < 0

      seems to work as expected.
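
      Another possible shape for the rule, offered only as an untested sketch, keeps container_memory_usage_bytes but narrows the right-hand side to the same containers as the left and matches on both pod and container. It assumes each (pod, container) pair has exactly one container_memory_usage_bytes series at a time, which the error output above suggests but does not prove:

      ```
      # Assumption: one cadvisor series per (pod, container) in openshift-cnv.
      # The container filter on the right drops the "POD" sandbox series and
      # unrelated pods such as bridge-marker.
      (
        kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}
        - on(pod, container) group_left(node)
        container_memory_usage_bytes{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv"}
      ) < 0
      ```

      Note that container_memory_usage_bytes includes page cache while the working-set metric excludes inactive file pages, so this variant and the one above may not fire at the same point.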

              ibezukh Igor Bezukh
              dcritch1@redhat.com David Critch
              Denys Shchedrivyi Denys Shchedrivyi