  OpenShift Virtualization / CNV-20467

[2118367] KubeVirtComponentExceedsRequestedMemory Prometheus Rule is Failing to Evaluate


Details

    • Medium

    Description

      +++ This bug was initially created as a clone of Bug #2033077 +++

      Description of problem:

      Received alerts from the two Prometheus pods:
      openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.

      openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m.

      Version-Release number of selected component (if applicable):
      OpenShift 4.9.10
      CNV 4.9.1

      How reproducible:
      Unsure, but the error occurs continually. This is on an upgraded cluster (4.8 -> 4.9). Not sure if it can be reproduced on a fresh cluster.

      Steps to Reproduce:
      1. Have a cluster running the latest CNV and OpenShift v4.8.22
      2. Upgrade cluster to 4.9.10

      Actual results:
      The cluster begins firing alerts about failing to evaluate a Prometheus rule.

      Expected results:
      Prometheus happily evaluates all the CNV alerting rules

      Additional info:
      The alert that is specifically failing is KubeVirtComponentExceedsRequestedMemory.

      The error is:

        found duplicate series for the match group {pod="bridge-marker-dv592"} on the right hand-side of the operation:
        [{__name__="container_memory_usage_bytes", container="bridge-marker", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-bca86b7bfe14679147f29a0a806f04ee9f8ceb6008f5b6bd58e9be4b2f5e35e8.scope", image="registry.redhat.io/container-native-virtualization/bridge-marker@sha256:83d6f2fbf4118162aed2d2b0153b4ad39cfe3b97a3ef06e9c4fbb5e2a3aae915", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_bridge-marker_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"},
         {__name__="container_memory_usage_bytes", container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-e346225c7c5270220cb6b2cce4de9f528c63603b2ba2c87be1e5642f0ac57b0f.scope", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}];
        many-to-many matching not allowed: matching labels must be unique on one side
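      The two colliding series differ only in their container label: one is the bridge-marker container itself, and the other is the pod's pause container, which cAdvisor reports as container="POD". Since the rule matches only on(pod), both series fall into the same match group. A quick way to see the duplicates (an ad-hoc query against the same cluster, not part of the alert rule) is:

        # Lists every container_memory_usage_bytes series for this pod; the error above shows two,
        # one with container="bridge-marker" and one with container="POD" (the pause container).
        container_memory_usage_bytes{namespace="openshift-cnv", pod="bridge-marker-dv592"}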

      The contents of the rule:

      Expression:

        ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) container_memory_usage_bytes{namespace="openshift-cnv"}) < 0
      Testing that rule in the alerting dashboard also returns the error.

      NOTE: the similarly named KubeVirtComponentExceedsRequestedCPU does not appear to be failing, and is slightly different:

      ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="cpu"}) - on(pod) group_left(node) node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="openshift-cnv"}) < 0
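      A quick way to compare the two right-hand sides (an ad-hoc check, not part of the shipped rules) is to count how many series carry each pod label; any count above 1 makes the on(pod) match many-to-many:

        # Raw cAdvisor metric: pods show up more than once (application container plus pause container).
        count by (pod) (container_memory_usage_bytes{namespace="openshift-cnv"})

        # Recording rule used by the CPU alert; it presumably filters out the pause container,
        # so the single-container CNV pods show up only once.
        count by (pod) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="openshift-cnv"})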

      Noting the difference after 'group_left(node)', I tried replacing `container_memory_usage_bytes{namespace="openshift-cnv"}` with `node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}` in the rule; testing it in the alerting console returns no error. So

        ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}) < 0

      seems to work as expected.
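      For comparison, an alternative that does not depend on the recording rule (my own sketch, not something applied in this bug) would be to collapse the raw metric to one series per pod before matching:

        # Hypothetical variant: drop the pause-container and empty-container series,
        # then keep a single value per pod so the right-hand side is unique per match group.
        ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) max by (pod, node) (container_memory_usage_bytes{namespace="openshift-cnv",container!="",container!="POD"})) < 0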

      — Additional comment from Denys Shchedrivyi on 2022-01-05 20:36:44 UTC —

      I see the same error message on a freshly installed CNV 4.10.

      — Additional comment from Katya Gordeeva on 2022-01-07 13:15:21 UTC —

      We are seeing a spike in 4.9+ clusters firing the PrometheusRuleFailures alert. The data is a bit patchy, but for the cluster we were able to get logs from, this bug was causing the alert. More info on the investigation is here: https://coreos.slack.com/archives/C029BREDBSM/p1641274801239700

      — Additional comment from on 2022-01-25 16:46:47 UTC —

      Removing the target version from this BZ to ensure it will be properly triaged.

      — Additional comment from on 2022-01-26 21:04:15 UTC —

      Deferring this to the next y-stream as the impact is that a non-critical alert is unable to fire.

      — Additional comment from Sascha Grunert on 2022-03-16 10:51:23 UTC —

      This may be an appropriate fix: https://github.com/kubevirt/kubevirt/pull/7372

      — Additional comment from Denys Shchedrivyi on 2022-04-19 18:20:53 UTC —

      On the latest build I see that the initial error (Prometheus rule evaluation) is fixed, but now I see the KubeVirtComponentExceedsRequestedMemory alert triggered right after cluster installation; screenshot attached.

      — Additional comment from Denys Shchedrivyi on 2022-04-19 18:24:29 UTC —

      — Additional comment from Antonio Cardace on 2022-05-09 16:00:44 UTC —

      @dshchedr@redhat.com the issue reported in this BZ seems to be fixed; do you see another problem after the patch? If so, can you open a new bug?

      — Additional comment from errata-xmlrpc on 2022-05-09 16:06:10 UTC —

      This bug has been added to advisory RHEA-2022:88137 by Antonio Cardace (acardace@redhat.com)

      — Additional comment from Denys Shchedrivyi on 2022-05-16 23:15:48 UTC —

      Verified on CNV-v4.11.0-334: no messages about evaluation failures.

      — Additional comment from Oscar Casal Sanchez on 2022-07-26 15:14:09 UTC —

      Hello,

      I can see that it is marked as fixed in OCP 4.11. Is it going to be backported to previous versions, or should we request that now that it is verified?

      Thank you so much,
      Oscar

      — Additional comment from Igor Bezukh on 2022-08-15 12:55:10 UTC —

      Hi,

      I will open a backport BZ for 4.9.6 and will backport the fix upstream.
