  OpenShift Virtualization / CNV-20467

[2118367] KubeVirtComponentExceedsRequestedMemory Prometheus Rule is Failing to Evaluate


Details

    • Medium

    Description

      +++ This bug was initially created as a clone of Bug #2033077 +++

      Description of problem:

      Received alerts from the two Prometheus pods:
      openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.

      openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m.

      Version-Release number of selected component (if applicable):
      OpenShift 4.9.10
      CNV 4.9.1

      How reproducible:
      Unsure, but the error occurs continually. This is on an upgraded cluster (4.8 -> 4.9). Not sure if it can be reproduced on a fresh cluster.

      Steps to Reproduce:
      1. Have a cluster running the latest CNV and OpenShift v4.8.22
      2. Upgrade cluster to 4.9.10

      Actual results:
      The cluster begins firing alerts about failing to evaluate a Prometheus rule.

      Expected results:
      Prometheus happily evaluates all the CNV alerting rules

      Additional info:
      The alert that is specifically failing is KubeVirtComponentExceedsRequestedMemory.

      The error is:

        found duplicate series for the match group {pod="bridge-marker-dv592"} on the right hand-side of the operation:
        [{__name__="container_memory_usage_bytes", container="bridge-marker", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-bca86b7bfe14679147f29a0a806f04ee9f8ceb6008f5b6bd58e9be4b2f5e35e8.scope", image="registry.redhat.io/container-native-virtualization/bridge-marker@sha256:83d6f2fbf4118162aed2d2b0153b4ad39cfe3b97a3ef06e9c4fbb5e2a3aae915", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_bridge-marker_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"},
         {__name__="container_memory_usage_bytes", container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-e346225c7c5270220cb6b2cce4de9f528c63603b2ba2c87be1e5642f0ac57b0f.scope", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}];
        many-to-many matching not allowed: matching labels must be unique on one side
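      The two colliding series differ only in their container label: one is the bridge-marker container itself, and the other is the pod's pause container, which cAdvisor reports as container="POD". Since the rule matches only on(pod), both series fall into the same match group. A quick way to see the duplicates (an ad-hoc query against the same cluster, not part of the alert rule) is:

        # Lists every container_memory_usage_bytes series for this pod; the error above shows two,
        # one with container="bridge-marker" and one with container="POD" (the pause container).
        container_memory_usage_bytes{namespace="openshift-cnv", pod="bridge-marker-dv592"}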

      The contents of the rule:

      Expression:

        ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) container_memory_usage_bytes{namespace="openshift-cnv"}) < 0
      Testing that rule in the alerting dashboard also returns the error.

      NOTE: the similarly named KubeVirtComponentExceedsRequestedCPU does not appear to be failing, and is slightly different:

      ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="cpu"}) - on(pod) group_left(node) node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="openshift-cnv"}) < 0
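      A quick way to compare the two right-hand sides (an ad-hoc check, not part of the shipped rules) is to count how many series carry each pod label; any count above 1 makes the on(pod) match many-to-many:

        # Raw cAdvisor metric: pods show up more than once (application container plus pause container).
        count by (pod) (container_memory_usage_bytes{namespace="openshift-cnv"})

        # Recording rule used by the CPU alert; it presumably filters out the pause container,
        # so the single-container CNV pods show up only once.
        count by (pod) (node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="openshift-cnv"})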

      Noting the difference after 'group_left(node)', I tried replacing `container_memory_usage_bytes{namespace="openshift-cnv"}` with `node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}` in the rule; testing it in the alerting console returns no error. So

        ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}) < 0

      seems to work as expected.
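      For comparison, an alternative that does not depend on the recording rule (my own sketch, not something applied in this bug) would be to collapse the raw metric to one series per pod before matching:

        # Hypothetical variant: drop the pause-container and empty-container series,
        # then keep a single value per pod so the right-hand side is unique per match group.
        ((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) max by (pod, node) (container_memory_usage_bytes{namespace="openshift-cnv",container!="",container!="POD"})) < 0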

      — Additional comment from Denys Shchedrivyi on 2022-01-05 20:36:44 UTC —

      I see the same error message on a freshly installed CNV 4.10.

      — Additional comment from Katya Gordeeva on 2022-01-07 13:15:21 UTC —

      We are seeing a spike in 4.9+ clusters firing the PrometheusRuleFailures alert. The data is a bit patchy, but for the cluster we were able to get logs from, this bug was causing the alert. More info on the investigation is here: https://coreos.slack.com/archives/C029BREDBSM/p1641274801239700

      — Additional comment from on 2022-01-25 16:46:47 UTC —

      Removing the target version from this BZ to ensure it will be properly triaged.

      — Additional comment from on 2022-01-26 21:04:15 UTC —

      Deferring this to the next y-stream as the impact is that a non-critical alert is unable to fire.

      — Additional comment from Sascha Grunert on 2022-03-16 10:51:23 UTC —

      This may be an appropriate fix: https://github.com/kubevirt/kubevirt/pull/7372

      — Additional comment from Denys Shchedrivyi on 2022-04-19 18:20:53 UTC —

      On the latest build I see that the initial error (Prometheus rule evaluation) is fixed, but now I see the KubeVirtComponentExceedsRequestedMemory alert triggered right after cluster installation; screenshot attached.

      — Additional comment from Denys Shchedrivyi on 2022-04-19 18:24:29 UTC —

      — Additional comment from Antonio Cardace on 2022-05-09 16:00:44 UTC —

      @dshchedr@redhat.com the issue reported in this BZ seems to be fixed; do you see another problem after the patch? If so, can you open a new bug?

      — Additional comment from errata-xmlrpc on 2022-05-09 16:06:10 UTC —

      This bug has been added to advisory RHEA-2022:88137 by Antonio Cardace (acardace@redhat.com)

      — Additional comment from Denys Shchedrivyi on 2022-05-16 23:15:48 UTC —

      Verified on CNV-v4.11.0-334: no messages about evaluation failures.

      — Additional comment from Oscar Casal Sanchez on 2022-07-26 15:14:09 UTC —

      Hello,

      I can see that it is marked as fixed in OCP 4.11. Is it going to be backported to previous versions, or should we request that now that it is verified?

      Thank you so much,
      Oscar

      — Additional comment from Igor Bezukh on 2022-08-15 12:55:10 UTC —

      Hi,

      I will open a backport BZ for 4.9.6 and will backport the fix upstream.
