Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: CNV v4.11.0
Affects Version/s: None
Component/s: CNV Virtualization
Labels:
- cnv-4+
- cnvbugsm
- devel_ack+
- pm_ack+
- qa_ack+
- qe_test_coverage?

Blocked:
False
Ready:
False
BZ Status:
CLOSED
BZ URL:
https://bugzilla.redhat.com/show_bug.cgi?id=2033077
Bugzilla Bug:
RHBZ: 2033077

Severity:
Medium

Regression:
No

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Description of problem:

Received alerts from the two prometheus pods:
openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.

openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m.

Version-Release number of selected component (if applicable):
OpenShift 4.9.10
CNV 4.9.1

How reproducible:
Unsure, but the error occurs continually. This is on an upgraded cluster (4.8 -> 4.9); not sure whether it can be reproduced on a fresh cluster.

Steps to Reproduce:
1. Have a cluster running the latest CNV and OpenShift v4.8.22
2. Upgrade cluster to 4.9.10

Actual results:
The cluster begins firing alerts about a Prometheus rule that fails to evaluate.

Expected results:
Prometheus happily evaluates all the CNV alerting rules

Additional info:
The alert that is specifically failing is KubeVirtComponentExceedsRequestedMemory.

The error is:
found duplicate series for the match group {pod="bridge-marker-dv592"}
on the right hand-side of the operation: [{__name__="container_memory_usage_bytes", container="bridge-marker", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-bca86b7bfe14679147f29a0a806f04ee9f8ceb6008f5b6bd58e9be4b2f5e35e8.scope", image="registry.redhat.io/container-native-virtualization/bridge-marker@sha256:83d6f2fbf4118162aed2d2b0153b4ad39cfe3b97a3ef06e9c4fbb5e2a3aae915", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_bridge-marker_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}, {__name__="container_memory_usage_bytes", container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-e346225c7c5270220cb6b2cce4de9f528c63603b2ba2c87be1e5642f0ac57b0f.scope", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side

The contents of the rule:
Expression

((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) container_memory_usage_bytes{namespace="openshift-cnv"}) < 0

Testing that rule in the alerting dashboard also returns the error.
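
To see which pods contribute more than one series on the right-hand side of the match, a query along these lines can be run in the same console. This is only a diagnostic sketch; it uses nothing beyond the metric and namespace already quoted in the error message.

```promql
# List pods in openshift-cnv that have more than one
# container_memory_usage_bytes series (e.g. the "POD" sandbox
# container plus the application container). Any pod returned
# here makes on(pod) matching ambiguous on that side.
count by (pod) (
  container_memory_usage_bytes{namespace="openshift-cnv"}
) > 1
```

Any pod in the result, such as bridge-marker-dv592 from the error above, is enough to make the whole expression fail, even if that pod never appears on the left-hand side.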

NOTE: the similarly named KubeVirtComponentExceedsRequestedCPU does not appear to be failing, and is slightly different:

((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="cpu"}) - on(pod) group_left(node) node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="openshift-cnv"}) < 0

Noting the difference after 'group_left(node)', I tried replacing `container_memory_usage_bytes{namespace="openshift-cnv"}` with `node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}` in the rule; testing that in the alerting console returns no error. So the following expression

((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}) < 0

seems to work as expected.
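
Another possible way to make the right-hand side unique, shown here only as an untested sketch and not as the fix that was shipped, is to restrict `container_memory_usage_bytes` to the same containers as the left-hand side and match on both `pod` and `container`. This assumes both metrics carry a `container` label with identical values for these components, and that each (pod, container) pair has a single series.

```promql
# Hypothetical variant of KubeVirtComponentExceedsRequestedMemory:
# the container filter drops the "POD" sandbox series and unrelated
# pods such as bridge-marker, so each (pod, container) pair should
# appear once on the right-hand side.
(
  kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}
  - on(pod, container) group_left(node)
  container_memory_usage_bytes{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv"}
) < 0
```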

blocks

CNV-20461 [2118317] KubeVirtComponentExceedsRequestedMemory Prometheus Rule is Failing to Evaluate