Bug
Resolution: Done-Errata
Normal
None
Description of problem:
Received alerts from the two prometheus pods:
openshift-monitoring/prometheus-k8s-0 has failed to evaluate 10 rules in the last 5m.
openshift-monitoring/prometheus-k8s-1 has failed to evaluate 10 rules in the last 5m.
Version-Release number of selected component (if applicable):
OpenShift 4.9.10
CNV 4.9.1
How reproducible:
Unsure, but the error occurs continually. This is an upgraded cluster (4.8 -> 4.9); I am not sure whether it can be reproduced on a fresh cluster.
Steps to Reproduce:
1. Have a cluster running the latest CNV and OpenShift v4.8.22
2. Upgrade cluster to 4.9.10
Actual results:
The cluster begins firing alerts that Prometheus is failing to evaluate a rule.
Expected results:
Prometheus happily evaluates all the CNV alerting rules
Additional info:
The alert that is specifically failing is KubeVirtComponentExceedsRequestedMemory.
The error is:
found duplicate series for the match group
on the right hand-side of the operation: [{__name__="container_memory_usage_bytes", container="bridge-marker", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-bca86b7bfe14679147f29a0a806f04ee9f8ceb6008f5b6bd58e9be4b2f5e35e8.scope", image="registry.redhat.io/container-native-virtualization/bridge-marker@sha256:83d6f2fbf4118162aed2d2b0153b4ad39cfe3b97a3ef06e9c4fbb5e2a3aae915", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_bridge-marker_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}, {__name__="container_memory_usage_bytes", container="POD", endpoint="https-metrics", id="/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod68aaef0e_d95a_47d0_a898_d45d4d613f58.slice/crio-e346225c7c5270220cb6b2cce4de9f528c63603b2ba2c87be1e5642f0ac57b0f.scope", instance="10.42.0.102:10250", job="kubelet", metrics_path="/metrics/cadvisor", name="k8s_POD_bridge-marker-dv592_openshift-cnv_68aaef0e-d95a-47d0-a898-d45d4d613f58_0", namespace="openshift-cnv", node="node1.cloud.xana.du", pod="bridge-marker-dv592", prometheus="openshift-monitoring/k8s", service="kubelet"}];many-to-many matching not allowed: matching labels must be unique on one side
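To illustrate what the error is complaining about, here is a minimal Python sketch (not Prometheus code) that groups the two right-hand-side series from the message by the `on(pod)` label. The label sets are abbreviated copies of the ones in the error; the helper name `duplicate_match_groups` is hypothetical and exists only for this illustration.

```python
from collections import defaultdict

# Two right-hand-side series, abbreviated from the error message above.
rhs_series = [
    {"__name__": "container_memory_usage_bytes", "container": "bridge-marker",
     "namespace": "openshift-cnv", "pod": "bridge-marker-dv592"},
    {"__name__": "container_memory_usage_bytes", "container": "POD",
     "namespace": "openshift-cnv", "pod": "bridge-marker-dv592"},
]

def duplicate_match_groups(series, on_labels):
    """Group series by the on(...) labels; return groups with more than one member.

    Hypothetical helper for illustration only, not a Prometheus API.
    """
    groups = defaultdict(list)
    for labels in series:
        key = tuple(labels.get(name, "") for name in on_labels)
        groups[key].append(labels)
    return {key: members for key, members in groups.items() if len(members) > 1}

# With on(pod), both series collapse onto the same match key.
dupes = duplicate_match_groups(rhs_series, ["pod"])
for key, members in dupes.items():
    print(key, [m["container"] for m in members])
```

Because `group_left` requires the right-hand side to contribute at most one series per match key, two series sharing the same `pod` value is enough to make the evaluation fail.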
The contents of the rule:
Expression
((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) container_memory_usage_bytes{namespace="openshift-cnv"}) < 0
Testing that rule in the alerting dashboard also returns the error.
NOTE: the similarly named KubeVirtComponentExceedsRequestedCPU does not appear to be failing, and is slightly different:
((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="cpu"}) - on(pod) group_left(node) node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate{namespace="openshift-cnv"}) < 0
Noting the difference after 'group_left(node)...', I tried replacing `container_memory_usage_bytes{namespace="openshift-cnv"}` with `node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}` in the rule; testing that in the alerting console returns no error. So
((kube_pod_container_resource_requests{container=~"virt-controller|virt-api|virt-handler|virt-operator",namespace="openshift-cnv",resource="memory"}) - on(pod) group_left(node) node_namespace_pod_container:container_memory_working_set_bytes:sum_rate{namespace="openshift-cnv"}) < 0
seems to work as expected.
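As a rough illustration of why a different right-hand side can avoid the error, the Python sketch below counts series per `pod` before and after dropping the `container="POD"` series seen in the error message. This filter is an assumption made for illustration; it is not taken from the rule above and is not necessarily the fix that shipped. The helper name `series_per_key` is hypothetical.

```python
from collections import Counter

# Abbreviated right-hand-side series from the error message in this report.
rhs_series = [
    {"container": "bridge-marker", "pod": "bridge-marker-dv592"},
    {"container": "POD", "pod": "bridge-marker-dv592"},
]

def series_per_key(series, on_labels):
    """Count how many series share each on(...) match key (illustrative helper)."""
    return Counter(tuple(s.get(name, "") for name in on_labels) for s in series)

# Unfiltered: two series share one pod, so on(pod) matching is ambiguous.
before = series_per_key(rhs_series, ["pod"])

# Assumed filter: drop the "POD" series before matching.
filtered = [s for s in rhs_series if s["container"] != "POD"]
after = series_per_key(filtered, ["pod"])

print("before:", dict(before))
print("after:", dict(after))
```

A pod with more than one regular container would still yield several series per `pod` after such a filter, so this only shows the mechanism for the series quoted in the error.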