Type: Bug
Resolution: Done
Priority: Normal
Version: ACM 2.9.0
Sprint: SF Ready/Refined Backlog, SF Train-19
Severity: Moderate
Description of problem:
If one or more managed clusters has a label whose name starts with a digit, clusterlifecycle-state-metrics produces output that is invalid according to the Prometheus exposition format, and Prometheus is then unable to scrape the target correctly. This is because label names in Prometheus are not allowed to start with a digit (they must match the label name regex `[a-zA-Z_][a-zA-Z0-9_]*`).
This can cause problems further down the line for Observability, as the Grafana dashboards depend on the `acm_managed_cluster_labels` metric being present.
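For illustration, a hypothetical series like the one below (abbreviated, using the label from the reproduction steps) would be rejected by the Prometheus text-format parser because the label name `5g-test` begins with a digit:
acm_managed_cluster_labels{cloud="Amazon",5g-test="true"} 1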
Version-Release number of selected component (if applicable):
ACM 2.9 (customer version), ACM 2.11 (the version I tested on). All ACM versions are likely affected.
How reproducible:
Always
Steps to Reproduce:
- Add a label that starts with a digit to a managed cluster, for example the label `5g-test` as shown below (an equivalent one-line command is shown after these steps):
❯ oc get managedcluster local-cluster -o yaml
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
  ...
  generation: 4
  labels:
    5g-test: "true"
    cloud: Amazon
  ...
- Now, on the hub, go to the OpenShift console under "Observe" -> "Targets" and notice that the clusterlifecycle-state-metrics-v2 target is down
- See the attached screenshot for the error that is displayed
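The label from step 1 can also be added with a single command (assuming a managed cluster named `local-cluster`):
❯ oc label managedcluster local-cluster 5g-test=true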
Actual results:
Prometheus is unable to scrape the clusterlifecycle-state-metrics target, causing the Observability Grafana dashboards to be completely empty, as they depend on a metric from this component.
Expected results:
- Ideally: the offending label is renamed or omitted from the clusterlifecycle-state-metrics-v2 output, so that no disruption is caused further down the line (see the sketch after this list). A warning message should be present in the clusterlifecycle-state-metrics-v2 logs, so that it is easy to figure out why an expected label is missing or has been renamed.
- Alternatively: make the pod fail completely when the output is invalid (perhaps a health check could validate the output using promtool or similar), or at least log an error message in the pod's logs so that it is possible to understand what is going on.
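As a purely illustrative sketch of the renaming idea (not the component's actual logic), a rule such as "replace characters that are invalid in Prometheus label names with `_` and prefix a leading digit with `_`" would turn `5g-test` into `_5g_test`:
❯ echo "5g-test" | sed -E 's/[^a-zA-Z0-9_]/_/g; s/^([0-9])/_\1/'
_5g_test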
Additional info:
Example use of promtool:
❯ oc exec -it -n openshift-monitoring prometheus-k8s-0 -c prometheus -- /bin/bash
bash-5.1$ curl -s -k https://clusterlifecycle-state-metrics-v2.multicluster-engine.svc.cluster.local:8443/metrics | promtool check metrics
Causes:
- ACM-14149: Grafana is not getting data after object store disruption and resolution (Closed)