- Bug
- Resolution: Done-Errata
- rhos-18.0 FR 2 (Mar 2025)
- openstack-watcher-10.0.1-18.0.20250324164829.c014f81.el9ost
- Release Note Not Required
- Done
- New Test Coverage
- Rejected
- Workload Evolution Sprint 1, Workload Evolution Sprint 2
- Critical
To Reproduce
Steps to reproduce the behavior:
- Deploy RHOSO with at least two compute nodes and telemetry enabled.
- Deploy Watcher.
- Create several VMs and make sure all of them run on the same compute node.
- Create an ongoing audit with the goal Workload Balancing and the strategy Workload stabilization, then generate high load in the instances on one of the compute nodes.
- The audit creates only empty action plans.
Expected behavior
- The audit should create an action plan to move the VM with high usage to an empty node.
Found Behavior
- Watcher fails to execute any audit whose strategy requires host metrics.
Known workaround
- No workaround
Additional context
- After podman_exporter and network_exporter were added to telemetry in https://github.com/openstack-k8s-operators/telemetry-operator/pull/627 and https://github.com/openstack-k8s-operators/telemetry-operator/pull/598, more than one target in Prometheus has the same value for the `fqdn` label. In this case there is one target for node_exporter, one for podman_exporter and one for network_exporter.
- The Watcher Prometheus datasource lists all the targets with label fqdn=<compute_node> and uses the latest one for queries. The latest one may not be the node_exporter target.
- As a result, Watcher sends queries for node_exporter metrics to the podman_exporter or network_exporter instance, and these queries return empty values.
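The selection problem described above can be illustrated with a minimal sketch. It is not Watcher's actual code: the helper names `build_fqdn_map` and `build_cpu_query`, the hostname, the job names, and ports 9100 and 9105 are assumptions made for illustration. Only port 9882 and the query text come from the log below. The target entries are shaped like the `labels` field returned by the Prometheus `/api/v1/targets` endpoint.

```python
# Hypothetical targets for one compute node; all three share the same fqdn.
targets = [
    {"labels": {"fqdn": "compute-0.example", "job": "node_exporter",
                "instance": "192.168.122.102:9100"}},   # port assumed
    {"labels": {"fqdn": "compute-0.example", "job": "network_exporter",
                "instance": "192.168.122.102:9105"}},   # port assumed
    {"labels": {"fqdn": "compute-0.example", "job": "podman_exporter",
                "instance": "192.168.122.102:9882"}},   # port seen in the log
]


def build_fqdn_map(targets):
    """Map fqdn -> instance; a later target overwrites an earlier one."""
    fqdn_map = {}
    for target in targets:
        labels = target["labels"]
        fqdn_map[labels["fqdn"]] = labels["instance"]
    return fqdn_map


def build_cpu_query(instance, period=600):
    """Build a host CPU usage query like the one in the example log."""
    return (
        "100 - (avg by (instance)(rate(node_cpu_seconds_total"
        "{mode='idle',instance='%s'}[%ds])) * 100)" % (instance, period)
    )


instance = build_fqdn_map(targets)["compute-0.example"]
# The last target wins, so the node_exporter metric is requested from the
# podman_exporter instance, which does not expose it.
print(build_cpu_query(instance))
```

With this ordering the map resolves the compute node to the podman_exporter instance, so the CPU query matches no series.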
Example logs:
2025-03-17 14:00:02.496 1 DEBUG observabilityclient.prometheus_client [None req-9bbd6640-10da-46fa-9aaa-d947fcff5f4f - - - - - -] Querying prometheus with query: 100 - (avg by (instance)(rate(node_cpu_seconds_total{mode='idle',instance='192.168.122.102:9882'}[600s])) * 100) query
Note that port 9882 belongs to the podman exporter, not to node_exporter.
As per a conversation with the cloudops team, having the fqdn label on all the targets running on a compute node is the expected behaviour, so that the node can be identified easily. The fqdn label alone therefore cannot be used to identify the target for a specific host and exporter type.
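Since fqdn alone cannot identify the exporter, one possible way to disambiguate is to build the fqdn-to-instance map from series of a metric that only node_exporter exposes, instead of from the target list. The sketch below assumes this approach and is not necessarily the fix that was shipped. The function name `fqdn_map_from_series` and the sample data are hypothetical; `node_uname_info` is a standard node_exporter metric, and the entries are shaped like the `result` list of a Prometheus instant query.

```python
# Hypothetical result of querying a node_exporter-only metric such as
# node_uname_info; each series carries the labels of the target it came from.
series = [
    {"metric": {"__name__": "node_uname_info",
                "fqdn": "compute-0.example",
                "instance": "192.168.122.102:9100"},   # port assumed
     "value": [1742220002.0, "1"]},
    {"metric": {"__name__": "node_uname_info",
                "fqdn": "compute-1.example",
                "instance": "192.168.122.103:9100"},   # port assumed
     "value": [1742220002.0, "1"]},
]


def fqdn_map_from_series(series, label="fqdn"):
    """Map fqdn -> instance using only series that carry both labels."""
    fqdn_map = {}
    for entry in series:
        metric = entry["metric"]
        if label in metric and "instance" in metric:
            fqdn_map[metric[label]] = metric["instance"]
    return fqdn_map


print(fqdn_map_from_series(series))
```

Because podman_exporter and network_exporter do not expose this metric, their instances never enter the map, regardless of the order in which targets are listed.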
- links to RHBA-2025:147941 Release of components for RHOSO 18.0