Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: rhos-18.0.7
Affects Version/s: rhos-18.0 FR 2 (Mar 2025)
Component/s: openstack-watcher
Labels:
None

Story Points:
2
Blocked:
False
Blocked Reason:

None
Ready:
False
Docs Approval:
?
Fixed in Build:
openstack-watcher-10.0.1-18.0.20250324164829.c014f81.el9ost
Gerrit Link:
https://review.opendev.org/c/openstack/watcher/+/944795
Regression:
None
Release Note Type:
Release Note Not Required
Release Note Status:
Done
Test Coverage:

New Test Coverage
Git Pull Request:
https://gitlab.cee.redhat.com/eng/openstack/openstack-watcher/-/merge_requests/10
Intelligence Requested:
Market:
Release Blocker:
Rejected
Errata Link:
https://errata.engineering.redhat.com/advisory/147941
Target Version:

rhos-18.0.7

Sprint:
Workload Evolution Sprint 1, Workload Evolution Sprint 2
sprint_count:
2
Severity:
Critical

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

To Reproduce

Steps to reproduce the behavior:

1. Deploy RHOSO with at least two compute nodes and telemetry enabled.
2. Deploy Watcher.
3. Create several VMs and make sure all of them run on the same compute node.
4. Create an ongoing audit with the goal Workload Balancing and the strategy Workload Stabilization, then generate high load in the instances on one of the compute nodes.
5. The audit creates only empty action plans.

Expected behavior

The audit should create an action plan to move the VM with high usage to an empty node.

Found Behavior

Watcher fails to execute any audit whose strategy requires host metrics.

Known workaround

No workaround

Additional context

After podman_exporter and network_exporter were added to telemetry in https://github.com/openstack-k8s-operators/telemetry-operator/pull/627 and https://github.com/openstack-k8s-operators/telemetry-operator/pull/598, there is more than one target in Prometheus with the same value for the `fqdn` label. In this case there is one for node_exporter, one for podman_exporter and one for network_exporter.
The Watcher Prometheus datasource lists all the targets with the label fqdn=<compute_node> and uses the latest one for queries. The latest one may not be the node_exporter target.
As a result, Watcher sends queries for node_exporter metrics to the podman_exporter or network_exporter instance, which return an empty value.
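
The failure mode described above can be illustrated with a minimal sketch. This is not Watcher's actual code: the function name, the target structure, the `job` label values, the example FQDN and the network_exporter port are all assumptions made for illustration. Only port 9882 (podman exporter) and the IP address come from the log in this ticket.

```python
# Hypothetical Prometheus targets for a single compute node. All three
# exporters carry the same fqdn label; only the instance (host:port) differs.
targets = [
    {"labels": {"fqdn": "compute-0.example.com",
                "instance": "192.168.122.102:9100",   # node_exporter default port
                "job": "node_exporter"}},
    {"labels": {"fqdn": "compute-0.example.com",
                "instance": "192.168.122.102:9105",   # assumed port, illustration only
                "job": "network_exporter"}},
    {"labels": {"fqdn": "compute-0.example.com",
                "instance": "192.168.122.102:9882",   # podman exporter, per the log
                "job": "podman_exporter"}},
]


def map_fqdn_to_instance(targets):
    """Build an fqdn -> instance map where the last target seen wins."""
    mapping = {}
    for target in targets:
        labels = target["labels"]
        # Each later target with the same fqdn overwrites the earlier one.
        mapping[labels["fqdn"]] = labels["instance"]
    return mapping


instance = map_fqdn_to_instance(targets)["compute-0.example.com"]
print(instance)
```

With this ordering the map resolves the compute node to the podman exporter instance, so a node_exporter query filtered on that instance matches no series.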

Example logs:

2025-03-17 14:00:02.496 1 DEBUG observabilityclient.prometheus_client [None req-9bbd6640-10da-46fa-9aaa-d947fcff5f4f - - - - - -] Querying prometheus with query: 100 - (avg by (instance)(rate(node_cpu_seconds_total{mode='idle',instance='192.168.122.102:9882'}[600s])) * 100) query

Note that port 9882 belongs to the podman exporter, not node_exporter.

As per a conversation with the cloudops team, having the fqdn label on all the targets running on a compute node is the expected behaviour, so that the node is easy to identify. We should not expect fqdn to be useful for identifying the target for a specific host and exporter type.
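
The merge requests linked from this ticket are titled as querying and aggregating by the fqdn label instead of instance. A rough sketch of what such a query could look like, reusing the expression from the log above, follows. The helper name `build_host_cpu_query` and its parameters are hypothetical, and this is not the actual patch.

```python
def build_host_cpu_query(fqdn, period_seconds, label="fqdn"):
    """Build a host CPU usage PromQL expression filtered by a host label.

    Filtering and aggregating on the fqdn label avoids depending on which
    exporter instance (host:port) was picked for the compute node.
    """
    return (
        f"100 - (avg by ({label})"
        f"(rate(node_cpu_seconds_total{{mode='idle',{label}='{fqdn}'}}"
        f"[{period_seconds}s])) * 100)"
    )


# Example FQDN is an assumption for illustration.
query = build_host_cpu_query("compute-0.example.com", 600)
print(query)
```

Because node_cpu_seconds_total is only exported by node_exporter, selecting on the fqdn label should match the node_exporter series regardless of how many other exporters share that label.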

links to

https://bugs.launchpad.net/watcher/+bug/2103451

RHBA-2025:147941 Release of components for RHOSO 18.0

mentioned on

Merge request - Aggregate by fqdn label instead instance in host cpu metrics

Merge request - Query by fqdn_label instead of instance for host metrics

Assignee: Alfredo Moralejo Alonso

Reporter: Alfredo Moralejo Alonso

Team: rhos-workloads-evolution

Votes: 0

Watchers: 9

Created: 2025/03/17 4:01 PM

Updated: 2025/04/23 1:02 PM

Resolved: 2025/04/23 1:02 PM
