Type: Epic
Resolution: Done
Priority: Critical
Fix Version/s: rhos-18.0 FR 2 (Mar 2025)
Affects Version/s: None
Component/s: telemetry-operator
Labels:
None

Epic Name:
detect downtime
Blocked:
False
Blocked Reason:

None
Ready:
False
Parent Link:
OBSDA-574 Container health feature
Dev Approval:
Committed
Docs Approval:
Committed
Epic Status:
To Do
Feature Link:
OBSDA-574 - Container health feature
PM Approval:
Committed
QE Approval:
Committed
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Release Note Text:

.Improved metrics for RHOSO Observability

You can now use new metrics for monitoring the health of RHOSO services, including the following:

* `kube_pod_status_phase`
* `kube_pod_status_ready`
* `node_systemd_unit_state`
* `podman_container_state`
* `podman_container_health`

You can use the `kube_pod_status_phase` and `kube_pod_status_ready` metrics to monitor control plane services.

* `kube_pod_status_phase` - The relevant parameter is `Phase`, with values of Pending, Running, Succeeded, Failed, or Unknown. For each value, the metric reports a Boolean: `1` if the pod is in that phase, or `0` if it is not.

* `kube_pod_status_ready` - This metric also has Boolean values, with `1` indicating that all of the pod's containers are running and its readiness probes are succeeding, and `0` indicating that not all of the containers are running or that a readiness probe did not succeed.
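
As an illustration of the one-series-per-phase Boolean encoding, the following minimal Python sketch picks out the current phase of a pod from a set of `kube_pod_status_phase` samples. The label names `pod` and `phase` are assumed to follow the usual kube-state-metrics convention, and the pod name and the `current_phase` helper are hypothetical.

```python
from typing import Iterable, Mapping, Optional, Tuple

# One sample = (label set, value). The label names "pod" and "phase" are an
# assumption based on the usual kube-state-metrics convention.
Sample = Tuple[Mapping[str, str], float]


def current_phase(samples: Iterable[Sample], pod: str) -> Optional[str]:
    """Return the phase whose series has the value 1 for the given pod."""
    for labels, value in samples:
        if labels.get("pod") == pod and value == 1:
            return labels.get("phase")
    return None  # no series for this pod currently reports 1


# Hypothetical samples for a hypothetical pod name.
samples = [
    ({"pod": "example-api-0", "phase": "Pending"}, 0),
    ({"pod": "example-api-0", "phase": "Running"}, 1),
    ({"pod": "example-api-0", "phase": "Succeeded"}, 0),
    ({"pod": "example-api-0", "phase": "Failed"}, 0),
    ({"pod": "example-api-0", "phase": "Unknown"}, 0),
]

phase = current_phase(samples, "example-api-0")
```

Here `phase` is `"Running"`, because only the `Running` series carries the value `1`.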

You can use the `node_systemd_unit_state` metric to monitor the running state of data plane services.

* `node_systemd_unit_state` - The relevant parameter is `State`, with values of activating, active, deactivating, failed, or inactive. For each value, the metric reports a Boolean: `1` if the unit is in that state, or `0` if it is not.
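
The same encoding can be used to list units in a given state. The sketch below assumes the label names `name` and `state` (the usual node_exporter convention). The unit names and the `units_in_state` helper are hypothetical.

```python
from typing import Iterable, List, Mapping, Tuple

# One sample = (label set, value). The label names "name" and "state" are an
# assumption based on the usual node_exporter systemd collector convention.
Sample = Tuple[Mapping[str, str], float]


def units_in_state(samples: Iterable[Sample], state: str = "failed") -> List[str]:
    """Return the sorted names of units whose series for `state` has value 1."""
    return sorted(
        labels["name"]
        for labels, value in samples
        if labels.get("state") == state and value == 1
    )


# Hypothetical samples for two hypothetical systemd units.
samples = [
    ({"name": "example_a.service", "state": "active"}, 1),
    ({"name": "example_a.service", "state": "failed"}, 0),
    ({"name": "example_b.service", "state": "active"}, 0),
    ({"name": "example_b.service", "state": "failed"}, 1),
]

failed_units = units_in_state(samples)
```

With these samples, `failed_units` contains only `example_b.service`.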

You can use the `podman_container_state` and `podman_container_health` metrics to monitor the health of data plane containerized services.

* `podman_container_state` - This metric can have the following values: -1=unknown, 0=created, 1=initialized, 2=running, 3=stopped, 4=paused, 5=exited, 6=removing, 7=stopping.

* `podman_container_health` - This metric can have the following values: -1=unknown, 0=healthy, 1=unhealthy, 2=starting.
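
Unlike the Boolean metrics, these two metrics encode the state in the sample value itself. A minimal Python sketch of decoding the values listed above follows. The `needs_attention` helper and its policy (flag anything that is not running, or that is running but unhealthy) are illustrative assumptions, not part of the feature.

```python
from enum import IntEnum


class ContainerState(IntEnum):
    """Values of podman_container_state, as listed in the release note."""
    UNKNOWN = -1
    CREATED = 0
    INITIALIZED = 1
    RUNNING = 2
    STOPPED = 3
    PAUSED = 4
    EXITED = 5
    REMOVING = 6
    STOPPING = 7


class ContainerHealth(IntEnum):
    """Values of podman_container_health, as listed in the release note."""
    UNKNOWN = -1
    HEALTHY = 0
    UNHEALTHY = 1
    STARTING = 2


def needs_attention(state: float, health: float) -> bool:
    """Hypothetical policy: flag containers that are not running or are unhealthy."""
    decoded_state = ContainerState(int(state))
    decoded_health = ContainerHealth(int(health))
    return (
        decoded_state != ContainerState.RUNNING
        or decoded_health == ContainerHealth.UNHEALTHY
    )


state_name = ContainerState(5).name
health_name = ContainerHealth(2).name
```

Here `state_name` is `"EXITED"` and `health_name` is `"STARTING"`. A real alerting policy might treat `UNKNOWN` or `STARTING` health differently, for example for containers that define no health check.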

Release Note Type:
Feature
Release Note Status:
Done
Test Coverage:

Proposed
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Sensubility handled downtime detection in pre-18 environments, but it has been removed from the OSP18 release, so we cannot rely on it anymore.

We need to detect when a service is not responding and generate a metric for it.

This has two sides with very different characteristics, which means we will most likely need two separate solutions:

Control Plane: Try to use OpenShift/Kubernetes features to achieve this
Compute nodes: A dedicated exporter will probably be needed

is triggered by

OSPRH-2961 Remove sensubility from distribution

Closed

links to

openstack-k8s-operators/edpm-ansible#774: Healthchecks

openstack-k8s-operators/edpm-ansible#779: Healthchecks resubmit

openstack-k8s-operators/edpm-ansible#789: Add prometheus-podman-exporter deployment

openstack-k8s-operators/openstack-operator#1056: Add support for KSM

openstack-k8s-operators/tcib#217: Add socat for iscsid


Assignee: Martin Magr

Reporter: Juan Larriba

Team: rhos-conplat-observability

Votes: 0

Watchers: 9

Created: 2023/11/14 2:23 PM

Updated: 2025/06/11 7:25 PM

Resolved: 2025/03/24 1:15 PM
