Type: Feature Request
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Storage
Labels: None
Target Version: None
Activity Type: Future Sustainability
Status Summary: None
Blocked: False
Blocked Reason: None
Products: None
Hierarchy Progress Bar: None

1. Proposed title of this feature request

Add disk and filesystem error metrics

2. What is the nature and description of the request?

The customer experienced network outages that caused iSCSI multipath sessions to disconnect and reconnect. Some OpenShift PVs were not reconnected correctly, which led to kernel errors like these:

Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236671992 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672248 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672376 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
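
As an illustration of what such a metric could be derived from, the following is a minimal Python sketch that counts kernel I/O error lines per device and operation. The regular expression is based only on the log lines shown above, and the names `IO_ERROR_RE` and `count_io_errors` are hypothetical; this is not how node-exporter works today.

```python
import re
from collections import Counter

# Matches kernel block-layer messages of the form seen in this report, e.g.:
#   kernel: I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags ...
# Assumption: other kernel versions may format this message differently.
IO_ERROR_RE = re.compile(
    r"kernel: I/O error, dev (?P<dev>[^,\s]+), sector (?P<sector>\d+) "
    r"op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)


def count_io_errors(lines):
    """Return a Counter keyed by (device, operation) for matching log lines."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


# Two of the log lines from this report, used as sample input.
SAMPLE = [
    "Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, "
    "sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2",
    "Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, "
    "sector 236671992 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2",
]

if __name__ == "__main__":
    for (dev, op), n in count_io_errors(SAMPLE).items():
        print(dev, op, n)
```
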

Accessing the filesystems on these PVs fails with the following error message:

[root@example-node ~]# ls -l /var/lib/kubelet/pods/607e0226-10f7-4e1b-a36a-27eb0b89e4e3/volume-subpaths/pvc-6942acd7-52e5-48f1-88ac-320f42a90313/prometheus/3
ls: cannot access '/var/lib/kubelet/pods/607e0226-10f7-4e1b-a36a-27eb0b89e4e3/volume-subpaths/pvc-6942acd7-52e5-48f1-88ac-320f42a90313/prometheus/3': Input/output error
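
The same failure can be detected programmatically. The sketch below, in Python, calls `stat()` on a path and separates `EIO` ("Input/output error") from other failures. The function name `probe_path` and its return values are hypothetical, and the `stat` parameter exists only so the check can be exercised without a broken disk. Note that a `stat()` call on a hung mount may block, which this sketch does not handle.

```python
import errno
import os


def probe_path(path, stat=os.stat):
    """Classify a stat() of path as 'ok', 'io_error', or 'other_error'."""
    try:
        stat(path)
    except OSError as exc:
        # EIO is the errno behind the "Input/output error" message from ls.
        if exc.errno == errno.EIO:
            return "io_error"
        return "other_error"
    return "ok"


if __name__ == "__main__":
    print(probe_path("."))
```
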

Local disks that experience filesystem issues would show similar symptoms.

This request is to add a metric to node-exporter so that administrators can alert on these kinds of issues. An upstream issue already exists: https://github.com/prometheus/node_exporter/issues/3005

Ideally, an alert for these kinds of errors would also be added to the default OpenShift Container Platform monitoring.
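
Until such a metric exists, one possible stopgap is to write error counts to a file that node-exporter's textfile collector reads. The Python sketch below renders counts in the Prometheus text exposition format. The metric name `example_disk_io_errors_total`, its labels, and the function `render_metrics` are assumptions made for illustration, not an existing or proposed node-exporter metric.

```python
def render_metrics(counts):
    """Render {(device, op): count} in Prometheus text exposition format.

    The metric name and labels are hypothetical and chosen for this example.
    """
    lines = [
        "# HELP example_disk_io_errors_total I/O errors seen in the kernel log.",
        "# TYPE example_disk_io_errors_total counter",
    ]
    # Sort so the output is stable between runs.
    for (device, op), value in sorted(counts.items()):
        lines.append(
            f'example_disk_io_errors_total{{device="{device}",op="{op}"}} {value}'
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics({("sde", "WRITE"): 4}), end="")
```

An alert rule could then fire when this counter increases for any device.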

3. Why does the customer need this?

At this time, there are no metrics for failing disk I/O or filesystem issues. Also, alerting via logging is currently not possible (RFE-5656). Most components rely on the underlying operating system to monitor this.

This feature would improve the reliability of OpenShift Container Platform and allow administrators to detect low-level issues with their disks or filesystems.

4. List any affected packages or components.

node-exporter

Assignee: Gregory Charot
Reporter: Simon Krenger
Need Info From: None
Votes: 1
Watchers: 7
Created: 2025/03/13 11:38 AM
Updated: 2025/10/10 1:27 AM
Target start: None
Target end: None
