OpenShift Request For Enhancement: RFE-7259

Add disk and filesystem error metrics

    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Major
    • Storage
    • Future Sustainability

      1. Proposed title of this feature request

      Add disk and filesystem error metrics

      2. What is the nature and description of the request?

      The customer experienced network outages that caused iSCSI multipathing to disconnect and reconnect. Some OpenShift PVs were not reconnected correctly, which led to kernel errors like these:

      Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
      Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236671992 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
      Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672248 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
      Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672376 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2

      Accessing the filesystems on these PVs fails with the following error message:

      [root@example-node ~]# ls -l /var/lib/kubelet/pods/607e0226-10f7-4e1b-a36a-27eb0b89e4e3/volume-subpaths/pvc-6942acd7-52e5-48f1-88ac-320f42a90313/prometheus/3
      ls: cannot access '/var/lib/kubelet/pods/607e0226-10f7-4e1b-a36a-27eb0b89e4e3/volume-subpaths/pvc-6942acd7-52e5-48f1-88ac-320f42a90313/prometheus/3': Input/output error
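
      The failure above is an EIO ("Input/output error") returned by the filesystem. As a rough illustration of how such a state could be detected programmatically, the following minimal Python sketch checks whether a stat() call on a path fails with EIO. The function name path_has_io_error is hypothetical and is not part of node-exporter. A stat() on a stalled device can also block, so a real check would likely need a timeout.

```python
import errno
import os


def path_has_io_error(path, stat=os.stat):
    """Return True if stat() on `path` fails with EIO (Input/output error).

    `stat` is injectable so the check can be exercised without a broken disk.
    Other failures (for example a missing path) are not counted as I/O errors.
    """
    try:
        stat(path)
    except OSError as exc:
        return exc.errno == errno.EIO
    return False


if __name__ == "__main__":
    # Hypothetical usage: probe the current directory.
    print(path_has_io_error("."))
```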

      Local disks with filesystem problems would show similar symptoms.

      This request asks for a metric in node-exporter that makes it possible to alert on these kinds of issues. An upstream issue for this already exists: https://github.com/prometheus/node_exporter/issues/3005
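
      Until such a metric exists, one possible workaround is to derive it from the kernel log and expose it through the node-exporter textfile collector. The sketch below counts kernel lines in the "I/O error, dev ..." format shown earlier and renders them in the Prometheus text exposition format. The metric name node_disk_io_errors_total and both function names are assumptions made for illustration; they are not an existing node-exporter metric or API. A production version would also need to persist counts across runs so the counter never decreases.

```python
import re
from collections import Counter

# Matches the kernel block-layer message quoted in this request, e.g.
# "I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 ..."
IO_ERROR_RE = re.compile(
    r"I/O error, dev (?P<dev>[^\s,]+), sector \d+ "
    r"op 0x[0-9a-fA-F]+:\((?P<op>\w+)\)"
)


def count_io_errors(lines):
    """Count I/O error lines per (device, operation) pair."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("op"))] += 1
    return counts


def render_metrics(counts, name="node_disk_io_errors_total"):
    """Render counts in Prometheus text format (hypothetical metric name)."""
    out = [
        f"# HELP {name} Kernel-reported block I/O errors (illustrative).",
        f"# TYPE {name} counter",
    ]
    for (dev, op), n in sorted(counts.items()):
        out.append(f'{name}{{device="{dev}",op="{op}"}} {n}')
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = [
        "kernel: I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2",
        "kernel: I/O error, dev sde, sector 236671992 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2",
    ]
    print(render_metrics(count_io_errors(sample)), end="")
```

      The rendered text could be written to a .prom file in the directory that node-exporter reads via --collector.textfile.directory, and an alert could then fire when the counter increases.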

      Ideally, an alert for these kinds of errors would also be added to the default OpenShift Container Platform monitoring.

      3. Why does the customer need this?

      At this time, there are no metrics for failing disk I/O or filesystem issues. Also, alerting via logging is currently not possible (RFE-5656). Most components rely on the underlying operating system to monitor this.

      This feature would improve the reliability of OpenShift Container Platform and allow administrators to detect low-level issues with their disks or filesystems.

      4. List any affected packages or components.

      node-exporter

              rh-gs-gcharot Gregory Charot
              rhn-support-skrenger Simon Krenger