OpenShift Logging / LOG-7317

[release-6.2] Cannot create AlertingRule for infrastructure logs without specifying the namespace


    • Quality / Stability / Reliability
    • Status: VERIFIED
    • Release Note Text: With this fix, users can create RecordingRules or AlertingRules for the infrastructure and audit tenants without having to specify a namespace label.
    • Release Note Type: Bug Fix
    • Sprint: Log Storage - Sprint 273, Logging - Sprint 274
    • Severity: Important

      Problem statement

      To detect and address storage issues, a customer would like to alert on kernel messages indicating that there are I/O errors, such as the following:

      Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
      Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236671992 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
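
      For reference, these kernel messages can typically also be inspected directly on the affected node with journalctl in a debug shell (sketch only; "example-node" is a placeholder for the actual node name):

      # Show kernel ring buffer messages on the node and filter for I/O errors
      $ oc debug node/example-node -- chroot /host journalctl -k | grep 'I/O error'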

      These messages are also visible via Loki / Infrastructure Logging:

      {"@timestamp":"2025-03-12T12:11:40.169048Z","_RUNTIME_SCOPE":"system","_SOURCE_MONOTONIC_TIMESTAMP":"1012091349","hostname":"ip-10-0-2-224","kubernetes":{"container_name":"","namespace_name":"","pod_name":""},"level":"warning","log_source":"node","log_type":"infrastructure","message":"I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2","openshift":{"cluster_id":"41a17697-985f-4fc8-afa9-434482937887","sequence":1741781500535134264},"systemd":{"t":{"BOOT_ID":"ac3018b64b784cedbf332ee684d813b4","MACHINE_ID":"ec248febe93c4d59715b5326628d3475","TRANSPORT":"kernel"},"u":{"SYSLOG_FACILITY":"1","SYSLOG_IDENTIFIER":"kernel"}},"time":"2025-03-12T12:11:40+00:00"}

      However, trying to create an AlertingRule that alerts on these messages fails:

      kind: AlertingRule
      apiVersion: loki.grafana.com/v1
      metadata:
        name: kernel-io-errors
        namespace: openshift-logging
        labels:
          openshift.io/log-alerting: 'true'
      spec:
        groups:
          - interval: 1m
            name: KernelErrors
            rules:
              - alert: KernelIOErrors
                annotations:
                  summary: Kernel log is showing I/O errors, this is potentially related to iSCSI connectivity issues or other storage issues
                  description: Kernel log is showing I/O errors
                  message: '{{ $labels.message }}'
                expr: 'count_over_time({ log_type="infrastructure" } |~ `I/O error` | json [60m]) > 0'
                labels:
                  severity: critical
                for: 0m
        tenantID: infrastructure

      This fails with the following error message:

      $ oc apply -f rule.yml
      [..]
      spec.groups[0].rules[0].expr: Invalid value: "count_over_time({ log_type=\"infrastructure\" } |~ `I/O error` | json [60m]) > 0": rule needs to have a matcher for the namespace
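
      As the error indicates, the validation expects the stream selector to contain a namespace matcher. For illustration only, an expression of the following shape is what the check asks for (the namespace label name and value here are assumptions, not a recommended workaround):

      expr: 'count_over_time({ log_type="infrastructure", kubernetes_namespace_name="openshift-logging" } |~ `I/O error` | json [60m]) > 0'

      However, kernel messages are emitted by the node rather than by a pod, so kubernetes_namespace_name is empty in these records (see the JSON document above), and there is no meaningful namespace to match on for infrastructure logs.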

      I believe this to be a Bug, even though this issue has already been discussed in RFE-5656.

      Affected versions

      • OpenShift Container Platform 4.18.1
      • OpenShift Logging 6.2.0
      • Cluster Observability Operator 1.0.0
      • Loki Operator 6.2.0

      Steps to reproduce

      1. Set up the complete logging stack with Loki on OpenShift Container Platform 4.18 as per the quick start guide in the documentation
      2. Use "oc rsh" and "chroot /host" to run commands on a host. Generate a kernel message using "echo 'kernel: test message' > /dev/kmsg"
      3. Try to create an AlertingRule as per the definition above to filter for "test message"
      4. Observe that the alerting rule on kernel messages cannot be created (a sketch of these steps follows below)
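
      A minimal sketch of steps 2 to 4, assuming a node named "example-node" and the AlertingRule definition from above saved as rule.yml with its filter changed to match the test message:

      # Step 2: write a test message to the kernel ring buffer on one node
      $ oc debug node/example-node -- chroot /host sh -c "echo 'kernel: test message' > /dev/kmsg"

      # Step 3: in rule.yml, change the expression to match the test message, e.g.
      #   expr: 'count_over_time({ log_type="infrastructure" } |~ `test message` | json [60m]) > 0'
      $ oc apply -f rule.yml

      # Step 4: the apply is rejected with the same error:
      #   ... rule needs to have a matcher for the namespace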

      Additional information
