Type: Bug
Resolution: Done
Priority: Normal
Impact: Quality / Stability / Reliability
Status: VERIFIED
Release Note Type: Bug Fix
Release Note Text: With this fix, users will be able to create RecordingRules or AlertingRules in the Infrastructure and/or Audit tenants without having to specify a namespace label.
Sprint: Log Storage - Sprint 273, Logging - Sprint 274
Severity: Important
Problem statement
To detect and address storage issues, a customer would like to alert on kernel messages indicating that there are I/O errors, such as the following:
Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
Mar 02 02:12:33 example-node.example.com kernel: I/O error, dev sde, sector 236671992 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2
These messages are also visible via Loki / Infrastructure Logging:
{"@timestamp":"2025-03-12T12:11:40.169048Z","_RUNTIME_SCOPE":"system","_SOURCE_MONOTONIC_TIMESTAMP":"1012091349","hostname":"ip-10-0-2-224","kubernetes":{"container_name":"","namespace_name":"","pod_name":""},"level":"warning","log_source":"node","log_type":"infrastructure","message":"I/O error, dev sde, sector 236672120 op 0x1:(WRITE) flags 0x4a00 phys_seg 16 prio class 2","openshift":{"cluster_id":"41a17697-985f-4fc8-afa9-434482937887","sequence":1741781500535134264},"systemd":{"t":{"BOOT_ID":"ac3018b64b784cedbf332ee684d813b4","MACHINE_ID":"ec248febe93c4d59715b5326628d3475","TRANSPORT":"kernel"},"u":{"SYSLOG_FACILITY":"1","SYSLOG_IDENTIFIER":"kernel"}},"time":"2025-03-12T12:11:40+00:00"}
However, trying to create an AlertingRule that alerts on this message fails:
kind: AlertingRule
apiVersion: loki.grafana.com/v1
metadata:
  name: kernel-io-errors
  namespace: openshift-logging
  labels:
    openshift.io/log-alerting: 'true'
spec:
  groups:
    - interval: 1m
      name: KernelErrors
      rules:
        - alert: KernelIOErrors
          annotations:
            summary: Kernel log is showing I/O errors, this is potentially related to iSCSI connectivity issues or other storage issues
            description: Kernel log is showing I/O errors
            message: '{{ $labels.message }}'
          expr: 'count_over_time({ log_type="infrastructure" } |~ `I/O error` | json [60m]) > 0'
          labels:
            severity: critical
          for: 0m
  tenantID: infrastructure
This fails with the following error message:
$ oc apply -f rule.yml
[..]
spec.groups[0].rules[0].expr: Invalid value: "count_over_time({ log_type=\"infrastructure\" } |~ `I/O error` | json [60m]) > 0": rule needs to have a matcher for the namespace
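For context, the validator appears to expect a namespace matcher in the stream selector, roughly along these lines (the empty matcher value is purely illustrative, not a verified workaround):

expr: 'count_over_time({ log_type="infrastructure", kubernetes_namespace_name="" } |~ `I/O error` | json [60m]) > 0'

As the Loki record above shows, kernel messages carry an empty namespace_name, so no namespace matcher can meaningfully select node-level logs.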
I believe this to be a Bug, even though this issue has already been discussed in RFE-5656.
Affected versions
- OpenShift Container Platform 4.18.1
- OpenShift Logging 6.2.0
- Cluster Observability Operator 1.0.0
- Loki Operator 6.2.0
Steps to reproduce
1. Set up the complete logging stack with Loki on OpenShift Container Platform 4.18 as per the quick start guide in the documentation
2. Use "oc rsh" and "chroot /host" to run commands on a node, then generate a kernel message using "echo 'kernel: test message' > /dev/kmsg" (see the sketch after this list)
3. Try to create an AlertingRule as per the definition above to filter for "test message"
4. Observe that the AlertingRule for kernel messages cannot be created
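A minimal sketch of step 2, using "oc debug node/..." as one way to get the "oc rsh" plus "chroot /host" shell on a node (the node name is illustrative):

$ oc debug node/example-node
sh-5.1# chroot /host
sh-5.1# echo 'kernel: test message' > /dev/kmsg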
Additional information
- This has also been discussed in RFE-5656
- Slack discussion: https://redhat-internal.slack.com/archives/CB3HXM2QK/p1741271929894219
- Clones: LOG-6862 Cannot create AlertingRule for infrastructure logs without specifying the namespace (Closed)