Feature Request
Resolution: Unresolved
Product / Portfolio Work
1. Proposed title of this feature request
RHACS: Expose Administrative Events as Prometheus Metrics for External Alerting
2. What is the nature and description of the request?
Red Hat Advanced Cluster Security (RHACS) generates "Administrative Events" to log the health and status of the RHACS system itself. These events include critical errors such as image scan errors, database issues, and sensor communication failures.
Currently, these events are only visible within the RHACS UI and are not exposed as metrics. This RFE requests that RHACS expose these administrative events, particularly those with a level of ERROR or WARN, as Prometheus metrics.
A new metric, for example, rox_central_administrative_events_total, could be introduced. This metric should use labels to differentiate the events, such as:
- type: (e.g., image_scan_error, db_error, sensor_comm_failure)
- level: (e.g., info, warn, error)
This would allow external Prometheus instances to scrape these metrics and build alerts based on them.
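As a rough illustration of the proposal, the sketch below shows how events could be counted per type and level and rendered in the Prometheus text exposition format, and how a consumer could detect new WARN or ERROR events between two scrapes. It is not RHACS code: the metric name is the one suggested in this request and does not exist today, the event dictionaries are a hypothetical stand-in for the real event records, and the helper functions are invented for this example.

```python
from collections import Counter

# Metric name proposed in this RFE; it is an assumption, not an existing RHACS metric.
METRIC = "rox_central_administrative_events_total"


def count_events(events):
    """Count events per (type, level) label pair.

    `events` is a hypothetical list of dicts with "type" and "level" keys.
    """
    return Counter((e["type"], e["level"]) for e in events)


def render_exposition(counts):
    """Render the counts in the Prometheus text exposition format."""
    lines = [
        f"# HELP {METRIC} Administrative events by type and level.",
        f"# TYPE {METRIC} counter",
    ]
    for (etype, level), value in sorted(counts.items()):
        lines.append(f'{METRIC}{{type="{etype}",level="{level}"}} {value}')
    return "\n".join(lines) + "\n"


def new_errors(previous, current, levels=("warn", "error")):
    """Return the label pairs whose counter grew between two scrapes.

    Only the given levels are considered, mirroring an alert that
    ignores informational events.
    """
    return {
        key: current[key] - previous.get(key, 0)
        for key in current
        if key[1] in levels and current[key] > previous.get(key, 0)
    }


if __name__ == "__main__":
    sample = [
        {"type": "image_scan_error", "level": "error"},
        {"type": "image_scan_error", "level": "error"},
        {"type": "db_error", "level": "warn"},
        {"type": "sensor_comm_failure", "level": "info"},
    ]
    counts = count_events(sample)
    print(render_exposition(counts))
    # Compare against an empty earlier scrape to list what would trigger an alert.
    print(new_errors(Counter(), counts))
```

In a real deployment the comparison between scrapes would normally be expressed as a Prometheus alerting rule over the counter rather than in application code; the `new_errors` helper only mimics that logic to make the intended behaviour concrete.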
An alternative solution would be an integration that sends alerts for administrative events, similar to the one that exists for policy violations.
3. Why does the customer need this? (List the business requirements here)
- Proactive Incident Management: The customer's primary requirement is to automatically generate investigation tickets when critical system errors occur. They cannot rely on operators manually monitoring the "Administrative Events" UI page.
- Ensure Platform Health and Reliability: Failures logged in Administrative Events (like scan errors) represent a silent failure of the security platform. If new images are not being scanned, the organization is exposed to unknown vulnerabilities. Proactive alerting on these failures is critical to maintaining the platform's reliability and security posture.
- Reduce Mean Time to Detection (MTTD): Without metrics, a critical error might go unnoticed for hours or days. Automated alerts based on metrics reduce the detection time to minutes, allowing the platform team to investigate and remediate issues before they cause a significant security gap.
4. List any affected packages or components
- Metrics Endpoint
- RHACS Central