Description
MON-1791 concluded that it's OK to use the Insights operator to gather more data. Right now, the operator can collect additional data when a particular alert fires (see https://github.com/openshift/insights-operator/blob/2b6697e230b098207dc09e5b05ea655ced1cb881/pkg/gatherers/conditional/conditional_gatherer.go#L61-L96). For example, Insights currently collects the API request count resources when the "APIRemovedInNextEUSReleaseInUse" alert is firing, to identify which clients are using API versions that are going to be removed in the next EUS release.
A few examples that come to mind for monitoring:
- Collect PVC resources when KubePersistentVolumeFillingUp fires with critical severity.
- Collect the list of down targets from /api/v1/targets when PrometheusTargetSyncFailure fires.
For each critical alert that the monitoring stack can fire, we should identify additional data that could be collected to help diagnose the underlying problem.
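To make the outcome of this work easier to picture, the alert-to-data pairing could be written down as a small rule table. The Go sketch below is only an illustration: the `Alert` and `Rule` types, the `rules` table, and `gatherFor` are hypothetical names invented here and are not the Insights operator's actual conditional gatherer API. The rules themselves are the examples from this ticket.

```go
package main

import "fmt"

// Alert is a simplified, hypothetical view of a firing alert.
type Alert struct {
	Name     string
	Severity string
}

// Rule pairs an alert condition with the extra data to gather.
// An empty Severity matches any severity. (Hypothetical type,
// not the Insights operator's real rule format.)
type Rule struct {
	AlertName string
	Severity  string
	Gather    string
}

// rules lists the examples mentioned in this ticket.
var rules = []Rule{
	{AlertName: "APIRemovedInNextEUSReleaseInUse", Gather: "API request count resources"},
	{AlertName: "KubePersistentVolumeFillingUp", Severity: "critical", Gather: "PVC resources"},
	{AlertName: "PrometheusTargetSyncFailure", Gather: "down targets from /api/v1/targets"},
}

// gatherFor returns the data to collect for the given firing alerts,
// in rule-match order and without duplicates.
func gatherFor(firing []Alert) []string {
	var out []string
	seen := map[string]bool{}
	for _, a := range firing {
		for _, r := range rules {
			if r.AlertName != a.Name {
				continue
			}
			if r.Severity != "" && r.Severity != a.Severity {
				continue
			}
			if !seen[r.Gather] {
				seen[r.Gather] = true
				out = append(out, r.Gather)
			}
		}
	}
	return out
}

func main() {
	firing := []Alert{
		{Name: "KubePersistentVolumeFillingUp", Severity: "critical"},
		{Name: "PrometheusTargetSyncFailure", Severity: "critical"},
	}
	for _, g := range gatherFor(firing) {
		fmt.Println("gather:", g)
	}
}
```

Each Jira ticket created under the DoD would then correspond to one additional row in such a table: the critical alert and the data worth collecting when it fires.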
DoD:
- Jira tickets created for all critical alerts.