- Epic
- Resolution: Unresolved
- Minor
- None
- Improve Pipeline Infrastructure Stability
- False
- False
- To Do
- rhel-arr-cki
Identify what leads users to retry jobs in GitLab pipelines, and resolve the underlying causes to reduce user-visible failures.
AC:
- Provide metrics on how often jobs are retried by non-CKI users rather than by pipeline-herder
- Alert when users retry jobs
- Improve alerts for failing jobs: recognize jobs that fail for similar reasons and escalate them to Sentry/Alertmanager
- Update the documentation to explain how to convert the new alerts into pipeline-herder rules
Jira: CKI-7126
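
The retry metric from the first acceptance criterion could be sketched roughly as below. This is only an illustration, not the CKI implementation: it assumes job records shaped like the objects returned by the GitLab jobs API (with `id`, `name` and `user.username`), treats any later job with the same name in a pipeline as a retry, and uses made-up bot account names in `BOT_USERNAMES`. The function name `count_user_retries` is likewise hypothetical.

```python
from collections import Counter

# Hypothetical bot accounts; the real CKI/herder usernames are an assumption.
BOT_USERNAMES = frozenset({"cki-bot", "pipeline-herder"})


def count_user_retries(jobs, bot_usernames=BOT_USERNAMES):
    """Count retried jobs per non-bot user for the jobs of one pipeline.

    `jobs` is an iterable of dicts with "id", "name" and
    "user": {"username": ...}. A job is treated as a retry when an
    earlier job (lower id) with the same name is present in the list.
    """
    retries = Counter()
    seen_names = set()
    for job in sorted(jobs, key=lambda j: j["id"]):
        name = job["name"]
        username = (job.get("user") or {}).get("username")
        is_retry = name in seen_names
        # Only count retries triggered by a known, non-bot user.
        if is_retry and username is not None and username not in bot_usernames:
            retries[username] += 1
        seen_names.add(name)
    return retries


# Example data, invented for illustration.
example_jobs = [
    {"id": 1, "name": "build", "user": {"username": "alice"}},
    {"id": 2, "name": "test", "user": {"username": "alice"}},
    {"id": 3, "name": "test", "user": {"username": "pipeline-herder"}},
    {"id": 4, "name": "test", "user": {"username": "bob"}},
    {"id": 5, "name": "build", "user": {"username": "alice"}},
]
user_retries = count_user_retries(example_jobs)
```

For the example data, the herder retry (job 3) is ignored, while the retries by `bob` (job 4) and `alice` (job 5) are each counted once. The resulting per-user counts could then be exported as a metric and used as the basis for the alert on user retries.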