-
Task
-
Resolution: Done
MK - Sprint 219
WHAT
<What is being asked for?>
Fluentd can get stuck when all pods are rolled. We should alert when a fluentd pod is stuck and have an SOP for resolving the problem.
WHY
<Why is this task being done?>
HOW
<Suggestions for how this may be solved.> [Optional]
Federate the kube_pod_container_status_ready metric for the openshift-logging namespace in the federation config here: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/federation-config.yaml#L13
Then use that metric to alert on a fluentd container that is not ready: kube_pod_container_status_ready{namespace="openshift-logging", container="fluentd"} < 1
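The alert suggested above could be sketched as a Prometheus rule group along the following lines. This is only an illustration: the group name, alert name, `for` duration, severity, and annotation text are assumptions and do not come from this task, and the rule presumes the metric has already been federated for the openshift-logging namespace.

```yaml
# Sketch of a Prometheus rule file; all names and thresholds below are
# hypothetical except the expression, which is the one proposed in this task.
groups:
  - name: fluentd-pod-readiness            # hypothetical group name
    rules:
      - alert: FluentdContainerNotReady    # hypothetical alert name
        # Fires when a fluentd container in openshift-logging reports not ready.
        expr: kube_pod_container_status_ready{namespace="openshift-logging", container="fluentd"} < 1
        # Assumed grace period so that a normal rolling restart does not page.
        for: 15m
        labels:
          severity: warning                # assumed severity
        annotations:
          summary: "Fluentd container in pod {{ $labels.pod }} has not been ready for 15 minutes."
          # A link to the SOP would go here once the SOP exists.
```

The `for` clause is the main design choice: without it, every pod that is briefly not ready during a roll would raise the alert, whereas the problem described here is a pod that stays stuck.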
DONE
Include the following where applicable:
- <bulleted list of functional acceptance criteria that need to be completed>
- <call out anything on the documentation side that's needed as a result of this task being completed>
- <any metrics, monitoring dashboards and alerts that need to be created or be updated>
- <SOP creation or updates>
Guidelines
The following steps should be adhered to:
- Required tests should be put in place - unit, integration, manual test cases (if necessary)
- CI and all relevant tests passing
- Changes have been verified by one additional reviewer against:
  - each required environment
  - each supported upgrade path
- If the changes could have an impact on the clients (either UI or CLI), a JIRA should be created for making the required changes on the client side and acknowledged by one of the client side team members.
- PR has been merged
- links to