Managed Service - Streams / MGDSTRM-8345

Create an alert and SOP for fluentd failure


    • Type: Task
    • Resolution: Done
    • MK - Sprint 219

      WHAT

      <What is being asked for?>

      Fluentd can get stuck when all pods are rolled. We should alert when a fluentd pod is stuck and have an SOP for resolving the problem.

      WHY

      <Why is this task being done?>

      HOW

      <Suggestions for how this may be solved.> [Optional]

      Federate the kube_pod_container_status_ready metric for the openshift-logging namespace in the federation config: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/federation-config.yaml#L13
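
      A minimal sketch of the federation selector, assuming the linked file feeds standard Prometheus federation `match[]` parameters. The job name and surrounding layout are illustrative and may not match the actual file:

```yaml
# Hypothetical federation scrape job; only the match[] selector is the point here.
scrape_configs:
  - job_name: federate            # assumed name, not taken from the linked file
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        # Pull container readiness for the logging namespace only.
        - '{__name__="kube_pod_container_status_ready",namespace="openshift-logging"}'
```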

      Then use that metric to alert on a fluentd pod that is not ready: kube_pod_container_status_ready{namespace="openshift-logging", container="fluentd"} < 1
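
      The expression above could be wrapped in a Prometheus alerting rule roughly as follows. This is a sketch only: the group name, alert name, `for` duration, severity, and annotation keys are assumptions, and the SOP link is left as a placeholder:

```yaml
groups:
  - name: fluentd-availability          # assumed group name
    rules:
      - alert: FluentdPodNotReady       # assumed alert name
        # The metric is 1 when the container is ready and 0 otherwise,
        # so "< 1" matches any fluentd container that is not ready.
        expr: kube_pod_container_status_ready{namespace="openshift-logging", container="fluentd"} < 1
        for: 15m                        # assumed; long enough to ride out a normal rollout
        labels:
          severity: warning             # assumed severity
        annotations:
          summary: "fluentd pod {{ $labels.pod }} has not been ready for 15 minutes"
          sop_url: "<link to the SOP>"  # placeholder, to be filled in once the SOP exists
```

      The `for` clause matters here because pods are briefly not ready during any roll; without it the alert would fire on every routine restart rather than only on a pod that stays stuck.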

      DONE

      Include the following where applicable:

      • <bulleted list of functional acceptance criteria that need to be completed>
      • <call out anything on the documentation side that's needed as a result of this task being completed>
      • <any metrics, monitoring dashboards and alerts that need to be created or be updated>
      • <SOP creation or updates>

      Guidelines

      The following steps should be adhered to:

      • Required tests should be put in place - unit, integration, manual test cases (if necessary)
      • CI and all relevant tests passing
      • Changes have been verified by one additional reviewer against:
          • each required environment
          • each supported upgrade path
      • If the changes could have an impact on the clients (either UI or CLI), a JIRA should be created for making the required changes on the client side and acknowledged by one of the client side team members.
      • PR has been merged
         

            stobin1@redhat.com Steven Tobin
            stobin1@redhat.com Steven Tobin
            MK - Running the Service