
Ensure Critical alerts have complete playbook entries

    • Type: Task
    • Resolution: Done
    • Priority: Major
    • Logging 5.1
    • None
    • Log Storage
    • None
    • 3
    • False
    • False
    • NEW
    • NEW
    • Undefined
    • Logging (LogExp) - Sprint 200

      Provide copy/paste commands to issue.

      Provide easy-to-follow checks and next steps (think of someone who has to wake up in the middle of the night to resolve an alert).

      Current upstream doc is here: https://github.com/openshift/elasticsearch-operator/blob/master/docs/alerts.md 

       

      (For now, this task focuses on Critical alerts to keep it bite-sized; follow-up tasks will cover warning and info alerts.)

      Acceptance Criteria:

      • Ensure that any alerts that are marked as critical have proper diagnostic steps and action steps so that a user can resolve the alert
      • Possibly also ensure that our alerts use the runbook_url variable to point to our playbook location (TBD whether this will be upstream or downstream)
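
The second criterion could be checked mechanically. The following is a minimal sketch, assuming the alert rules are available as a YAML file in which each rule starts with a `- alert:` line, severity is set as `severity: critical`, and the playbook link is a `runbook_url:` annotation. The function name is hypothetical, and this is a line-based scan, not a full YAML parser.

```shell
#!/usr/bin/env bash

# Print the name of every alert that is marked critical but has no
# runbook_url annotation. Reads rule YAML from the given files, or from
# stdin when no file is given. Hypothetical helper, not part of the operator.
critical_alerts_missing_runbook() {
  awk '
    # Report the rule collected so far if it is critical and has no runbook.
    function report() { if (name != "" && critical && !runbook) print name }
    /^[[:space:]]*-[[:space:]]*alert:/ {
      report()
      name = $NF; critical = 0; runbook = 0
      next
    }
    /severity:[[:space:]]*critical/ { critical = 1 }
    /runbook_url:/                  { runbook = 1 }
    END { report() }
  ' "$@"
}
```

For example, `critical_alerts_missing_runbook rules.yaml` (file name assumed) would list the critical alerts that still need a playbook link; empty output means every critical alert in the file has one.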

            [LOG-1098] Ensure Critical alerts have complete playbook entries

            Errata Tool added a comment -

            This issue has been addressed in the following products:

            OpenShift Logging 5.1

            Via RHBA-2021:2112 https://access.redhat.com/errata/RHBA-2021:2112

            Qiaoling Tang added a comment -

            The new changes lgtm.

            Qiaoling Tang added a comment -

            Hi sasagarw@redhat.com,

            I did some testing based on the changes in https://github.com/openshift/elasticsearch-operator/pull/673/files and found two issues with lowering the number of replicas:

            1. Executing `oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.number_of_replicas":<number_of_replicas>}'` does not take effect; the Elasticsearch Operator (EO) always rewrites the setting.

            2. When the `Elasticsearch Node Disk Flood Watermark Reached` alert is firing, executing the same command returns the following error:

            {
              "error" : {
                "root_cause" : [
                  {
                    "type" : "cluster_block_exception",
                    "reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
                  }
                ],
                "type" : "cluster_block_exception",
                "reason" : "blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];"
              },
              "status" : 403
            }
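
The 403 response in the second issue is what Elasticsearch returns for an index that carries the read-only / allow-delete block, which it applies when a node crosses the flood-stage disk watermark. The following is a minimal sketch of how a playbook step might recognize that response and show the command that clears the block. It assumes the block is the `index.blocks.read_only_allow_delete` index setting and that disk space has already been freed. Both function names are hypothetical, and the second function only prints the command; it does not run it.

```shell
#!/usr/bin/env bash

# Succeeds (exit 0) when an Elasticsearch response body reports the
# read-only / allow-delete index block (FORBIDDEN/12).
is_read_only_block() {
  printf '%s' "$1" | grep -qF 'FORBIDDEN/12/index read-only / allow delete'
}

# Prints, without executing, an oc command that resets the block on one index.
# The pod name and index name are placeholders supplied by the caller.
print_unblock_command() {
  local pod="$1" index="$2"
  printf "oc exec -n openshift-logging -c elasticsearch %s -- es_util --query=%s/_settings?pretty -X PUT -d '%s'\n" \
    "$pod" "$index" '{"index.blocks.read_only_allow_delete":null}'
}
```

A caller could capture the output of the replica-count command, pass it to `is_read_only_block`, and print the unblock command only when the check succeeds. This sketch does not address the first issue, where the operator rewrites the replica setting.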

            Eric Wolinetz (Inactive) added a comment -

            rdlugyhe, that sounds good to me. I think we should also have a line item specifically for sasagarw@redhat.com: open a PR against our EO repo to update the current alert guide with his content (it can be based on your edits). I think it would be useful to have the content in both an upstream and a downstream location.

            Rolfe Dlugy-Hegwer added a comment - edited

            sasagarw@redhat.com ewolinet@redhat.com: How do these next steps look to you?

            • Sashank composes a first draft of the diagnostic and troubleshooting steps in the Google doc.
            • Rolfe edits the content.
            • QE verifies the content.
            • Rolfe converts and publishes the content as topics in the OpenShift docs. 

            Does this meet the needs of the design you are working on?

            Sashank Agarwal (Inactive) added a comment -

            rdlugyhe, how should we go about this? Can you take a look at it and let me know? Thanks.

            Sashank Agarwal (Inactive) added a comment - Link to the document: https://docs.google.com/document/d/1EJjqixIxEPLf5pWQcqWqyYfRgvsDktSzBDCc6JcxrYU/edit?usp=sharing

              sasagarw@redhat.com Sashank Agarwal (Inactive)
              ewolinet@redhat.com Eric Wolinetz (Inactive)
              Qiaoling Tang Qiaoling Tang