OpenShift Logging / LOG-3432

Vector Alerts CollectorHighErrorRate and CollectorVeryHighErrorRate are not firing even if Vector component errors are high


    • The CollectorHighErrorRate and CollectorVeryHighErrorRate alerts are removed in the Logging 6.0 release and may be reintroduced in a future release.
    • Bug Fix
    • Log Collection - Sprint 235, Log Collection - Sprint 236, Log Collection - Sprint 237, Log Collection - Sprint 238, Log Collection - Sprint 239, Log Collection - Sprint 240, Log Collection - Sprint 241, Log Collection - Sprint 242, Log Collection - Sprint 243, Log Collection - Sprint 244, Log Collection - Sprint 245, Log Collection - Sprint 258

      Description of problem:

      The CollectorHighErrorRate and CollectorVeryHighErrorRate alerts do not fire even when the Vector component error rate is high.

      Version-Release number of selected component (if applicable):

      cluster-logging.v5.6.0

      elasticsearch-operator.v5.6.0

      How reproducible:

      Always

      Steps to Reproduce:

      * Create a ClusterLogging instance with Vector as the collector and Elasticsearch (ES) as the default log store.

      * Delete the ES pods in a loop to generate Vector errors:

      # Repeatedly delete all Elasticsearch pods by label (stop with Ctrl-C)
      while true; do oc delete pod -l component=elasticsearch; done

      * Wait 15-20 minutes, then query vector_component_errors_total and check the alerts. CollectorHighErrorRate and CollectorVeryHighErrorRate are not triggered:

      sum(irate(vector_component_errors_total[2m]))
      1.6332200316965575
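
      The check in the step above can also be scripted. The following is a minimal sketch, assuming the default thanos-querier route in the openshift-monitoring namespace and a logged-in user who is allowed to query metrics. The helper names prom_query and alert_state_query are hypothetical and not part of the product.

```shell
#!/usr/bin/env bash
# Sketch only: helpers for checking the error rate and the alert state.
# Assumptions: the thanos-querier route exists in openshift-monitoring
# and the current user's token may query cluster metrics.

# Build the PromQL selector that lists pending/firing instances of an alert.
alert_state_query() {
  printf 'ALERTS{alertname="%s"}' "$1"
}

# Run an instant query against the Thanos querier and print the raw JSON.
prom_query() {
  local host token
  host=$(oc -n openshift-monitoring get route thanos-querier \
    -o jsonpath='{.spec.host}') || return 1
  token=$(oc whoami -t) || return 1
  curl -sk -G -H "Authorization: Bearer ${token}" \
    --data-urlencode "query=$1" \
    "https://${host}/api/v1/query"
}

# Example usage (requires cluster access, so it is left commented out):
#   prom_query 'sum(irate(vector_component_errors_total[2m]))'
#   prom_query "$(alert_state_query CollectorHighErrorRate)"
```

      An empty "result" array for the ALERTS query means the alert is neither pending nor firing.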
      

      Other steps tried to generate errors:

      * Create a ClusterLogForwarder instance that sends logs to a non-existent external ES instance:

      apiVersion: logging.openshift.io/v1
      kind: ClusterLogForwarder
      metadata:
        name: instance
        namespace: openshift-logging
      spec:
        outputs:
        - name: es-created-by-user
          type: elasticsearch
          url: 'http://elasticsearch-server.aosqe-es.svc:9200'
        pipelines:
        - name: forward-to-external-es
          inputRefs:
          - infrastructure
          - application
          - audit
          outputRefs:
          - es-created-by-user

      * Create a ClusterLogging instance:

      apiVersion: "logging.openshift.io/v1"
      kind: "ClusterLogging"
      metadata:
        name: "instance" 
        namespace: "openshift-logging"
      spec:
        managementState: "Managed"  
        collection:
          logs:
            type: "vector"  
            vector: {}

      * Wait 15-20 minutes, then query vector_component_errors_total and check the alerts. CollectorHighErrorRate and CollectorVeryHighErrorRate are not triggered.

      Additional info:

      The alert expression returns no value:

      100 * (sum by(pod, instance) (rate(vector_component_errors_total[2m])) / sum by(pod, instance) (rate(vector_component_received_events_total[2m]))) > 10
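
      An empty result does not by itself show that the error rate is zero. In PromQL, the division yields a sample only for (pod, instance) pairs present on both sides, and the trailing "> 10" drops every sample at or below 10 percent. The sketch below mirrors that filtering with a hypothetical helper, error_pct. The rates passed to it are illustrative values, not measurements from this cluster.

```shell
#!/usr/bin/env bash
# error_pct ERR_RATE RECV_RATE [THRESHOLD]
# Prints the error percentage only when it exceeds THRESHOLD (default 10).
# Prints nothing when the received-events rate is missing or when the
# percentage is at or below the threshold, similar to how the PromQL
# comparison filters series out of the result.
error_pct() {
  awk -v e="$1" -v r="$2" -v t="${3:-10}" 'BEGIN {
    if (r == "") exit                 # no matching denominator series
    if (r + 0 == 0) {                 # PromQL: x/0 is +Inf for x > 0
      if (e + 0 > 0) print "+Inf"
      exit
    }
    pct = 100 * e / r
    if (pct > t) printf "%.2f\n", pct
  }'
}

# Example usage with made-up rates (events per second):
#   error_pct 2 10     # 20% of events fail, so the value is printed
#   error_pct 1 100    # 1% is below the threshold, so nothing is printed
#   error_pct 2 ""     # no received-events series, so nothing is printed
```

      Evaluating the numerator, the denominator, and the ratio without "> 10" as separate queries should show which of these cases applies.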

      The same issue is reported for FluentdHighErrorRate and FluentdVeryHighErrorRate:

      https://issues.redhat.com/browse/LOG-2457

      It seems that either we can't trigger these alerts reliably or we are simulating the scenario incorrectly.

              jcantril@redhat.com Jeffrey Cantrill
              rhn-support-ikanse Ishwar Kanse
              Anping Li Anping Li