
[spike] Metric for produced container logs and logs collected by the collector

    • Logging (Core) - Sprint 197, Logging (Core) - Sprint 198, Logging (Core) - Sprint 199

      Story

      As an admin I can get approximate data flow metrics (bytes/second) for each forwarder input:

      • bytes_logged (counter) bytes written to log files by containers.
      • bytes_collected (counter) bytes read by the collector for forwarding.
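
As an illustration of how a bytes/second figure can be derived from a counter such as those above, here is a minimal sketch. The function name and the (timestamp, value) sample format are assumptions made for this example, and the reset handling only loosely mirrors how monitoring systems usually treat counters.

```python
from typing import List, Tuple


def bytes_per_second(samples: List[Tuple[float, float]]) -> float:
    """Average rate of increase of a counter over (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a collector
    restart), so the value after the reset counts as growth from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


# Hypothetical samples: 3000 bytes logged over 20 seconds.
rate = bytes_per_second([(0, 0), (10, 1000), (20, 3000)])
print(rate)
```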

      Log loss can be computed with a Prometheus query expression: (bytes_logged - bytes_collected)

      The metrics will be labeled so that the user can monitor total logged/collected bytes or break them down by individual log stream.
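
The following sketch shows the same subtraction applied per log stream and then rolled up by label. The label set (namespace, pod, container) is hypothetical, since the labels are still a TODO in this story, and clamping negative differences to zero is an assumption made here because both counters are only approximate.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical label set; the real labels are not yet defined in this issue.
Labels = Tuple[str, str, str]  # (namespace, pod, container)


def log_loss(logged: Dict[Labels, int], collected: Dict[Labels, int]) -> Dict[Labels, int]:
    """Per-stream loss: bytes_logged - bytes_collected, clamped at zero."""
    return {
        stream: max(0, written - collected.get(stream, 0))
        for stream, written in logged.items()
    }


def loss_by_namespace(loss: Dict[Labels, int]) -> Dict[str, int]:
    """Roll per-stream loss up to one total per namespace."""
    totals: Dict[str, int] = defaultdict(int)
    for (namespace, _pod, _container), lost in loss.items():
        totals[namespace] += lost
    return dict(totals)


# Made-up counter values for three streams.
logged = {("a", "p1", "c"): 1000, ("a", "p2", "c"): 500, ("b", "p3", "c"): 200}
collected = {("a", "p1", "c"): 900, ("a", "p2", "c"): 500, ("b", "p3", "c"): 250}
per_stream = log_loss(logged, collected)
print(per_stream, loss_by_namespace(per_stream))
```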

      TODO: define the labels to use.

      Note that we do not require 100% accuracy. The goal is to provide average trends so that SREs can identify problems or potential problems.

      Acceptance Criteria

      • Metric data published via the fluentd HTTP endpoint in a form consumable by Prometheus
      • Logged and Collected metrics available for each container-UUID
      • Tests for worst-case accuracy under heavy load (where large log loss is occurring). Goal: 95% accurate?
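
The accuracy criterion above does not say how accuracy is measured. One possible reading, shown here purely as an assumption, is one minus the relative error of the estimated byte count against a known ground truth (for example, the number of bytes a load generator actually wrote). The function names are hypothetical.

```python
def estimate_accuracy(estimated: int, actual: int) -> float:
    """Return 1 - relative error of the estimate, floored at 0.0."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return max(0.0, 1.0 - abs(estimated - actual) / actual)


def meets_goal(estimated: int, actual: int, goal: float = 0.95) -> bool:
    """True if the estimate is at least `goal` accurate (0.95 = 95%)."""
    return estimate_accuracy(estimated, actual) >= goal


# Made-up numbers: the metric reports 960 bytes where 1000 were written.
print(meets_goal(960, 1000), meets_goal(900, 1000))
```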

      Notes

      The logged metric is estimated using the total size of files rotated away using fluentd's inotify watchers. Ideally this metric would be produced by CRI-O itself. It could also be estimated by an independent watcher process. Fluentd was chosen as the quickest path to a useful result because:

      • It already performs the inotify and stat calls, so we just piggy-back on them to make the additional calculations.
      • It already has an HTTP server for publishing metrics that is linked to Prometheus.

      This method is not perfect - under heavy load it is possible that fluentd could miss some inotify events. However, file rotations do not happen frequently compared to file reads so in most conditions we should get a reasonable estimate of loss. If necessary we can add heuristics based on file timestamps to guess and estimate missed notifications.
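
The estimation idea in these notes can be sketched as a small tracker fed by watcher events. This is not the code from the fluentd plugin: the class and method names are hypothetical, counting the live file's current size in addition to rotated files is an assumption, and the missed-rotation guess uses a simple "file shrank" check rather than the timestamp-based heuristics mentioned above.

```python
class LoggedBytesEstimator:
    """Estimate bytes written to one log stream from observed file sizes."""

    def __init__(self) -> None:
        self.rotated_total = 0  # bytes in files already rotated away
        self.current_size = 0   # last observed size of the live file

    def on_stat(self, size: int) -> None:
        """Record the live file's size, e.g. after a modify notification."""
        if size < self.current_size:
            # The file shrank without a rotation event: assume a rotation
            # was missed and count the last size seen for the old file.
            # Bytes written after that last stat are lost to the estimate.
            self.rotated_total += self.current_size
        self.current_size = size

    def on_rotate(self, final_size: int) -> None:
        """Record a rotation; final_size is the rotated file's size."""
        self.rotated_total += final_size
        self.current_size = 0

    @property
    def bytes_logged(self) -> int:
        return self.rotated_total + self.current_size


est = LoggedBytesEstimator()
est.on_stat(400)      # live file seen at 400 bytes
est.on_rotate(1000)   # file rotated away at 1000 bytes
est.on_stat(300)      # new live file seen at 300 bytes
print(est.bytes_logged)
```

In this sketch a missed rotation always leads to an undercount, which is consistent with the note that the result is an estimate rather than an exact figure.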

      Open Questions

      Define the labels attached to these metrics.

      Need to review overlap of logging meta-data and observatorium labels for other cluster metrics.

       

            [LOG-1032] [spike] Metric for produced container logs and logs collected by the collector

            Clark Everson added a comment - workflow didn't correctly change, fixing affected issues

            Jeffrey Cantrill added a comment - Retitled this as a spike because the outcome of the effort was knowledge gained and tasks to deliver what is needed for production. Closing as done.

            pratibha moogi (Inactive) added a comment -

            What has been achieved, in a nutshell:

            1. Functionality implemented at the in_tail.rb and fluent-prometheus-plugin.rb level
            2. New metric showing up in the Prometheus dashboard
            3. Plugins tested for rotation tracking and log loss
               • Found that fluentd misses a few rotations when the maxsize of log files is set to < 1MB
            4. Validation scripts written to measure fluentd's ability to correctly track totalbytes_collected and totalbytes_logged
            5. OCP validation setup: custom fluentd built and new plugin changes tested

            The implementation can be found here:
            PR - https://github.com/openshift/origin-aggregated-logging/pull/2070

            What has not been achieved yet:

            1. 100% correct computation of log loss, as fluentd was found to miss a few rotations
            2. The current implementation duplicates code. It should inherit the unchanged parts of the plugins, mostly via class inheritance, to reduce redundancy

            Jeffrey Cantrill added a comment - Updated the issue title to reflect the intent of this card and expected outcome

            Alan Conway added a comment -

            Updated with clearer acceptance criteria and also made a separate story for outbound loss metrics (linked)

            They are related but should be separate for diagnostic purposes: one indicates the collector can't keep up, the other indicates the forwarder's target can't keep up. Also they can be implemented separately so better to separate them for planning.


            pratibha moogi (Inactive) added a comment - jcantril@redhat.com added all the above points in the Summary

            Jeffrey Cantrill added a comment -

            pmoogi please alter the text to define the goal of this card and what it means to be successful:

            Summary

            • As an administrator of logging, I want XXXXX, so that XXXX

            Acceptance Criteria

            Notes

            We should also add this to the existing sprint if you are actively working on it, and specify points. We can discuss further if you need guidance on pointing.

              pmoogi pratibha moogi (Inactive)
              pmoogi pratibha moogi (Inactive)
              Anping Li Anping Li