Story
Resolution: Done
Major
Logging (Core) - Sprint 197, Logging (Core) - Sprint 198, Logging (Core) - Sprint 199
Story
As an admin I can get approximate data flow metrics (bytes/second) for each forwarder input:
- bytes_logged (counter) bytes written to log files by containers.
- bytes_collected (counter) bytes read by the collector for forwarding.
Log loss can be computed with a Prometheus query expression: (bytes_logged - bytes_collected)
The metric will be labeled so that the user can monitor total logged/collected bytes or break it down to individual log streams.
TODO: define the labels to use.
Note that we do not require 100% accuracy. The goal is to provide average trends so that SREs can identify problems or potential problems.
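The relationship between the two counters and the bytes/second view can be sketched as follows. This is a minimal illustration, not the collector's implementation: the `Sample` type, the function names, and the report keys are hypothetical, and it assumes neither counter resets between the two scrapes.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of both counters for a single log stream (hypothetical type)."""
    timestamp: float       # scrape time in seconds
    bytes_logged: int      # cumulative bytes written to log files
    bytes_collected: int   # cumulative bytes read by the collector


def per_second(earlier: int, later: int, seconds: float) -> float:
    """Average increase per second of a counter between two scrapes."""
    return (later - earlier) / seconds


def loss_report(first: Sample, second: Sample) -> dict:
    """Derive throughput and loss figures from two scrapes of the counters."""
    elapsed = second.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    logged_rate = per_second(first.bytes_logged, second.bytes_logged, elapsed)
    collected_rate = per_second(first.bytes_collected, second.bytes_collected, elapsed)
    return {
        "logged_bytes_per_sec": logged_rate,
        "collected_bytes_per_sec": collected_rate,
        # The expression from the story: bytes_logged - bytes_collected.
        "lost_bytes": second.bytes_logged - second.bytes_collected,
        # How fast the gap between the two counters is growing.
        "loss_bytes_per_sec": logged_rate - collected_rate,
    }
```

Because only trends are needed, a report like this is useful even when the individual counters are approximate: a steadily positive `loss_bytes_per_sec` indicates that the collector is falling behind.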
Acceptance Criteria
- Metric data published via the fluentd HTTP endpoint in a form that Prometheus can consume
- Logged and Collected metrics available for each container-UUID
- Tests for worst-case accuracy under heavy load (where large log loss is occurring). Goal: 95% accurate?
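To make the first two criteria concrete, the sketch below renders per-container counters in the Prometheus text exposition format. It is illustrative only: the label name `container_uuid` is a placeholder (the labels are still an open question), the function names are hypothetical, and this is not how fluentd itself produces its metrics output.

```python
from typing import Mapping


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, values: Mapping[str, int],
                   label: str = "container_uuid") -> str:
    """Render one counter family with a single label per sample.

    `label` is a placeholder name; the real label set is not yet defined.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for key, value in sorted(values.items()):
        lines.append(f'{name}{{{label}="{escape_label_value(key)}"}} {value}')
    return "\n".join(lines) + "\n"
```

Calling `render_counter` once for `bytes_logged` and once for `bytes_collected` and concatenating the results gives a body that a scrape of the endpoint could return.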
Notes
The logged metric is estimated by summing the sizes of files that are rotated away, as observed by fluentd's inotify watchers. Ideally this metric would be produced by CRI-O itself. It could also be estimated by an independent watcher process. Fluentd was chosen as the quickest path to a useful result because:
- It already performs the inotify and stat calls, so we just piggy-back on them to make the additional calculations.
- It already has an HTTP server for publishing metrics that is linked to Prometheus.
This method is not perfect: under heavy load, fluentd could miss some inotify events. However, file rotations are infrequent compared to file reads, so in most conditions we should get a reasonable estimate of loss. If necessary, we can add heuristics based on file timestamps to estimate missed notifications.
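The rotation-based estimate described above can be sketched as a small accumulator. This is an illustration under stated assumptions, not fluentd's code: the class and method names are hypothetical, counting the live file's current size on top of the rotated total is an added assumption, and the missed-rotation check uses a simple "file got smaller" test as a stand-in for the timestamp heuristics mentioned above.

```python
from collections import defaultdict


class LoggedBytesEstimator:
    """Estimate bytes logged per stream from file sizes and rotation events."""

    def __init__(self):
        # stream -> total size of files already rotated away
        self.rotated_total = defaultdict(int)
        # stream -> last observed size of the live log file
        self.current_size = {}

    def observe_size(self, stream, size):
        """Record a stat() result for the live file of a stream."""
        previous = self.current_size.get(stream, 0)
        if size < previous:
            # The file shrank without a rotation event: assume a rotation
            # notification was missed and count the last size we saw.
            self.rotated_total[stream] += previous
        self.current_size[stream] = size

    def observe_rotation(self, stream, final_size):
        """Record that the live file was rotated away at final_size bytes."""
        self.rotated_total[stream] += final_size
        self.current_size[stream] = 0

    def bytes_logged(self, stream):
        """Rotated-away bytes plus the last observed size of the live file."""
        return self.rotated_total[stream] + self.current_size.get(stream, 0)
```

When a rotation is missed, the shrink check undercounts by whatever was written between the last stat and the rotation, which is consistent with the goal of approximate trends rather than exact totals.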
Open Questions
Define the labels attached to these metrics.
Need to review overlap of logging meta-data and observatorium labels for other cluster metrics.