Observability Documentation / OBSDOCS-479

[DOC] Flow control mechanisms for more predictable log collection


    • Type: Story
    • Resolution: Done
    • Priority: Major
    • Logging 5.8
    • Logging
    • OBSDOCS (Nov 13 - Dec 4) #245, OBSDOCS (Dec 4 - Dec 25) #246

      Goals

      As a cluster admin I can:

      • Limit per-container logging rates (bytes/sec) for selected containers:
        • Optional cluster-wide default for all containers.
        • Specific rate for containers in listed namespaces.
        • Specific rate for containers matching a label selector.
      • Ignore (do not collect) logs from selected containers.
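
      The selection rules above can be sketched as a small lookup. This is only an illustration of the goals, not the operator's API: `LimitRule`, `resolve_limit`, the "first matching rule wins" precedence, and the use of a zero rate to mean "ignore" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LimitRule:
    """One hypothetical rate rule; not the logging operator's real API."""
    bytes_per_sec: int                                 # assumption: 0 means "ignore, do not collect"
    namespaces: frozenset = frozenset()                # empty set matches any namespace
    match_labels: dict = field(default_factory=dict)   # container must carry all of these labels

    def matches(self, namespace: str, labels: dict) -> bool:
        if self.namespaces and namespace not in self.namespaces:
            return False
        return all(labels.get(k) == v for k, v in self.match_labels.items())


def resolve_limit(rules, default: Optional[int], namespace: str, labels: dict) -> Optional[int]:
    """Return the bytes/sec limit for a container, or None for unlimited.

    Assumed precedence: the first matching rule wins, then the optional
    cluster-wide default applies.
    """
    for rule in rules:
        if rule.matches(namespace, labels):
            return rule.bytes_per_sec
    return default


# Hypothetical configuration: ignore labelled containers, cap the "batch" namespace.
rules = [
    LimitRule(bytes_per_sec=0, match_labels={"logging": "ignore"}),
    LimitRule(bytes_per_sec=50_000, namespaces=frozenset({"batch"})),
]
batch_limit = resolve_limit(rules, 10_000, "batch", {})
ignored_limit = resolve_limit(rules, 10_000, "batch", {"logging": "ignore"})
default_limit = resolve_limit(rules, 10_000, "web", {})
```

      With this ordering, an "ignore" label takes priority over a namespace rate. A real implementation would need to define and document its own precedence.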

      The logging system will drop data, if necessary, to keep containers within their limits.
      Which data gets dropped depends on timing and other run-time factors in the logging stack.
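
      One common way to keep a stream within a bytes/sec limit is a token bucket. The sketch below uses that technique to show why the dropped records depend on timing. It is an assumption for illustration only: this ticket does not say which mechanism the collector uses, and `ByteRateLimiter` and its parameters are hypothetical names.

```python
class ByteRateLimiter:
    """Token bucket measured in bytes; a sketch, not the collector's actual mechanism."""

    def __init__(self, bytes_per_sec: float, burst_bytes: float, now: float = 0.0):
        self.rate = bytes_per_sec      # refill rate in bytes per second
        self.capacity = burst_bytes    # largest burst allowed at once
        self.tokens = burst_bytes      # start with a full bucket
        self.last = now                # timestamp of the previous call, in seconds

    def allow(self, record: bytes, now: float) -> bool:
        """Return True to forward the record, False to drop it."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if len(record) <= self.tokens:
            self.tokens -= len(record)
            return True
        return False


# Three identical 60-byte records against a 100 bytes/sec limit.
limiter = ByteRateLimiter(bytes_per_sec=100, burst_bytes=100)
record = b"x" * 60
first = limiter.allow(record, now=0.0)    # bucket is full, so the record fits
second = limiter.allow(record, now=0.1)   # too soon, the bucket has not refilled enough
third = limiter.allow(record, now=0.5)    # enough time has passed to refill
```

      The second record is dropped only because of when it arrived. The same record sent 0.4 seconds later is forwarded.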

      We want admins to be able to:

      • Set predictable limits on logging, in order to:
        • Simplify provisioning.
        • Avoid unexpected overloads.

      Non-goals

      The following are not goals for this Epic; they will be covered separately:

      Back-pressure (Epic LOG-1073) is a separate Epic. Some use cases will not tolerate back-pressure. Measurement and rate control are needed even with back-pressure.

      Combined rate limits (Epic LOG-1074), for example a single rate limit shared by all containers in a namespace, are more useful to admins but more complex to implement. Per-container limits are a necessary first step and have some value on their own.

      Content-based filtering, that is, dropping logs selectively based on content (e.g. debug vs. info logs), may be supported in the future.

      Motivation

      The logging system lacks flow control. The CRI-O container runtime writes logs to disk as fast as containers produce them; there is no coordination with the logging collector that reads those files. This results in:

      • Log loss if the logs are written faster than they are read.
      • Log data backing up at various buffering points, which causes slow recovery and high latency.

      We cannot prevent log loss completely, but we need to provide better control over it. In particular, we need to ensure that "noisy neighbors" or "bad actors" cannot clog up the system and prevent log collection from well-behaved applications.

      Acceptance Criteria

      • Verify that you can query the exposed metrics within the OpenShift Console -> Metrics tab.
      • Verify that alerts fire when the defined thresholds are exceeded.
      • Verify that a default per-container rate is enforced (data is dropped) correctly.
      • Verify that selective rates by label or namespace are enforced correctly.
      • Verify that ignored logs are not collected or forwarded.

      Dependencies (internal and external)

      • Selector APIs - label selectors, namespace selectors
      • Perf/Scale team to verify performance implications for block policy.

      Previous Work

      • Metrics and dashboards added by LOG-915.

      Open questions

              abrennan@redhat.com Ashleigh Brennan
              bdooley@redhat.com Brian Dooley