OpenShift Logging / LOG-2219

Logs in Kibana sometimes appear out of order.


    • Type: Epic
    • Resolution: Done
    • Priority: Undefined
    • Component/s: Log Collection, Log Storage
    • Epic Name: Logs out of order

      Goals

      Allow users of Elasticsearch and Kibana to list logs in the order they were written by including a sequence number in each log record. Kibana queries would sort by the existing timestamp first, then use the sequence number to order logs with identical timestamps.
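
      For illustration, a query along the following lines could express the two-key sort. This is a sketch, not an agreed implementation: the endpoint, index pattern, and `@timestamp` field name are assumptions, and `openshift.sequence` is the field proposed under Acceptance Criteria below.

        # Sketch: sort by timestamp, breaking ties with the proposed sequence field.
        from elasticsearch import Elasticsearch

        es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

        resp = es.search(
            index="app-*",  # hypothetical index pattern
            query={"match_all": {}},
            sort=[
                {"@timestamp": {"order": "asc"}},          # primary key: timestamp
                {"openshift.sequence": {"order": "asc"}},  # tie-breaker: sequence
            ],
        )
        for hit in resp["hits"]["hits"]:
            print(hit["_source"].get("message"))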

      Motivation

      Logs are sent to Elasticsearch in the same order the application wrote them; however, Elasticsearch does not record the order in which they arrive.

      Kibana tries to restore the order by sorting on the timestamp, but this is not reliable: log timestamps often have millisecond resolution or worse, so two or more logs can carry the same timestamp.

      Alternatives

      Using a nanosecond timestamp was suggested as an alternative, but it is less reliable and more complex than a sequence number.

      • The existing timestamps are written by the original application; the logging stack has no control over their resolution.
      • The accuracy of any additional timestamp written by the log collector is at the mercy of the collector implementation (fluentd). Just because a timestamp has enough digits to hold nanoseconds does not mean it has real nanosecond precision.
        • PC clocks are not nanosecond-precise; a resolution between 10 and 100 microseconds is common. If an application can write duplicate messages within 1 millisecond, it can probably do so within 0.1 milliseconds too.
        • Polling collectors like fluentd often check the clock once per I/O action (read/write) and use that single timestamp while processing the entire read, to avoid the overhead of multiple clock system calls per I/O system call. This means that multiple log lines read in one batch are likely to get duplicate timestamps even if the host clock is precise enough to distinguish the times they were processed.
        • Incrementing a counter for each log line is much less expensive than checking the system clock multiple times per read, as the sketch after this list illustrates.
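
      A minimal sketch of that difference, not fluentd's actual implementation: a polling collector stamps a whole batch with one clock reading, while a plain counter still distinguishes every line in the batch.

        import os
        import time
        from itertools import count

        sequence = count()  # incrementing a counter: no system call required

        def process_batch(fd: int) -> list[dict]:
            """Read once, stamp once: every line in the batch shares a timestamp."""
            lines = os.read(fd, 65536).splitlines()  # one read() system call
            stamp = time.time()                      # one clock system call
            return [
                {"timestamp": stamp, "sequence": next(sequence), "message": line}
                for line in lines
            ]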

      Acceptance Criteria

      • Log records include a new integer field, `openshift.sequence`.
      • Sequence values increase to a maximum and then wrap around to 0 (see the sketch after this list).
      • The maximum is large enough that a full wrap-around takes much longer than one second, so logs with duplicate timestamps (even at one-second resolution) always have distinct sequence numbers.
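
      A sketch of the intended wrap-around behaviour, assuming the 32-bit width discussed under risks below:

        MAX_SEQUENCE = 2**32  # assumption: 32-bit counter, per the risk discussion

        _seq = 0

        def next_sequence() -> int:
            """Return the next sequence number, wrapping to 0 at MAX_SEQUENCE."""
            global _seq
            value = _seq
            _seq = (_seq + 1) % MAX_SEQUENCE
            return value

      Any sufficiently wide counter works; the only real requirement is that far fewer than MAX_SEQUENCE records can arrive within one timestamp tick.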

      Risks and Assumptions

      The counter must be added in a way that does not impact performance. This seems feasible, since we already add fields based on actions much more complex than incrementing a counter.

      There is a chance of mis-ordered logs when the counter wraps around to 0. With a 32-bit counter that happens roughly once every 4.3 billion logs, which seems an acceptable risk (see the estimate below).
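
      A back-of-envelope estimate of the wrap interval, assuming a (generous) sustained rate of one million log lines per second:

        WRAP = 2**32      # 4,294,967,296 distinct sequence values
        RATE = 1_000_000  # assumed worst-case log lines per second

        seconds_per_wrap = WRAP / RATE
        print(f"{seconds_per_wrap:.0f} s ≈ {seconds_per_wrap / 60:.0f} minutes per wrap")
        # -> 4295 s ≈ 72 minutes per wrap

      Even at that rate the counter wraps only about once an hour, and a wrap can mis-order logs only if it falls inside a run of records sharing one timestamp.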

      Documentation Considerations

      Add the new `openshift.sequence` field to the data model documentation.

      Open Questions

      Elasticsearch has a `_seq_no` metadata field of its own. It is possible this could be used to solve the problem entirely within Elasticsearch and Kibana, without the feature described here.

      Research needed.

Assignee: Unassigned
Reporter: Alan Conway (rhn-engineering-aconway)