Logs out of order
Epic · Resolution: Done
Goals
Allow users of Elasticsearch and Kibana to list logs in the order they were written, by including a sequence number in each log record. Kibana queries would sort first by the existing timestamp, then use the sequence number to order logs with identical timestamps.
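As a rough illustration, the following Go snippet builds an Elasticsearch search body with the proposed two-key sort. The `@timestamp` field name is an assumption for this sketch; only `openshift.sequence` comes from this proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Search body sorting by timestamp first, then by the proposed
	// sequence number to break ties between identical timestamps.
	query := map[string]interface{}{
		"query": map[string]interface{}{"match_all": map[string]interface{}{}},
		"sort": []interface{}{
			map[string]string{"@timestamp": "asc"},         // assumed timestamp field name
			map[string]string{"openshift.sequence": "asc"}, // proposed field
		},
	}
	body, _ := json.MarshalIndent(query, "", "  ")
	// POST this body to <elasticsearch>/<log-index>/_search.
	fmt.Println(string(body))
}
```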
Motivation
Logs are shipped to Elasticsearch in the order the application wrote them, but Elasticsearch does not record the order in which they arrive.
Kibana tries to restore the order by sorting on the timestamp, but this is not reliable: log timestamps often have millisecond resolution or worse, so two or more logs can carry the same timestamp.
Alternatives
Using a nanosecond timestamp was suggested as an alternative, but it is less reliable and more complex than a sequence number.
- The existing timestamps are written by the original application; the logging stack has no control over their resolution.
- An additional timestamp written by the log collector is only as accurate as the collector implementation (fluentd) makes it. Just because a timestamp has enough digits to hold nanoseconds does not mean it has real nanosecond precision.
- PC clocks are not nanosecond-precise; a resolution between 10 and 100 microseconds is common. If an application can write duplicate messages within 1 millisecond, it can probably do so within 0.1 milliseconds too.
- Polling collectors like fluentd typically check the clock once per I/O action (read/write) and reuse that timestamp while processing the entire read, to avoid the overhead of multiple clock system calls for a single I/O system call. As a result, log lines read in one batch are likely to get duplicate timestamps even if the host clock is precise enough to distinguish the times they were processed (see the sketch after this list).
- Incrementing a counter for each log line is much less expensive than checking the system clock multiple times per read.
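A minimal sketch of the batching behavior described above, with hypothetical names: every line in one read shares a single timestamp (one clock call per read), while a per-line counter still distinguishes them.

```go
package main

import (
	"fmt"
	"time"
)

var sequence uint64 // per-collector counter, incremented once per log line

// processBatch stamps every line from a single read with the same
// timestamp, but gives each line a distinct sequence number.
func processBatch(lines []string) {
	now := time.Now() // the only clock system call for this batch
	for _, line := range lines {
		sequence++
		fmt.Printf("%s seq=%d %s\n", now.Format(time.RFC3339Nano), sequence, line)
	}
}

func main() {
	// Both lines get identical timestamps but distinct sequence numbers.
	processBatch([]string{"starting worker", "worker started"})
}
```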
Acceptance Criteria
- Log records include a new integer field, `openshift.sequence`.
- Sequence values increase to a maximum and then wrap around to 0.
- The maximum is large enough that wrapping from 0 back around to 0 takes much longer than one second, so logs with duplicate timestamps (even at second resolution) always have distinct sequence numbers. A sketch of such a counter follows this list.
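A minimal sketch of a wrapping counter meeting these criteria, assuming a 32-bit maximum (the value discussed under risks below). The field name `openshift.sequence` comes from the criteria; everything else is illustrative.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const maxSequence = 1<<32 - 1 // assumed 32-bit counter; see risk discussion below

var counter uint64

// nextSequence returns the next sequence value, wrapping back to 0 after
// the maximum. The atomic increment keeps it safe for concurrent pipelines.
func nextSequence() uint64 {
	return atomic.AddUint64(&counter, 1) % (maxSequence + 1)
}

// annotate adds the proposed field to a log record.
func annotate(record map[string]interface{}) {
	record["openshift.sequence"] = nextSequence()
}

func main() {
	rec := map[string]interface{}{"message": "hello"}
	annotate(rec)
	fmt.Println(rec)
}
```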
Risk and Assumptions
The counter must be added in a way that does not hurt performance. This seems feasible, since the collector already adds fields using operations far more complex than incrementing a counter.
There is a chance of mis-ordered logs when the counter wraps around to 0. With a 32-bit counter this happens roughly every 4.3 billion (2^32) log records; even at 100,000 records per second that is about half a day between wraps, so it seems an acceptable risk.
Documentation Considerations
Add the new field to the data model documentation.
Open Questions
Elasticsearch has a sequence-number field of its own, `_seq_no`. It may be possible to use it to solve the problem entirely within Elasticsearch and Kibana, without the feature described here.
Research is needed.