Originally, Elasticsearch was introduced to store logs for a short period of time while offering excellent querying capabilities.
However, Elasticsearch offers far more than we need and is therefore quite “heavy” for the specific use case we support today and want to support in the future. Furthermore, after working with Elasticsearch for a long time, we made the following observations:
- Elasticsearch is very difficult to operate without the necessary experience, for our customers as well as for our support organization and SRE team.
- Since we can't really limit how someone uses Elasticsearch/Kibana, customers end up using more than we expected or allowed.
- Contributing back is difficult, and with the new license change it will be nearly impossible for us.
Additionally, we want to provide capabilities that our users have long been asking for, such as alerting on logs or correlating logs with other signals like metrics (e.g. showing the logs related to a particular alert). Although this is also possible with Elasticsearch, we might need to make many compromises, since some of the needed features are only part of the Enterprise edition.
Loki looks promising and, after some initial internal validation, we believe that it could meet most, if not all, of the requirements we have now and others that may arise in the future.
We identified the following advantages of using Loki:
- Loki does exactly what we want without additional overhead such as indexing, which aligns better with our general vision for Logging.
- Loki is not the holy grail when it comes to maintenance and operational costs, but its nature helps minimize both, as it is built on a much simpler concept.
- Loki is built in a similar way to Prometheus and uses concepts that make correlation with, and integration into, an existing Prometheus ecosystem much easier.
- Loki is an open source project that already has a much bigger community. We have already been able to contribute to its code base, and Grafana Labs enables us to continue doing so.
- Reduce the number of Elasticsearch deployments used as the default log management storage in all OCP clusters.
  - Key Metric: # of Elasticsearch/Kibana CRDs in all subscribed clusters (excluding any internal)
- Grow the number of Loki deployments used as the default log storage.
  - Key Metric: # of Loki CRDs in all subscribed clusters (excluding any internal)
- Offer a stable managed Loki service for our internal customers as part of Observatorium and the Red Hat Observability Service (INTERNAL only).
  - Key Metric: Uptime
- Provide a simple option to configure a Loki-based log storage instead of using Elasticsearch.
- Provide log forwarding capabilities to a Loki instance (managed by Red Hat or externally by customers).
- Provide monitoring capabilities (e.g. rules and runbooks, as well as dashboards) to expose the health of the Loki cluster in use.
Access to Loki and exploring logs is part of