Epic: Loki - Zone-Aware Replication
Resolution: Done-Errata
Priority: Critical
Status: Done
Story Points: 13
Parent: OBSDA-312 - EFK ES cluster node distribution on multiple zones
Progress: 0% To Do, 0% In Progress, 100% Done
Release Note Text: With this update, administrators can manage zone-aware data replication in LokiStack to enhance reliability in the event of a zone failure.
Feature
Goals
- Use available pod placement primitives (i.e. Pod Topology Spread Constraints) to spread Loki component pods across availability zones (see the manifest sketch after this list).
- Replicate ingested logs across replicas in different availability zones.
- Spread query capabilities across availability zones.
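For reference, a minimal sketch of the Kubernetes primitive named in the first goal: a plain Deployment using topologySpreadConstraints, not the LokiStack API itself. The loki-ingester name, labels, and image tag are illustrative only.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki-ingester   # illustrative component name
spec:
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/component: ingester
  template:
    metadata:
      labels:
        app.kubernetes.io/component: ingester
    spec:
      # Spread the replicas evenly across availability zones; maxSkew: 1
      # tolerates at most one pod of imbalance between any two zones.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: ingester
      containers:
        - name: loki
          image: grafana/loki:2.9.2
```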
Non-Goals
- Enable creating/modifying custom topologies via the LokiStack API
Motivation
As described in previous requests (RFE-1215), our customers run the OpenShift Logging stack on clusters that span multiple availability zones. Historically, the OpenShift Logging storage services have not leveraged such pod placement configurations, with the result that the entire stack failed to store logs during an availability zone failure. Given that our new LokiStack-based storage has built-in support for zone-aware data replication, we want to expose this capability to OpenShift Logging as simply as possible, as sketched below.
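To illustrate how small that exposed surface could be, a zone-aware LokiStack spec might look like the following sketch. The replication block and its zones/topologyKey/maxSkew fields are a hypothetical shape for the minimal option set, not a committed API, and the storage secret and storage class names are examples.

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.medium
  storage:
    secret:
      name: logging-loki-s3   # illustrative object storage secret
      type: s3
  storageClassName: gp3-csi
  # Hypothetical zone-awareness option: replicate data across the zones
  # discovered via the standard Kubernetes zone topology key.
  replication:
    factor: 2
    zones:
      - topologyKey: topology.kubernetes.io/zone
        maxSkew: 1
```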
Alternatives
The only alternative is to run one log storage instance (e.g. one LokiStack) per availability zone. This has a couple of implications that might not be acceptable in every case:
- The resource requirements multiply, since each availability zone runs its own LokiStack.
- As of today, manual intervention is required to flip the collectors between the available LokiStacks per availability zone.
Acceptance Criteria
- The data replication of all LokiStack sizes ensures that log ingestion for all tenants is spread across availability zones.
- Query capabilities are at least partially preserved in the event of a single availability zone failure.
Risks and Assumptions
- [RISK] Running the memberlist ring across zones adds latency to ring operations. This is something we need to benchmark thoroughly.
- [RISK] Running queriers and index-gateways across availability zones means a non-uniform distribution, e.g. 1x.medium declares 3 queriers, which means 1 in zone a and 2 in zone b, plus a single index-gateway replica per zone. In case of a zone failure we therefore lose HA in the surviving zone too, as only 1/2 queriers with one index-gateway are left over. We might need to expand the LokiStack sizes when a user enables zone-aware replication, to ensure that we still maintain HA inside the available zone after a zone failure (see the ring configuration sketch below).
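For context on the ring behavior behind both risks, upstream Loki/Cortex already expose zone awareness in the ingester ring. A hedged configuration sketch follows; exact keys can differ between versions, and the zone value is an example.

```yaml
# Loki configuration excerpt: zone-aware replication in the ingester ring.
# With zone awareness enabled, each datum's replicas are placed in
# distinct zones, so reads can still reach a quorum after a zone failure.
ingester:
  lifecycler:
    # Zone this ingester reports to the ring; on OpenShift this would be
    # derived from the node's topology.kubernetes.io/zone label.
    availability_zone: us-east-1a
    ring:
      kvstore:
        store: memberlist
      replication_factor: 3
      zone_awareness_enabled: true
```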
Documentation Considerations
- We need a good introduction on how users familiar with Pod Topology Spread Constraints can map their cluster configuration to our API (see the node label example after this list).
- We do not intend to let API users define topology spread constraints per Loki component pod template. We want to limit this to a minimal viable set of options, leaving general configuration to the cluster administrator.
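As a concrete starting point for that introduction, both the spread constraints and any LokiStack zone option ultimately key on the well-known topology labels the cloud provider already sets on each node. The node name and zone/region values below are illustrative.

```yaml
# Well-known topology labels as they appear on a cloud-provisioned node.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal
  labels:
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
```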
Open Questions
- Does the ingester fail hard when data replication across zones fails?
- Does the querier make use of the zone-awareness of the ingesters when querying recent data from them?
Additional Notes
- Loki/Cortex Zone-Aware-Replication
- OpenShift Docs: Controlling pod placement by using pod topology spread constraints
- Kubernetes: Pod Topology Spread Constraints
- Previous RFEs for Elasticsearch-based Log Storage: RFE-1215
- grafana/loki#7923: Zone-Aware JSONNet Configuration
Links
- links to: RHBA-2023:6139 Logging Subsystem 5.8.0 - Red Hat OpenShift