OpenShift Logging / LOG-3266

Loki - Zone-Aware Replication


Details

    • Type: Epic
    • Resolution: Done-Errata
    • Priority: Critical
    • Fix Version/s: Logging 5.8.0
    • Component/s: Log Storage, Loki
    • Epic Name: Loki - Zone-Aware Replication
    • Parent Link: OBSDA-312 - EFK ES cluster node distribution on multiple zones
    • Release Note Text: With this update, you can manage zone-aware data replication as an administrator in LokiStack, in order to enhance reliability in the event of a zone failure.
    • Release Note Type: Feature

    Description

      Goals

      • Use available pod placement primitives (i.e. Pod Topology Spread Constraints) to spread Loki component pods across availability zones (see the sketch after this list).
      • Ingest logs across replicas in different availability zones.
      • Spread query capabilities across availability zones.
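
      For illustration, a minimal sketch of how such a topology spread constraint could look on a Loki ingester pod template; the component label and values here are assumptions for illustration, not the operator's actual generated spec:

        # Pod template fragment: spread ingester replicas evenly across zones.
        topologySpreadConstraints:
        - maxSkew: 1                                  # allow at most 1 replica of difference between zones
          topologyKey: topology.kubernetes.io/zone    # well-known node label for availability zones
          whenUnsatisfiable: DoNotSchedule            # hard constraint: keep pods pending rather than skewing
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: ingester   # assumed component label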

      Non-Goals

      • Enable creating/modifying custom topologies via the LokiStack API

      Motivation

      As described in previous requests such as RFE-1215, our customers run the OpenShift Logging stack on clusters that span multiple availability zones. Historically, the OpenShift Logging storage services have not leveraged such pod placement configurations, with the result that the entire stack failed to store logs in case of an availability zone failure. Given that our new stack based on LokiStack has built-in support for zone-aware data replication, we want to expose this capability to OpenShift Logging as simply as possible.

      Alternatives

      The only alternative is to run one log storage instance (e.g. a LokiStack) per availability zone. This has a couple of implications that might not be acceptable in every case:

      1. Running one LokiStack per zone doubles the resource requirements.
      2. As of today, manual intervention is required to flip the collectors between the available LokiStacks per availability zone.

      Acceptance Criteria

      • The data replication of all LokiStack sizes spreads log ingestion for all tenants across availability zones (see the sketch after this list).
      • The query capabilities are partially retained even in the event of a single availability zone failure.
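
      The cross-zone ingestion criterion builds on Loki's own ring zone awareness. A sketch of the loki.yaml fragment the operator would have to render, assuming the upstream lifecycler/ring options (exact keys may differ between Loki versions):

        ingester:
          lifecycler:
            # Zone reported by this ingester to the ring; the operator would derive it
            # from the node's topology.kubernetes.io/zone label (assumption).
            availability_zone: us-east-1a
            ring:
              kvstore:
                store: memberlist
              replication_factor: 3          # write each stream to 3 ingesters
              zone_awareness_enabled: true   # place the replicas in distinct zones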

      Risks and Assumptions

      • [RISK] Running the memberlist ring across zones adds latency to ring operations in our setup. This is something we need to benchmark thoroughly.
      • [RISK] Running queriers and index gateways across availability zones does not yield a uniform distribution, e.g. 1x.medium declares 3 queriers, which means 1 in zone a and 2 in zone b, plus a single index-gateway replica per zone. In case of a zone failure we therefore lose HA in the remaining zone too, as only 1-2 queriers with one index gateway are left over. We might need to expand the LokiStack sizes when a user enables zone-aware replication, to ensure that after a zone failure we still maintain HA inside the available zone.

      Documentation Considerations

      • We need a good introduction on how users of Pod Topology Spread Constraints can apply their cluster configuration to our API.
      • We do not intend to let API users define topology spread constraints per Loki component pod template. We want to limit this to the minimum viable set of options, leaving general configuration to the cluster administrator (see the sketch after this list).
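
      As a sketch of such a minimum viable option set, the LokiStack custom resource could expose something along these lines; the fields under spec.replication (and the secret/storage class names) are assumptions for illustration, with the final API shape defined by the stories of this epic:

        apiVersion: loki.grafana.com/v1
        kind: LokiStack
        metadata:
          name: logging-loki
          namespace: openshift-logging
        spec:
          size: 1x.medium
          storage:
            secret:
              name: logging-loki-s3            # assumed object storage secret
              type: s3
          storageClassName: gp3-csi            # assumed storage class
          replication:                         # assumed minimal zone-awareness knob
            factor: 2                          # cross-zone data replication factor
            zones:
            - topologyKey: topology.kubernetes.io/zone
              maxSkew: 1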

      Open Questions

      1. Does the ingester fail hard when data replication across zones fails?
      2. Does the querier make use of the zone-awareness of the ingesters when querying recent data from them?

      Additional Notes


            People

              Assignee: Periklis Tsirakidis (ptsiraki@redhat.com)
              Reporter: Periklis Tsirakidis (ptsiraki@redhat.com)
              Anping Li
              Libby Anderson
              Votes: 0
              Watchers: 7
