[RFE-2703] OCP should alarm/alert when the etcd container memory consumption goes beyond 90%

Type: Feature Request
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: etcd
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Color Status: Not Selected
PX Impact Score:
PX Priority Data:
PX Review Complete:
Target Version: openshift-4.12

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

1. Proposed title of this feature request
--> Alert generation when the etcd container memory consumption goes beyond 90%

2. What is the nature and description of the request?
--> When the etcd database grows rapidly because an application or workload generates a large number of objects such as secrets, events, or configmaps, the memory and CPU consumption of the API server and etcd containers (control plane components) spikes. Eventually the control plane nodes hang, become unresponsive, or crash with out-of-memory errors as critical processes and services running on the master nodes get killed. We therefore request an alert/alarm when the etcd container's memory consumption goes beyond 90%, so that the cluster administrator can take action before the cluster or its nodes become unresponsive.

We already have an etcdExcessiveDatabaseGrowth Prometheus rule, which fires when a surge in etcd writes leads to a 50% increase in database size over the past four hours on an etcd instance. However, it does not consider memory consumption:

$ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9

alert: etcdExcessiveDatabaseGrowth
annotations:
  description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
    leading to 50% increase in database size over the past four hours on etcd
    instance {{ $labels.instance }}, please check as it might be disruptive.'
expr: |
  increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
for: 10m
labels:
  severity: warning
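
For comparison, a memory-based rule might look roughly like the sketch below. This is only an illustration of the request, not an existing or proposed rule: the alert name etcdContainerHighMemoryUsage, the 10m hold time, and the warning severity are made up here, and the expression assumes that the etcd container is reported by cAdvisor under namespace="openshift-etcd", container="etcd" with a node label, and that the node_exporter instance label equals the node name so the two series can be joined.

```yaml
# Hypothetical rule: name, threshold window, and label join are assumptions.
- alert: etcdContainerHighMemoryUsage
  annotations:
    description: 'etcd container on node {{ $labels.node }} is using
      {{ $value | humanize }}% of the node memory, please check before the
      control plane node becomes unresponsive.'
  expr: |
    (
      # Working-set memory of the etcd container, per node.
      sum by (node) (
        container_memory_working_set_bytes{namespace="openshift-etcd", container="etcd"}
      )
      /
      # Total node RAM; copies "instance" into "node" so both sides match
      # (assumes the instance label holds the node name).
      sum by (node) (
        label_replace(node_memory_MemTotal_bytes, "node", "$1", "instance", "(.*)")
      )
    ) * 100 > 90
  for: 10m
  labels:
    severity: warning
```

The ratio is taken against total node RAM because the business justification in section 3 is expressed in terms of total RAM; if the etcd container had a memory limit set, the denominator could be that limit instead.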

3. Why does the customer need this? (List the business requirements here)
--> etcd is a system-critical container. Once its memory consumption goes beyond 90-95% of total RAM, the OCP cluster becomes unresponsive, causing revenue loss to the business and impacting the productivity of users of the OpenShift cluster.

4. List any affected packages or components.
--> etcd

causes

ETCD-267 OCP should alarm/alert when the etcd container memory consumption goes beyond 90%

Closed

is related to

API-1391 Alert on control plane memory usage

Closed

links to

openshift/cluster-etcd-operator#807: RFE-2703: add prometheus rules for etcd memory usage

openshift/machine-config-operator#3124: RFE-2703: add prometheus rules for master memory usage

openshift/runbooks#53: RFE-2703: add prometheus rules for master memory usage


Assignee: Mustafa Elbehery
Reporter: Abhijeet Sadawarte
Votes: 0
Watchers: 12
Created: 2022/03/21 12:23 PM
Updated: 2025/03/07 5:17 PM
