OpenShift Request For Enhancement: RFE-2703

OCP should alarm/alert when the etcd container memory consumption goes beyond 90%

    • Type: Feature Request
    • Resolution: Done
    • Priority: Major
    • etcd

      1. Proposed title of this feature request
      --> Alert generation when the etcd container memory consumption goes beyond 90%

      2. What is the nature and description of the request?
      --> When the etcd database grows rapidly because an application or workload creates a large number of objects (secrets, events, or configmaps), the memory and CPU consumption of the API server and etcd containers (control plane components) spikes. Eventually the control plane nodes hang, become unresponsive, or crash with out-of-memory errors, because critical processes and services running on the master nodes get killed. We therefore request an alert when the etcd container's memory consumption goes beyond 90%, so that the cluster administrator can act before the cluster or its nodes become unresponsive.

      An etcdExcessiveDatabaseGrowth Prometheus rule already exists. It fires when a surge in etcd writes leads to a 50% increase in database size over the past four hours on an etcd instance, but it does not consider memory consumption:

      $ oc get prometheusrules etcd-prometheus-rules -o yaml|grep -i etcdExcessiveDatabaseGrowth -A 9

      - alert: etcdExcessiveDatabaseGrowth
        annotations:
          description: 'etcd cluster "{{ $labels.job }}": Observed surge in etcd writes
            leading to 50% increase in database size over the past four hours on etcd
            instance {{ $labels.instance }}, please check as it might be disruptive.'
        expr: |
          increase(((etcd_mvcc_db_total_size_in_bytes/etcd_server_quota_backend_bytes)*100)[240m:1m]) > 50
        for: 10m
        labels:
          severity: warning
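
      For illustration only, the requested memory-based rule might look roughly like the sketch below. It is not an existing OpenShift rule. The alert name etcdHighMemoryUsage is hypothetical, and the expression assumes that the cAdvisor metrics container_memory_working_set_bytes and machine_memory_bytes are scraped from the same kubelet endpoint (so they share the instance label) and that the etcd container runs in the openshift-etcd namespace. The 90% threshold comes from this request; the 10m duration and warning severity simply mirror the rule above.

      ```yaml
      # Hypothetical rule, a sketch under the assumptions stated above.
      - alert: etcdHighMemoryUsage   # hypothetical name, not shipped with OCP
        annotations:
          description: 'etcd container memory on {{ $labels.instance }} is above
            90% of the total memory of its node, please check before the control
            plane becomes unresponsive.'
        expr: |
          # Working-set memory of the etcd container, as a percentage of the
          # node's total memory. Label names are assumptions; verify them
          # against the metrics actually present in the cluster.
          sum by (instance) (container_memory_working_set_bytes{namespace="openshift-etcd", container="etcd"})
            / on (instance)
          sum by (instance) (machine_memory_bytes)
            * 100 > 90
        for: 10m
        labels:
          severity: warning
      ```

      Whether the percentage should be taken against node memory, as assumed here, or against some other baseline is left open by the request and would need to be decided before such a rule is adopted.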

      3. Why does the customer need this? (List the business requirements here)
      --> etcd is a system-critical container. Once its memory consumption goes beyond 90-95% of total RAM, the OCP cluster becomes unresponsive, causing revenue loss to the business and reducing the productivity of the OpenShift cluster's users.

       

      4. List any affected packages or components.
      --> etcd

            melbeher@redhat.com Mustafa Elbehery
            rhn-support-asadawar Abhijeet Sadawarte