- Bug
- Resolution: Done
- Major
- Logging 5.3.12
- False
- None
- False
- NEW
- VERIFIED
- Log Collection - Sprint 227, Log Collection - Sprint 228
- Important
Description of problem:
The collector pod contains 2 containers:
- Starts the fluentd process
- Starts the `/usr/local/bin/log-file-metric-exporter` process
If we review both containers, only the first has CPU and memory requests and limits, and those can be managed from the ClusterLogging Operator:
    - name: COLLECTOR_CONF_HASH
      value: fb4ebfa073fd0ea24153c48f22abdaa9
    image: registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:1140e317d111e13c4900c1b6d128c5fdef05b9f319b0bd693665d67f3139d03a
    imagePullPolicy: IfNotPresent
    name: collector
    ports:
    - containerPort: 24231
      name: metrics
      protocol: TCP
    resources:                 <---------------- this is limited as set in the ClusterLogging instance
      limits:
        memory: 2Gi
      requests:
        cpu: 100m
        memory: 1Gi
but the second container, which runs the `/usr/local/bin/log-file-metric-exporter` process, has no limits/requests set by default, and they cannot even be set from the ClusterLogging Operator:
    - command:
      - /usr/local/bin/log-file-metric-exporter   <----------- the same process seen in the output from the node consuming 8GB of RAM
      - ' -verbosity=2'
      - ' -dir=/var/log/containers'
      - ' -http=:2112'
      - ' -keyFile=/etc/fluent/metrics/tls.key'
      - ' -crtFile=/etc/fluent/metrics/tls.crt'
      image: registry.redhat.io/openshift-logging/log-file-metric-exporter-rhel8@sha256:2f43018b00df04dcdb0eebb7ae90e91dd60970494d13fd0851d91b996c8b0daf
      imagePullPolicy: IfNotPresent
      name: logfilesmetricexporter
      ports:
      - containerPort: 2112
        name: logfile-metrics
        protocol: TCP
      resources: {}            <----------------- no limits, and no option in the ClusterLogging CR instance to set them
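For context, this is roughly how the collector container resources above are driven from the ClusterLogging instance (a minimal sketch; field layout as documented for Logging 5.x, values mirroring the ones shown above). There is no equivalent stanza for the logfilesmetricexporter container:

apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  collection:
    logs:
      type: fluentd
      fluentd:
        resources:
          limits:
            memory: 2Gi
          requests:
            cpu: 100m
            memory: 1Gi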
Then, for an unknown reason, the `/usr/local/bin/log-file-metric-exporter` process started to increase its memory usage until it was consuming 8GB. At that point the master OCP node began to have serious issues that degraded performance across the cluster, since etcd on that node started answering with very high response times.
The memory usage of the process was detected in a sosreport:
Top MEM-using processes:

  USER  PID      %CPU  %MEM  VSZ-MiB  RSS-MiB  TTY  STAT  START  TIME      COMMAND
  root  7047      8.1  41.3     9616     8295    ?     -  Mar10  28603:18  /usr/local/bin/log-file-metric-exporter -verbosity=2 -dir=/var/log/containers
  root  1531508  20.9  10.3     2859     2080    ?     -  Nov08  299:43    kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml
  root  3882      9.1   5.3    10252     1064    ?     -  Mar10  32027:28  etcd --logger=zap --log-level=info
Version-Release number of selected component (if applicable):
cluster-logging.5.3.2-20
The same is happening in the latest version.
How reproducible:
Not able to reproduce the memory growth on demand, but it is easy to verify that the container running the `/usr/local/bin/log-file-metric-exporter` process has no limits and that there is no way to set them.
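A quick way to check this on a live cluster (a sketch, assuming the collector daemonset is named collector in the openshift-logging namespace):

  oc -n openshift-logging get daemonset/collector \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="logfilesmetricexporter")].resources}'

This confirms that no limits/requests are set for that container, while the collector container shows the values from the ClusterLogging instance.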
Actual results:
The `logfilesmetricexporter` container in the collector pods has no limits, which, for an unknown reason, allowed it to consume 8GB of RAM, impacting the node (a master) and the whole cluster.
Expected results:
The `logfilesmetricexporter` container should have limits/requests set by default so that it cannot consume resources without bound, and ideally there should be an option to set them from the ClusterLogging Operator.
Then, if something causes the process to start consuming excessive memory or CPU, the limits contain it.
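For illustration only, the kind of knob being requested could look like this in the ClusterLogging instance; the logFileMetricExporter field below does not exist in the current CRD and is purely hypothetical, as are the values:

spec:
  collection:
    logs:
      type: fluentd
      logFileMetricExporter:        # hypothetical field, shown only to illustrate the requested option
        resources:
          limits:
            memory: 128Mi
          requests:
            cpu: 10m
            memory: 32Mi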
Additional info:
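A possible stop-gap until defaults exist is to patch the daemonset directly. This is a sketch only, assuming the daemonset is named collector, that logfilesmetricexporter is the second container in the pod spec, that the ClusterLogging instance is set to Unmanaged so the operator does not revert the change, and with purely illustrative values:

  oc -n openshift-logging patch daemonset/collector --type=json -p '[
    {"op": "replace",
     "path": "/spec/template/spec/containers/1/resources",
     "value": {"limits": {"memory": "128Mi"},
               "requests": {"cpu": "10m", "memory": "32Mi"}}}
  ]'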