
Details

Type: Bug
Resolution: Duplicate
Priority: Undefined
Fix Version/s: None
Affects Version/s: None
Component/s: Log Storage
Labels: None

Blocked: False
Ready: False
Docs QE Status: NEW
QE Status: NEW

SFDC Cases Links:
SFDC Cases Counter:

Description

ES fails to ingest logs from Fluentd

~~~
2021-09-29T17:25:42.313046562Z at java.lang.Thread.run(Thread.java:829) ~[?:?]
2021-09-29T17:25:43.981581201Z [2021-09-29T17:25:43,981][INFO ][o.e.m.j.JvmGcMonitorService] [gc][16008] overhead, spent [383ms] collecting in the last [1s]
2021-09-29T17:25:48.846574017Z [2021-09-29T17:25:48,845][DEBUG][o.e.a.s.TransportSearchAction] [app-000172][1], node[EWwKdOjYSlWlE0RugF1AnQ], [P], s[STARTED], a[id=aFnyBYUxT6ec9YnhcQpJYQ]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[audit-000009, audit-000008, audit-000007, audit-000006, infra-000018, infra-000019, audit-000005, infra-000014, infra-000015, infra-000016, infra-000017, infra-000021, infra-000022, infra-000023, infra-000024, infra-000020, app-000170, app-000171, app-000176, app-000172, app-000173, app-000174, app-000175, app-000160, app-000165, audit-000023, app-000166, audit-000022, app-000167, audit-000021, app-000168, audit-000020, app-000161, app-000162, app-000163, app-000164, audit-000024, app-000169, audit-000019, audit-000018, audit-000017, infra-000007, audit-000012, infra-000008, audit-000011, infra-000009, audit-000010, app-000157, audit-000016, audit-000015, audit-000014, infra-000005, audit-000013, infra-000006, infra-000010, infra-000011, infra-000012, infra-000013, app-000158, app-000159], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=false, scroll=null, maxConcurrentShardRequests=25, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, source={"size":0,"query":{"range":{"@timestamp":{"from":"now-24h","to":"now","include_lower":true,"include_upper":true,"boost":1.0}}},"aggregations":{"Histogram":{"date_histogram":{"field":"@timestamp","interval":"hour","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0},"aggregations":{"top_namespaces":{"terms":{"field":"kubernetes.namespace_name","size":1000,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"namespace_count":{"cardinality":{"field":"kubernetes.namespace_name"}}}}}}}] lastShard [true]
2021-09-29T17:25:48.846574017Z org.elasticsearch.transport.RemoteTransportException: [elasticsearch-cd-yw22zn1v-1][10.131.8.111:9300][indices:data/read/search[phase/query]]
2021-09-29T17:25:48.846574017Z Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@21f9aa10 on QueueResizingEsThreadPoolExecutor[name = elasticsearch-cd-yw22zn1v-1/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 554.1ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@7052fb9[Running, pool size = 2, active threads = 2, queued tasks = 1298, completed tasks = 245348]]
2021-09-29T17:25:48.846574017Z at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
~~~

~~~
> cat es/cluster-elasticsearch/health
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1632936727 17:32:07 elasticsearch green 5 5 620 310 0 0 0 0 - 100.0%
~~~

~~~
logStore:
  elasticsearch:
    nodeCount: 5
    proxy:
      resources:
        limits:
          memory: 256Mi
        requests:
          memory: 256Mi
    redundancyPolicy: SingleRedundancy
    resources:
      limits:
        memory: 36G
      requests:
        memory: 36G
    storage:
      size: 500G
      storageClassName: perf-no-snap
~~~

Fluentd Queue
~~~
$ for i in $(oc get pods -l component=fluentd | awk '/fluentd/ { print $1 }') ; do oc exec $i -- du -sh /var/lib/fluentd ; done

Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-2kwkc -n openshift-logging' to see all of the containers in this pod.
40G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-57htx -n openshift-logging' to see all of the containers in this pod.
25G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-7gc7c -n openshift-logging' to see all of the containers in this pod.
17G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-8kbh2 -n openshift-logging' to see all of the containers in this pod.
17G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-bqp2g -n openshift-logging' to see all of the containers in this pod.
40G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-c94br -n openshift-logging' to see all of the containers in this pod.
23G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-hstqg -n openshift-logging' to see all of the containers in this pod.
21G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-j4rcx -n openshift-logging' to see all of the containers in this pod.
23G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-jxhq4 -n openshift-logging' to see all of the containers in this pod.
15G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-mmtdr -n openshift-logging' to see all of the containers in this pod.
20G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-mql67 -n openshift-logging' to see all of the containers in this pod.
23G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-ph4vz -n openshift-logging' to see all of the containers in this pod.
22G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-rqcp2 -n openshift-logging' to see all of the containers in this pod.
21G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-vknsb -n openshift-logging' to see all of the containers in this pod.
16G /var/lib/fluentd
Defaulting container name to fluentd.
Use 'oc describe pod/fluentd-wqvqh -n openshift-logging' to see all of the containers in this pod.
22G /var/lib/fluentd
~~~
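
To make the backlog easier to compare between runs, the listing above can be reduced to a pod count and a total. This is only a sketch: the function name `total_fluentd_buffer_g` is made up for illustration, and it assumes every size is reported with a `G` suffix, as in the output above.

```shell
# Sum the per-pod sizes reported by `du -sh /var/lib/fluentd`.
# Assumption: every size carries a G suffix, as in the listing above;
# other units (M, T) are skipped rather than converted in this sketch.
total_fluentd_buffer_g() {
  awk '$2 == "/var/lib/fluentd" && $1 ~ /^[0-9.]+G$/ {
         sub(/G$/, "", $1)   # strip the unit so the value is numeric
         total += $1
         pods++
       }
       END { printf "%d pods, %dG total\n", pods, total }'
}
```

The output of the loop above can be piped straight into the function. The "Defaulting container name" and "Use 'oc describe ...'" lines are ignored because their second field is not `/var/lib/fluentd`. Passing `-c fluentd` to `oc exec` should also avoid those messages in the first place.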

Version-Release number of selected component (if applicable):

installedCSV: cluster-logging.5.2.1-5
installedCSV: elasticsearch-operator.5.2.1-5

OCP 4.7

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

The Fluentd queues are very large (15G to 40G per pod) because ES is not ingesting the logs, as shown by the error above, and Kibana does not show logs.

Expected results:
Kibana to show logs

Additional info:

Thread pool
~~~
node_name                     name    active  queue  rejected
elasticsearch-cdm-o72himb8-2  search  2       206    103124
elasticsearch-cd-yw22zn1v-2   search  2       1155   85794
elasticsearch-cdm-o72himb8-3  search  2       1056   520289
~~~
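
A quick way to read the thread pool table is to flag the nodes whose search queue is above the configured capacity and to total the rejections. The sketch below assumes the five-column layout shown above (node_name, name, active, queue, rejected). The function name `summarize_search_pool` is hypothetical, and the default capacity of 1000 is taken from the `queue capacity = 1000` value in the rejection error.

```shell
# Flag search thread pools whose queue exceeds the capacity and total the rejections.
# Input rows: node_name name active queue rejected (header row is skipped
# because its second column is "name", not "search").
# $1 (optional): queue capacity, defaults to 1000 as reported in the ES error.
summarize_search_pool() {
  awk -v cap="${1:-1000}" '
    $2 == "search" && $5 ~ /^[0-9]+$/ {
      rejected += $5
      if ($4 + 0 > cap + 0)
        print $1 ": queue " $4 " exceeds capacity " cap
    }
    END { printf "total rejected: %d\n", rejected }'
}
```

Fed with the three rows above, this would flag elasticsearch-cd-yw22zn1v-2 and elasticsearch-cdm-o72himb8-3, whose queues (1155 and 1056) are over 1000.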

OCP 4.7 ES failing with Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable
