Uploaded image for project: 'OpenShift Logging'
  1. OpenShift Logging
  2. LOG-1897

Applying cluster state is causing elasticsearch to hit an issue and become unusable

XMLWordPrintable

    • False
    • False
    • NEW
    • VERIFIED
    • Hide
      Before this update, the Elasticsearch Prometheus exporter plugin was compiling index-level metrics using a high-cost query which had a negative performance impact on the processing Elasticsearch node. With this update, this query has been replaced with a low-impact query that will not impact performance.
      Show
      Before this update, the Elasticsearch Prometheus exporter plugin was compiling index-level metrics using a high-cost query which had a negative performance impact on the processing Elasticsearch node. With this update, this query has been replaced with a low-impact query that will not impact performance.
    • Logging (LogExp) - Sprint 210

      In the past two days we experience this issue on two different OpenShift Container Platform 4.7 - Cluster, running Cluster Logging v5.2.2-21.

      In Cluster Logging with 6 or more elasticsearch Cluster members we suddenly noticed that one elasticsearch pod/container was reporting not ready state. When checking details, we found that it was heavily loaded with regards to CPU (44 Load Avg over 1 Minute - while the other pods were at 4 Load Avg over 1 Minute).

      When checking the logs we found the below:

      2021-10-27T20:29:45.775188986Z [2021-10-27T20:29:45,775][INFO ][o.e.c.s.ClusterApplierService] added {{elasticsearch-cdm-kv7dwayj-1}{Mt5-JbdeQcON8A2fwa6H2w}{kTEZzBK3Tfysg_Xy-EI7xg}{10.94.34.9}{10.94.34.9:9300},}, reason: apply cluster state (from master [master {elasticsearch-cdm-kv7dwayj-3}{vuPW92BIS3-6kb62pUIbyg}{sUzrH87WSBOxf0qiq4dXjQ}{10.94.19.9}{10.94.19.9:9300} committed version [71833]])
      2021-10-27T20:29:59.471509362Z [2021-10-27T20:29:59,471][WARN ][c.a.o.s.c.ConfigurationLoader] No data for audit while retrieving configuration for [config, roles, rolesmapping, internalusers, actiongroups, nodesdn, audit]  (index=.security)
      2021-10-27T20:29:59.490953334Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing on REST API is enabled.
      2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] [AUTHENTICATED, GRANTED_PRIVILEGES] are excluded from REST API auditing.
      2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing on Transport API is enabled.
      2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] [AUTHENTICATED, GRANTED_PRIVILEGES] are excluded from Transport API auditing.
      2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing of request body is enabled.
      2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Bulk requests resolution is disabled during request auditing.
      2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Index resolution is enabled during request auditing.
      2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Sensitive headers auditing is enabled.
      2021-10-27T20:29:59.491042435Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing requests from kibanaserver users is disabled.
      2021-10-27T20:29:59.491042435Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing of external configuration is disabled.
      2021-10-27T20:29:59.491053535Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing of internal configuration is disabled.
      2021-10-27T20:29:59.491063335Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing only metadata information for read request is disabled.
      2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing will watch {} for read requests.
      2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing read operation requests from kibanaserver users is disabled.
      2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing only metadata information for write request is disabled.
      2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing diffs for write requests is disabled.
      2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing write operation requests from kibanaserver users is disabled.
      2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing will watch <NONE> for write requests.
      2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] .security is used as internal security index.
      2021-10-27T20:29:59.491164337Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Internal index used for posting audit logs is null
      2021-10-27T23:03:02.106191331Z java.lang.UnsupportedOperationException
      2021-10-27T23:03:02.106454935Z 	at java.base/java.util.Collections$UnmodifiableMap.put(Collections.java:1457)
      2021-10-27T23:03:02.106454935Z 	at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.PermissionsInfoAction$1.accept(PermissionsInfoAction.java:103)
      2021-10-27T23:03:02.106491735Z 	at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.PermissionsInfoAction$1.accept(PermissionsInfoAction.java:87)
      2021-10-27T23:03:02.106491735Z 	at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:115)
      2021-10-27T23:03:02.106491735Z 	at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityRestFilter$1.handleRequest(OpenDistroSecurityRestFilter.java:94)
      2021-10-27T23:03:02.106491735Z 	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:240)
      2021-10-27T23:03:02.106491735Z 	at org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:336)
      2021-10-27T23:03:02.106491735Z 	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:174)
      2021-10-27T23:03:02.106491735Z 	at com.amazon.opendistroforelasticsearch.security.ssl.http.netty.ValidatingDispatcher.dispatchRequest(ValidatingDispatcher.java:76)
      2021-10-27T23:03:02.106491735Z 	at org.elasticsearch.http.netty4.Netty4HttpServerTransport.dispatchRequest(Netty4HttpServerTransport.java:551)
      2021-10-27T23:03:02.106505935Z 	at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:137)
      2021-10-27T23:03:02.106505935Z 	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
      2021-10-27T23:03:02.106516435Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      2021-10-27T23:03:02.106516435Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)2021-10-27T23:03:02.106541736Z 
      2021-10-27T23:03:02.106541736Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      2021-10-27T23:03:02.106541736Z 	at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
      2021-10-27T23:03:02.106541736Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      2021-10-27T23:03:02.106541736Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106541736Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      2021-10-27T23:03:02.106553236Z 	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
      2021-10-27T23:03:02.106553236Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      2021-10-27T23:03:02.106553236Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106564436Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      2021-10-27T23:03:02.106564436Z 	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
      2021-10-27T23:03:02.106564436Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)2021-10-27T23:03:02.106574636Z 
      2021-10-27T23:03:02.106574636Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106574636Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      2021-10-27T23:03:02.106584636Z 	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
      2021-10-27T23:03:02.106584636Z 	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
      2021-10-27T23:03:02.106595436Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      2021-10-27T23:03:02.106595436Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106595436Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)2021-10-27T23:03:02.106605737Z 
      2021-10-27T23:03:02.106605737Z 	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
      2021-10-27T23:03:02.106615737Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      2021-10-27T23:03:02.106615737Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106625537Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      2021-10-27T23:03:02.106625537Z 	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
      2021-10-27T23:03:02.106625537Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)2021-10-27T23:03:02.106635637Z 
      2021-10-27T23:03:02.106635637Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106635637Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      2021-10-27T23:03:02.106648637Z 	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1436)
      2021-10-27T23:03:02.106648637Z 	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203)
      2021-10-27T23:03:02.106648637Z 	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247)2021-10-27T23:03:02.106658937Z 
      2021-10-27T23:03:02.106658937Z 	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
      2021-10-27T23:03:02.106658937Z 	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
      2021-10-27T23:03:02.106695838Z 	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
      2021-10-27T23:03:02.106695838Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      2021-10-27T23:03:02.106695838Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106712638Z 	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
      2021-10-27T23:03:02.106712638Z 	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
      2021-10-27T23:03:02.106712638Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
      2021-10-27T23:03:02.106712638Z 	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
      2021-10-27T23:03:02.106723438Z 	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
      2021-10-27T23:03:02.106723438Z 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
      2021-10-27T23:03:02.106733738Z 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
      2021-10-27T23:03:02.106733738Z 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556)
      2021-10-27T23:03:02.106744039Z 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
      2021-10-27T23:03:02.106744039Z 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
      2021-10-27T23:03:02.106744039Z 	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)2021-10-27T23:03:02.106754139Z 
      2021-10-27T23:03:02.106754139Z 	at java.base/java.lang.Thread.run(Thread.java:829)
      [...]
      2021-10-28T06:31:22.034446019Z [2021-10-28T06:31:22,034][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336651] overhead, spent [282ms] collecting in the last [1s]
      2021-10-28T06:31:53.958301553Z [2021-10-28T06:31:53,933][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336682] overhead, spent [263ms] collecting in the last [1s]
      2021-10-28T06:33:56.385598998Z [2021-10-28T06:33:56,377][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336801] overhead, spent [265ms] collecting in the last [1s]
      2021-10-28T06:35:13.681730622Z [2021-10-28T06:35:13,681][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336876] overhead, spent [516ms] collecting in the last [1s]
      2021-10-28T06:35:38.848591269Z [2021-10-28T06:35:38,829][WARN ][r.suppressed             ] [elasticsearch-cd-a76lica6-2] path: /_prometheus/metrics, params: {}
      2021-10-28T06:35:38.848591269Z org.elasticsearch.ElasticsearchException: Search request failed
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.TransportNodePrometheusMetricsAction$AsyncAction$1.onFailure(TransportNodePrometheusMetricsAction.java:160) ~[?:?]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:238) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.onRejection(AbstractRunnable.java:63) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.TimedRunnable.onRejection(TimedRunnable.java:50) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onRejection(ThreadContext.java:741) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:104) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.execute(AbstractSearchAsyncAction.java:311) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.FetchSearchPhase.run(FetchSearchPhase.java:80) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:165) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.InitialSearchPhase.successfulShardExecution(InitialSearchPhase.java:254) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.InitialSearchPhase.onShardResult(InitialSearchPhase.java:242) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$innerOnResponse$0(InitialSearchPhase.java:215) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:187) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
      2021-10-28T06:35:38.848591269Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
      2021-10-28T06:35:38.848591269Z 	at java.lang.Thread.run(Thread.java:829) ~[?:?]
      2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: 
      2021-10-28T06:35:38.848591269Z 	... 23 more
      2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@5ab14b9 on QueueResizingEsThreadPoolExecutor[name = elasticsearch-cd-a76lica6-2/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.8s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@e87b0fa[Running, pool size = 19, active threads = 19, queued tasks = 1124, completed tasks = 23997624]]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
      2021-10-28T06:35:38.848591269Z 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
      2021-10-28T06:35:38.848591269Z 	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
      2021-10-28T06:35:38.848591269Z 	... 17 more
      2021-10-28T06:36:16.719970982Z [2021-10-28T06:36:16,714][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336937] overhead, spent [269ms] collecting in the last [1s]
      2021-10-28T06:37:45.756689890Z [2021-10-28T06:37:45,756][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337025] overhead, spent [270ms] collecting in the last [1s]
      2021-10-28T06:38:02.045178190Z [2021-10-28T06:38:02,045][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337041] overhead, spent [375ms] collecting in the last [1.1s]
      2021-10-28T06:38:21.008915287Z [2021-10-28T06:38:21,003][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337059] overhead, spent [340ms] collecting in the last [1s]
      2021-10-28T06:39:04.040390131Z [2021-10-28T06:39:04,040][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337101] overhead, spent [259ms] collecting in the last [1s]
      2021-10-28T06:39:11.170761807Z [2021-10-28T06:39:11,170][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337108] overhead, spent [446ms] collecting in the last [1s]
      2021-10-28T06:40:11.338977208Z [2021-10-28T06:40:11,330][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337167] overhead, spent [472ms] collecting in the last [1s]
      2021-10-28T06:41:09.358445472Z [2021-10-28T06:41:09,358][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337223] overhead, spent [339ms] collecting in the last [1.2s]
      2021-10-28T06:43:10.781494464Z [2021-10-28T06:43:10,781][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337340] overhead, spent [477ms] collecting in the last [1s]
      2021-10-28T06:44:14.139935462Z [2021-10-28T06:44:14,139][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337402] overhead, spent [442ms] collecting in the last [1s]
      2021-10-28T06:45:07.767251526Z [2021-10-28T06:45:07,761][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337454] overhead, spent [251ms] collecting in the last [1s]
      2021-10-28T06:45:14.793049841Z [2021-10-28T06:45:14,790][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337461] overhead, spent [354ms] collecting in the last [1s]
      2021-10-28T06:45:53.615394345Z [2021-10-28T06:45:53,615][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337499] overhead, spent [268ms] collecting in the last [1s]
      2021-10-28T06:46:12.412161308Z [2021-10-28T06:46:12,409][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337517] overhead, spent [353ms] collecting in the last [1.1s]
      2021-10-28T06:46:17.420294317Z [2021-10-28T06:46:17,420][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337522] overhead, spent [290ms] collecting in the last [1s]
      2021-10-28T06:48:16.348584435Z [2021-10-28T06:48:16,348][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337638] overhead, spent [495ms] collecting in the last [1.2s]
      2021-10-28T06:49:18.555172931Z [2021-10-28T06:49:18,554][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337699] overhead, spent [257ms] collecting in the last [1s]
      2021-10-28T06:49:46.192533578Z [2021-10-28T06:49:46,192][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337726] overhead, spent [345ms] collecting in the last [1s]
      2021-10-28T06:50:21.424803595Z [2021-10-28T06:50:21,424][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337760] overhead, spent [400ms] collecting in the last [1.2s]
      2021-10-28T06:51:25.162650797Z [2021-10-28T06:51:25,162][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337822] overhead, spent [396ms] collecting in the last [1s]
      

      Based on the error it looked like not enough resources were available. But it seemed strange that this would happen out of the blue and only affect one elasticsearch pod.

      Anyway, the elasticsearch pod restarted to get everything back and then the interesting effect happen. While the problematic elasticsearch pod came back and reported ready, we saw immediately another elasticsearch pod going not ready and reporting the exact same problem in the logs and again CPU usage went up crazy.

      So we ended up restarting one elasticsearch pod after the other as we constantly saw the effect moving around. Only when scaling all elasticsearch deployments to 0 and getting them back up we were able to recover the situation. After that all appears to be stable but of course this behavior is nasty and we need to understand what is happening here and how to prevent that.

      As mentioned this happen on two different environments in the past two days and we are certain that we have enough CPU and Memory resources available.

              ptsiraki@redhat.com Periklis Tsirakidis
              rhn-support-sreber Simon Reber
              Qiaoling Tang Qiaoling Tang
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated:
                Resolved: