- Bug
- Resolution: Done
- Critical
- Logging 5.2.2
- False
- False
- NEW
- VERIFIED
- Logging (LogExp) - Sprint 210
In the past two days we experienced this issue on two different OpenShift Container Platform 4.7 clusters running Cluster Logging v5.2.2-21.
In a Cluster Logging deployment with 6 or more Elasticsearch cluster members, we suddenly noticed that one Elasticsearch pod/container was reporting a not-ready state. When checking the details, we found that it was heavily loaded in terms of CPU (a 1-minute load average of 44, while the other pods were at a 1-minute load average of 4).
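For reference, the not-ready state and the load imbalance can be seen with standard oc commands along these lines (assuming the default openshift-logging namespace; <es-pod> is a placeholder for one of the Elasticsearch pod names):
# ready state of the Elasticsearch pods
$ oc get pods -n openshift-logging | grep elasticsearch-cd
# current CPU/memory usage per pod as reported by the metrics stack
$ oc adm top pods -n openshift-logging
# load average inside a single Elasticsearch container
$ oc exec -n openshift-logging <es-pod> -c elasticsearch -- cat /proc/loadavg
# container log of the affected pod
$ oc logs -n openshift-logging <es-pod> -c elasticsearch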
When checking the logs, we found the following:
2021-10-27T20:29:45.775188986Z [2021-10-27T20:29:45,775][INFO ][o.e.c.s.ClusterApplierService] added {{elasticsearch-cdm-kv7dwayj-1}{Mt5-JbdeQcON8A2fwa6H2w}{kTEZzBK3Tfysg_Xy-EI7xg}{10.94.34.9}{10.94.34.9:9300},}, reason: apply cluster state (from master [master {elasticsearch-cdm-kv7dwayj-3}{vuPW92BIS3-6kb62pUIbyg}{sUzrH87WSBOxf0qiq4dXjQ}{10.94.19.9}{10.94.19.9:9300} committed version [71833]])
2021-10-27T20:29:59.471509362Z [2021-10-27T20:29:59,471][WARN ][c.a.o.s.c.ConfigurationLoader] No data for audit while retrieving configuration for [config, roles, rolesmapping, internalusers, actiongroups, nodesdn, audit] (index=.security)
2021-10-27T20:29:59.490953334Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing on REST API is enabled.
2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] [AUTHENTICATED, GRANTED_PRIVILEGES] are excluded from REST API auditing.
2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing on Transport API is enabled.
2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] [AUTHENTICATED, GRANTED_PRIVILEGES] are excluded from Transport API auditing.
2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing of request body is enabled.
2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Bulk requests resolution is disabled during request auditing.
2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,490][INFO ][c.a.o.s.a.i.AuditLogImpl ] Index resolution is enabled during request auditing.
2021-10-27T20:29:59.491018935Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Sensitive headers auditing is enabled.
2021-10-27T20:29:59.491042435Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing requests from kibanaserver users is disabled.
2021-10-27T20:29:59.491042435Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing of external configuration is disabled.
2021-10-27T20:29:59.491053535Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing of internal configuration is disabled.
2021-10-27T20:29:59.491063335Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing only metadata information for read request is disabled.
2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing will watch {} for read requests.
2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing read operation requests from kibanaserver users is disabled.
2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing only metadata information for write request is disabled.
2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing diffs for write requests is disabled.
2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing write operation requests from kibanaserver users is disabled.
2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Auditing will watch <NONE> for write requests.
2021-10-27T20:29:59.491148636Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] .security is used as internal security index.
2021-10-27T20:29:59.491164337Z [2021-10-27T20:29:59,491][INFO ][c.a.o.s.a.i.AuditLogImpl ] Internal index used for posting audit logs is null
2021-10-27T23:03:02.106191331Z java.lang.UnsupportedOperationException
2021-10-27T23:03:02.106454935Z at java.base/java.util.Collections$UnmodifiableMap.put(Collections.java:1457)
2021-10-27T23:03:02.106454935Z at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.PermissionsInfoAction$1.accept(PermissionsInfoAction.java:103)
2021-10-27T23:03:02.106491735Z at com.amazon.opendistroforelasticsearch.security.dlic.rest.api.PermissionsInfoAction$1.accept(PermissionsInfoAction.java:87)
2021-10-27T23:03:02.106491735Z at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:115)
2021-10-27T23:03:02.106491735Z at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityRestFilter$1.handleRequest(OpenDistroSecurityRestFilter.java:94)
2021-10-27T23:03:02.106491735Z at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:240)
2021-10-27T23:03:02.106491735Z at org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:336)
2021-10-27T23:03:02.106491735Z at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:174)
2021-10-27T23:03:02.106491735Z at com.amazon.opendistroforelasticsearch.security.ssl.http.netty.ValidatingDispatcher.dispatchRequest(ValidatingDispatcher.java:76)
2021-10-27T23:03:02.106491735Z at org.elasticsearch.http.netty4.Netty4HttpServerTransport.dispatchRequest(Netty4HttpServerTransport.java:551)
2021-10-27T23:03:02.106505935Z at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:137)
2021-10-27T23:03:02.106505935Z at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
2021-10-27T23:03:02.106516435Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106516435Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106541736Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106541736Z at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68)
2021-10-27T23:03:02.106541736Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106541736Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106541736Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106553236Z at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
2021-10-27T23:03:02.106553236Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106553236Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106564436Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106564436Z at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
2021-10-27T23:03:02.106564436Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106574636Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106574636Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106584636Z at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323)
2021-10-27T23:03:02.106584636Z at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297)
2021-10-27T23:03:02.106595436Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106595436Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106595436Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106605737Z at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
2021-10-27T23:03:02.106615737Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106615737Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106625537Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106625537Z at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
2021-10-27T23:03:02.106625537Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106635637Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106635637Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106648637Z at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1436)
2021-10-27T23:03:02.106648637Z at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203)
2021-10-27T23:03:02.106648637Z at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247)
2021-10-27T23:03:02.106658937Z at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
2021-10-27T23:03:02.106658937Z at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
2021-10-27T23:03:02.106695838Z at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
2021-10-27T23:03:02.106695838Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106695838Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106712638Z at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
2021-10-27T23:03:02.106712638Z at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
2021-10-27T23:03:02.106712638Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
2021-10-27T23:03:02.106712638Z at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
2021-10-27T23:03:02.106723438Z at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
2021-10-27T23:03:02.106723438Z at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
2021-10-27T23:03:02.106733738Z at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656)
2021-10-27T23:03:02.106733738Z at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556)
2021-10-27T23:03:02.106744039Z at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510)
2021-10-27T23:03:02.106744039Z at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470)
2021-10-27T23:03:02.106744039Z at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
2021-10-27T23:03:02.106754139Z at java.base/java.lang.Thread.run(Thread.java:829)
[...]
2021-10-28T06:31:22.034446019Z [2021-10-28T06:31:22,034][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336651] overhead, spent [282ms] collecting in the last [1s]
2021-10-28T06:31:53.958301553Z [2021-10-28T06:31:53,933][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336682] overhead, spent [263ms] collecting in the last [1s]
2021-10-28T06:33:56.385598998Z [2021-10-28T06:33:56,377][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336801] overhead, spent [265ms] collecting in the last [1s]
2021-10-28T06:35:13.681730622Z [2021-10-28T06:35:13,681][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336876] overhead, spent [516ms] collecting in the last [1s]
2021-10-28T06:35:38.848591269Z [2021-10-28T06:35:38,829][WARN ][r.suppressed ] [elasticsearch-cd-a76lica6-2] path: /_prometheus/metrics, params: {}
2021-10-28T06:35:38.848591269Z org.elasticsearch.ElasticsearchException: Search request failed
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.TransportNodePrometheusMetricsAction$AsyncAction$1.onFailure(TransportNodePrometheusMetricsAction.java:160) ~[?:?]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:238) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.AbstractRunnable.onRejection(AbstractRunnable.java:63) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.TimedRunnable.onRejection(TimedRunnable.java:50) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onRejection(ThreadContext.java:741) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:104) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.execute(AbstractSearchAsyncAction.java:311) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.FetchSearchPhase.run(FetchSearchPhase.java:80) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:165) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase.successfulShardExecution(InitialSearchPhase.java:254) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase.onShardResult(InitialSearchPhase.java:242) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$innerOnResponse$0(InitialSearchPhase.java:215) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:187) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
2021-10-28T06:35:38.848591269Z at java.lang.Thread.run(Thread.java:829) ~[?:?]
2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException:
2021-10-28T06:35:38.848591269Z ... 23 more
2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@5ab14b9 on QueueResizingEsThreadPoolExecutor[name = elasticsearch-cd-a76lica6-2/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.8s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@e87b0fa[Running, pool size = 19, active threads = 19, queued tasks = 1124, completed tasks = 23997624]]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z ... 17 more
2021-10-28T06:36:16.719970982Z [2021-10-28T06:36:16,714][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336937] overhead, spent [269ms] collecting in the last [1s]
2021-10-28T06:37:45.756689890Z [2021-10-28T06:37:45,756][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337025] overhead, spent [270ms] collecting in the last [1s]
2021-10-28T06:38:02.045178190Z [2021-10-28T06:38:02,045][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337041] overhead, spent [375ms] collecting in the last [1.1s]
2021-10-28T06:38:21.008915287Z [2021-10-28T06:38:21,003][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337059] overhead, spent [340ms] collecting in the last [1s]
2021-10-28T06:39:04.040390131Z [2021-10-28T06:39:04,040][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337101] overhead, spent [259ms] collecting in the last [1s]
2021-10-28T06:39:11.170761807Z [2021-10-28T06:39:11,170][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337108] overhead, spent [446ms] collecting in the last [1s]
2021-10-28T06:40:11.338977208Z [2021-10-28T06:40:11,330][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337167] overhead, spent [472ms] collecting in the last [1s]
2021-10-28T06:41:09.358445472Z [2021-10-28T06:41:09,358][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337223] overhead, spent [339ms] collecting in the last [1.2s]
2021-10-28T06:43:10.781494464Z [2021-10-28T06:43:10,781][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337340] overhead, spent [477ms] collecting in the last [1s]
2021-10-28T06:44:14.139935462Z [2021-10-28T06:44:14,139][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337402] overhead, spent [442ms] collecting in the last [1s]
2021-10-28T06:45:07.767251526Z [2021-10-28T06:45:07,761][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337454] overhead, spent [251ms] collecting in the last [1s]
2021-10-28T06:45:14.793049841Z [2021-10-28T06:45:14,790][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337461] overhead, spent [354ms] collecting in the last [1s]
2021-10-28T06:45:53.615394345Z [2021-10-28T06:45:53,615][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337499] overhead, spent [268ms] collecting in the last [1s]
2021-10-28T06:46:12.412161308Z [2021-10-28T06:46:12,409][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337517] overhead, spent [353ms] collecting in the last [1.1s]
2021-10-28T06:46:17.420294317Z [2021-10-28T06:46:17,420][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337522] overhead, spent [290ms] collecting in the last [1s]
2021-10-28T06:48:16.348584435Z [2021-10-28T06:48:16,348][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337638] overhead, spent [495ms] collecting in the last [1.2s]
2021-10-28T06:49:18.555172931Z [2021-10-28T06:49:18,554][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337699] overhead, spent [257ms] collecting in the last [1s]
2021-10-28T06:49:46.192533578Z [2021-10-28T06:49:46,192][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337726] overhead, spent [345ms] collecting in the last [1s]
2021-10-28T06:50:21.424803595Z [2021-10-28T06:50:21,424][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337760] overhead, spent [400ms] collecting in the last [1.2s]
2021-10-28T06:51:25.162650797Z [2021-10-28T06:51:25,162][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1337822] overhead, spent [396ms] collecting in the last [1s]
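The final "Caused by" is the interesting part: the search thread pool on elasticsearch-cd-a76lica6-2 is completely saturated (pool size 19, 19 active threads, 1124 tasks queued against a capacity of 1000), so new search tasks get rejected. Assuming the es_util helper shipped in the elasticsearch container is available, the thread-pool state and the hot threads of the busy node can be inspected with something like:
# per-node view of the search thread pool: active threads, queue length, rejections
$ oc exec -n openshift-logging <es-pod> -c elasticsearch -- es_util --query="_cat/thread_pool/search?v&h=node_name,active,queue,rejected"
# stack samples of the busiest threads, to see what is actually burning the CPU
$ oc exec -n openshift-logging <es-pod> -c elasticsearch -- es_util --query="_nodes/hot_threads"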
Based on the error it looked as if not enough resources were available. But it seemed strange that this would happen out of the blue and affect only one Elasticsearch pod.
Anyway, we restarted the Elasticsearch pod to get everything back, and then the interesting effect happened: while the problematic Elasticsearch pod came back and reported ready, we immediately saw another Elasticsearch pod go not ready, report the exact same problem in its logs, and again drive CPU usage up dramatically.
So we ended up restarting one Elasticsearch pod after another, because we kept seeing the effect move around. Only by scaling all Elasticsearch deployments to 0 and bringing them back up were we able to recover the situation (see the sketch below). Since then everything appears to be stable, but this behavior is of course nasty, and we need to understand what is happening here and how to prevent it.
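The scale-down/scale-up was nothing more sophisticated than the following sketch. Because the deployments are operator-managed, the ClusterLogging resource (typically named "instance") may need to be set to Unmanaged first, otherwise the operator can scale them straight back up; the component=elasticsearch label selector is an assumption and may differ per installation:
# stop reconciliation so the operator does not immediately revert the scale-down
$ oc patch clusterlogging instance -n openshift-logging --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
# scale every Elasticsearch deployment down and back up again
$ oc scale deployment -n openshift-logging -l component=elasticsearch --replicas=0
$ oc scale deployment -n openshift-logging -l component=elasticsearch --replicas=1
# hand control back to the operator
$ oc patch clusterlogging instance -n openshift-logging --type merge -p '{"spec":{"managementState":"Managed"}}'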
As mentioned, this happened on two different environments in the past two days, and we are certain that we have enough CPU and memory resources available.
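For completeness, the headroom claim can be checked against actual node utilization and the requests already committed on the nodes, e.g. (node name is a placeholder):
# actual CPU/memory utilization of the worker nodes
$ oc adm top nodes
# requests/limits already allocated on a node hosting an Elasticsearch pod
$ oc describe node <node-name> | grep -A 10 "Allocated resources"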