Notes on an ES Incident

Posted by bohu83

Judging from the alerts, the business side was reporting interface timeouts, and at the same time the ES error log kept showing:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@6fbaf20b on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1f159058[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 23765969104]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50) ~[elasticsearch-5.2.2.jar:5.2.2]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) ~[?:1.8.0_71]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) ~[?:1.8.0_71]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.doExecute(EsThreadPoolExecutor.java:94) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:89) ~[elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1445) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1329) [elasticsearch-5.2.2.jar:5.2.2]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-5.2.2.jar:5.2.2]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) ~[netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) ~[netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341) ~[netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) ~[netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.7.Final.jar:4.1.7.Final]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_71]

Routine first response:

We restarted ES and the ES client services, but it did not hold for long before the errors started again.

Monitoring showed cluster load at 70% (it is normally below 10%), and the cluster status was red, meaning some primary shards were unavailable.
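A quick way to confirm this kind of state, beyond the monitoring dashboard, is the cluster health API. A minimal sketch in Python (the host is a placeholder for your own cluster address, and requests is assumed to be installed):

import requests

ES = "http://localhost:9200"  # placeholder: point at your cluster

# Cluster health: status "red" means at least one primary shard is unassigned.
health = requests.get(ES + "/_cluster/health").json()
print("status:", health["status"])
print("unassigned shards:", health["unassigned_shards"])
print("active shards:", health["active_shards_percent_as_number"], "%")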

As an emergency degradation step, we excluded the large shards holding several hundred GB of historical data, after which the cluster recovered.

The cluster runs ES 5.2 and the client is a TransportClient used essentially as a singleton, so client configuration was ruled out as the cause.

Reflections:

We still never pinned down which business indices, and which concrete query statements, were behind the high system load. Our grasp of the underlying layer is not deep enough.
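One way to close that gap next time is the tasks API, which lists in-flight operations together with their descriptions. A sketch under the same placeholder host (the field names follow the standard _tasks response):

import requests

ES = "http://localhost:9200"  # placeholder: point at your cluster

# List in-flight search tasks; detailed=true adds a human-readable
# description of each query, which is what ties load to a concrete index.
resp = requests.get(ES + "/_tasks",
                    params={"actions": "*search*", "detailed": "true"}).json()
for node in resp["nodes"].values():
    for task in node["tasks"].values():
        secs = task["running_time_in_nanos"] / 1e9
        if secs > 1.0:  # only searches running longer than a second
            print(round(secs, 1), "s", task.get("description", "")[:200])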

Supplementary knowledge:

1. Why you should not casually tune ES thread pool parameters

Under heavy concurrent query load, once traffic exceeds what a single Elasticsearch instance in the cluster can process, the server triggers a protective mechanism and rejects further work rather than queueing it without bound. The pool sizes are tied to the hardware, specifically the number of CPU cores, so cranking a pool up to several hundred threads would likely just crash the system.
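The numbers in the stack trace above are consistent with the default formula quoted from the docs below: a fixed search pool of int((cores * 3) / 2) + 1 threads and a queue of 1000. A quick sanity check, assuming the incident node had 32 cores (an assumption, inferred from the log):

# Default search pool size per the docs quoted below:
# int((allocated processors * 3) / 2) + 1, with a fixed queue of 1000.
cores = 32  # assumption: the incident node had 32 cores
pool_size = int(cores * 3 / 2) + 1
print(pool_size)  # 49 -- matches "pool size = 49" in the rejection log

# A request is rejected only once all 49 threads are busy AND the
# 1000-slot queue is full: exactly the "active threads = 49,
# queued tasks = 1000" state captured in the stack trace.

In other words, the rejection is the node's last line of defense; more threads would not add CPU capacity, only context switching and heap pressure.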

The core pools to watch (a monitoring sketch follows this list):

index: mainly indexing and delete operations
search: mainly get, count/stats, and search operations
bulk: mainly bulk operations against indices
refresh: mainly refresh operations
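A convenient way to watch these pools is the _cat/thread_pool API; a sketch that prints active/queue/rejected counts for the pools above (pool names vary across ES versions, e.g. index and bulk in 5.x were later merged into write):

import requests

ES = "http://localhost:9200"  # placeholder: point at your cluster

# One row per node per pool; a steadily climbing "rejected" counter on
# the search pool is the symptom behind EsRejectedExecutionException.
resp = requests.get(ES + "/_cat/thread_pool/index,search,bulk,refresh",
                    params={"v": "true",
                            "h": "node_name,name,active,queue,rejected"})
print(resp.text)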

The official docs describe the pools as follows (note this passage is from a release newer than the 5.2 cluster in this incident, which is why it lists pools such as write instead of the older index and bulk pools):

A node uses several thread pools to manage memory consumption. Queues associated with many of the thread pools enable pending requests to be held instead of discarded.

There are several thread pools, but the important ones include:

generic

For generic operations (for example, background node discovery). Thread pool type is scaling.

search

For count/search/suggest operations. Thread pool type is fixed with a size of int((# of allocated processors * 3) / 2) + 1, and queue_size of 1000.

search_throttled

For count/search/suggest/get operations on search_throttled indices. Thread pool type is fixed with a size of 1, and queue_size of 100.

search_coordination

For lightweight search-related coordination operations. Thread pool type is fixed with a size of a max of min(5, (# of allocated processors) / 2), and queue_size of 1000.

get

For get operations. Thread pool type is fixed with a size of # of allocated processors, queue_size of 1000.

analyze

For analyze requests. Thread pool type is fixed with a size of 1, queue size of 16.

write

For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of allocated processors, queue_size of 10000. The maximum size for this pool is 1 + # of allocated processors.

snapshot

For snapshot/restore operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of allocated processors) / 2).

snapshot_meta

For snapshot repository metadata read operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(50, (# of allocated processors* 3)).

warmer

For segment warm-up operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of allocated processors) / 2).

refresh

For refresh operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(10, (# of allocated processors) / 2).

fetch_shard_started

For listing shard states. Thread pool type is scaling with keep-alive of 5m and a default maximum size of 2 * # of allocated processors.

fetch_shard_store

For listing shard stores. Thread pool type is scaling with keep-alive of 5m and a default maximum size of 2 * # of allocated processors.

flush

For flush and translog fsync operations. Thread pool type is scaling with a keep-alive of 5m and a default maximum size of min(5, (# of allocated processors) / 2).

force_merge

For force merge operations. Thread pool type is fixed with a size of 1 and an unbounded queue size.

management

For cluster management. Thread pool type is scaling with a keep-alive of 5m and a default maximum size of 5.

system_read

For read operations on system indices. Thread pool type is fixed with a default maximum size of min(5, (# of allocated processors) / 2).

system_write

For write operations on system indices. Thread pool type is fixed with a default maximum size of min(5, (# of allocated processors) / 2).

system_critical_read

For critical read operations on system indices. Thread pool type is fixed with a default maximum size of min(5, (# of allocated processors) / 2).

system_critical_write

For critical write operations on system indices. Thread pool type is fixed with a default maximum size of min(5, (# of allocated processors) / 2).

watcher

For watch executions. Thread pool type is fixed with a default maximum size of min(5 * (# of allocated processors), 50) and queue_size of 1000.
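To see what these formulas actually resolve to on your own nodes, the node info API reports each pool's effective type, size, and queue_size; a minimal sketch under the same placeholder host:

import requests

ES = "http://localhost:9200"  # placeholder: point at your cluster

# Effective thread pool configuration per node, i.e. what the size
# formulas above evaluate to for that node's allocated processors.
info = requests.get(ES + "/_nodes/thread_pool").json()
for node in info["nodes"].values():
    print(node["name"])
    for pool, cfg in sorted(node["thread_pool"].items()):
        print("  ", pool, cfg)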

 
