Elasticsearch Cluster Status Anomalies (Red/Yellow): Root-Cause Analysis

Posted 2022-04-22 努力者Mr李

Note: some of the concept descriptions are taken from online sources.

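I. The three Elasticsearch cluster states:
Green - All data is available; every primary and replica shard is allocated.
Yellow - All data is available and every primary shard is allocated, but some replica shards are not. Queries are unaffected, but redundancy is reduced: if a node fails, some data may be unavailable until that node is repaired.
Red - At least one primary shard is unallocated, so some data is unavailable and queries that touch it are affected.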
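The three states can be read as a simple decision rule. The sketch below is illustrative only: the function name `cluster_status` and its two counters are assumptions made for this example and are not part of any Elasticsearch API, and the rule is simplified to unassigned shards.

```python
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map unassigned-shard counts to the health color described above."""
    if unassigned_primaries > 0:
        return "red"      # some data is unavailable
    if unassigned_replicas > 0:
        return "yellow"   # data is complete, but redundancy is reduced
    return "green"        # every primary and replica is assigned


# The situation analysed in this article: no primaries missing, 3 replicas missing.
print(cluster_status(0, 3))  # yellow
```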

II. Finding out why an index is Yellow
1. Check cluster health, including per-index status

GET /_cluster/health?level=indices

{
  "cluster_name" : "elasticsearch-1",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  # number of active primary shards
  "active_primary_shards" : 28,
  # total number of active primary and replica shards
  "active_shards" : 55,
  # number of shards currently relocating
  "relocating_shards" : 0,
  # number of shards currently initializing
  "initializing_shards" : 0,
  # number of unassigned shards
  "unassigned_shards" : 3,
  # number of shards whose allocation is delayed by the timeout setting
  "delayed_unassigned_shards" : 0,
  # number of cluster-level changes that have not yet been executed
  "number_of_pending_tasks" : 0,
  # number of unfinished (in-flight) fetches
  "number_of_in_flight_fetch" : 0,
  # time in milliseconds that the earliest queued task has been waiting to run
  "task_max_waiting_in_queue_millis" : 0,
  # ratio of active shards in the cluster, as a percentage
  "active_shards_percent_as_number" : 100.0,
  "indices" : {
    "elasticsearch-1" : {
      "status" : "yellow",
      "number_of_shards" : 3,
      "number_of_replicas" : 3,
      "active_primary_shards" : 5,
      "active_shards" : 10,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 3
    }
  }
}
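On a cluster with many indices, it helps to filter the `level=indices` response down to the indices that are not green. The following is a minimal sketch that works on the already-parsed JSON; the helper name `non_green_indices` and the abridged sample (including `other-index`) are hypothetical.

```python
def non_green_indices(health: dict) -> dict:
    """Return {index_name: status} for every index whose status is not green."""
    return {
        name: info["status"]
        for name, info in health.get("indices", {}).items()
        if info.get("status") != "green"
    }


# Hypothetical, abridged response from GET /_cluster/health?level=indices
sample = {
    "status": "yellow",
    "unassigned_shards": 3,
    "indices": {
        "elasticsearch-1": {"status": "yellow", "unassigned_shards": 3},
        "other-index": {"status": "green", "unassigned_shards": 0},
    },
}

print(non_green_indices(sample))  # {'elasticsearch-1': 'yellow'}
```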

2. Check how shards are allocated on each node

GET /_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host      ip        node
    19       86.7kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 master
    18       73.1kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 node-003
    18       67.8kb    36.9gb     95.2gb    132.2gb           27 127.0.0.1 127.0.0.1 node-002
     3                                                                               UNASSIGNED
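# 3 shards are UNASSIGNED. Because the cluster is Yellow rather than Red, all primaries are assigned, so the unassigned shards must be replicas.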

3. Find out why the shards are unassigned

GET /_cluster/allocation/explain?pretty

{
    "index" : "elasticsearch-1",
    "shard" : 3,
    "primary" : false,
    "current_state" : "unassigned",
    "unassigned_info" : {
        "reason" : "CLUSTER_RECOVERED",
        "at" : "2022-04-20T11:01:43.051Z",
        "last_allocation_status" : "no_attempt"
    },
    "can_allocate" : "no",
    # why allocation failed
    "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
    "node_allocation_decisions" : [
        {
            "node_id" : "NfmBH4nSSpGmtf7aPNuvXQ",
            "node_name" : "master",
            "transport_address" : "127.0.0.1:9300",
            "node_decision" : "no",
            "deciders" : [
                {
                    "decider" : "same_shard",
                    "decision" : "NO",
                    "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists"
                }
            ]
        }
    ]
}

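Every node reports the same reason: it already holds a copy of this shard, so another copy cannot be placed on it. With 3 data nodes and number_of_replicas set to 3, each shard needs 4 copies on 4 different nodes, so one replica of each shard can never be assigned.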
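When the explain response lists many nodes, a per-node summary of the deciders that answered NO makes the pattern easier to see. This is a rough sketch over the parsed JSON; `blocking_deciders` is a hypothetical helper, and the sample assumes that node-002 and node-003 report the same decision as master, which the output above does not show in full.

```python
def blocking_deciders(explain: dict) -> dict:
    """Map each node name to the list of deciders that answered NO."""
    result = {}
    for node in explain.get("node_allocation_decisions", []):
        blocked = [
            d["decider"]
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
        if blocked:
            result[node["node_name"]] = blocked
    return result


# Hypothetical, abridged response from GET /_cluster/allocation/explain
sample = {
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": name,
         "deciders": [{"decider": "same_shard", "decision": "NO"}]}
        for name in ("master", "node-002", "node-003")
    ],
}

for node_name, deciders in blocking_deciders(sample).items():
    print(node_name, deciders)
```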
4. List all shards

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
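The `_cat/shards` output is plain text with one shard copy per line, so the unassigned rows can be picked out with a few lines of parsing. The sketch below assumes the exact column order requested above (index, shard, prirep, state, unassigned.reason) and no header row; the function name and the sample rows are hypothetical.

```python
def unassigned_shards(cat_output: str) -> list:
    """Return the rows of `_cat/shards` output whose state is UNASSIGNED."""
    rows = []
    for line in cat_output.strip().splitlines():
        cols = line.split()
        # Assigned shards have no unassigned.reason, so they have only 4 columns.
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            rows.append({
                "index": cols[0],
                "shard": int(cols[1]),
                "prirep": cols[2],  # "p" = primary, "r" = replica
                "reason": cols[4] if len(cols) > 4 else "",
            })
    return rows


# Hypothetical sample output, not taken from the cluster in this article
sample = """\
elasticsearch-1 0 p STARTED
elasticsearch-1 0 r STARTED
elasticsearch-1 0 r UNASSIGNED CLUSTER_RECOVERED
"""

for row in unassigned_shards(sample):
    print(row)
```

If every returned row has `prirep` equal to `r`, only replicas are missing and the cluster is Yellow; any `p` row means a primary is missing and the cluster is Red.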

5. Reduce the number of replicas for the index

PUT /elasticsearch-1/_settings

{
    "number_of_replicas": 2
}
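The value 2 follows from simple arithmetic: two copies of the same shard cannot share a node, so a shard can have at most as many copies (one primary plus its replicas) as there are data nodes. A minimal sketch of that calculation follows. It considers only the same-node rule and ignores other allocation constraints such as disk watermarks or allocation filters; both function names are made up for this example.

```python
def unassignable_copies(data_nodes: int, primaries: int, replicas: int) -> int:
    """Shard copies that can never be assigned under the same-node rule."""
    copies_per_shard = 1 + replicas  # one primary plus its replicas
    missing_per_shard = max(0, copies_per_shard - data_nodes)
    return primaries * missing_per_shard


def max_assignable_replicas(data_nodes: int) -> int:
    """Largest number_of_replicas that can be fully assigned."""
    return max(0, data_nodes - 1)


# The cluster in this article: 3 data nodes, 3 primary shards, 3 replicas.
print(unassignable_copies(3, 3, 3))  # 3, matching unassigned_shards
print(max_assignable_replicas(3))    # 2, the value set above
```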

6. Check the health again after the change

GET /_cluster/health?level=indices
  "unassigned_shards" : 0

III. Summary (Red and Yellow)
When a cluster turns Red or Yellow, work through the following levels:

Cluster level: curl -s 172.31.30.28:9200/_cat/nodes, or GET /_cluster/health
Index level: GET /_cluster/health?pretty&level=indices
Shard level: GET /_cluster/health?pretty&level=shards
Recovery progress: GET /_recovery?pretty
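The shard-level response nests a `shards` object under each index, keyed by shard number, and each entry carries its own `status`. The sketch below walks that structure to list the shards that are not green; `problem_shards` and the sample data are hypothetical, and the sample is heavily abridged.

```python
def problem_shards(health: dict) -> list:
    """Return sorted (index, shard_id, status) tuples for non-green shards."""
    found = []
    for index_name, index in health.get("indices", {}).items():
        for shard_id, shard in index.get("shards", {}).items():
            if shard.get("status") != "green":
                found.append((index_name, int(shard_id), shard["status"]))
    return sorted(found)


# Hypothetical, abridged response from GET /_cluster/health?level=shards
sample = {
    "status": "yellow",
    "indices": {
        "elasticsearch-1": {
            "status": "yellow",
            "shards": {
                "0": {"status": "yellow", "primary_active": True, "unassigned_shards": 1},
                "1": {"status": "green", "primary_active": True, "unassigned_shards": 0},
            },
        }
    },
}

print(problem_shards(sample))  # [('elasticsearch-1', 0, 'yellow')]
```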

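1. Troubleshooting unassigned shards:

Diagnose first: GET /_cluster/allocation/explain
Then try to reallocate: POST /_cluster/reroute
If the shards still cannot be allocated, rebuild the index:
1.1. Create a backup index: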
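curl -XPUT 'http://xxxx:9200/a_index_copy/' -H 'Content-Type: application/json' -d '{"settings": {"index": {"number_of_shards": 3, "number_of_replicas": 1}}}'
1.2. Copy the data from a_index to a_index_copy with the reindex API:
POST _reindex {"source": {"index": "a_index"}, "dest": {"index": "a_index_copy", "op_type": "create"}}
1.3. Delete the a_index index. This must be done first, because an alias cannot have the same name as an existing index: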
curl -XDELETE 'http://xxxx:9200/a_index'
1.4. Add the alias a_index to a_index_copy:
curl -XPOST 'http://xxxx:9200/_aliases' -H 'Content-Type: application/json' -d '{"actions": [{"add": {"index": "a_index_copy", "alias": "a_index"}}]}'
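Because the order of the four rebuild steps matters, it can be useful to generate the requests in one place and review them before sending anything. The following sketch only builds the method, path and JSON body for each step and does not contact a cluster; `rebuild_plan` is a hypothetical helper, and the index names and settings are the ones used in the steps above.

```python
import json


def rebuild_plan(index: str, copy: str, shards: int, replicas: int) -> list:
    """Return the rebuild steps as (method, path, body) tuples, in order."""
    return [
        # 1.1 create the backup index
        ("PUT", f"/{copy}", {"settings": {"index": {
            "number_of_shards": shards, "number_of_replicas": replicas}}}),
        # 1.2 copy the documents into it
        ("POST", "/_reindex", {
            "source": {"index": index},
            "dest": {"index": copy, "op_type": "create"}}),
        # 1.3 delete the original so that its name is free for the alias
        ("DELETE", f"/{index}", None),
        # 1.4 point the old name at the copy
        ("POST", "/_aliases", {"actions": [
            {"add": {"index": copy, "alias": index}}]}),
    ]


for method, path, body in rebuild_plan("a_index", "a_index_copy", 3, 1):
    print(method, path, json.dumps(body) if body is not None else "")
```

Before running step 1.3 against a real cluster, it is worth confirming that the reindex in step 1.2 finished and that the document counts of both indices match, since the delete cannot be undone.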
