An Elasticsearch index shows status red: Unassigned Shards

The problem: an Elasticsearch index shows red status
Error: Unassigned Shards: 4

1.1.1. Check the cluster status

GET /_cluster/health?pretty

The result looks something like this:

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 426,
  "active_shards" : 851,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

As shown above, the cluster status is red and unassigned_shards is 4. The red status is caused by one or more indices that have unassigned shards.
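To see directly which indices are responsible for the red status, the same health API can also return a per-index breakdown via the level parameter (level=shards goes one step further down to individual shards):

GET /_cluster/health?level=indices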

1.1.2. Check the index status

GET /_cat/indices?v

The result looks something like this:

health status index                               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_test_operation_to              pyz9euqTQQ6GF0ulPsnX4g   3   1          1            0      9.2kb          4.6kb
green  open   cba1                                r5fvWeeAQ7uxQMRtNJxuwA   5   1          1            0     10.2kb          5.1kb
green  open   positiveinfo                        97r4mnToS1OVx04QVzF5Rw   3   1       3091            7    993.4kb        496.7kb
green  open   dc_rep_pub_issue_output_month       lLfxHpsZR8GPecqMLTvLsg   5   1     311845           87    163.3mb         81.5mb
       close  emplyee_test                        bsVDbqFWS0uYekpFI4Wnng                                                          
green  open   .monitoring-kibana-6-2021.05.27     cqlsx2crQyuc0WtSc_74zw   1   1       2711            0      1.8mb        959.1kb
green  open   filtertableinfo                     KGoc6kxqRtuZxPHG7Z6oXw   3   1         67            1      171kb         85.5kb
red    open   sg_house_rent_info_prod             fAVmV5aqTROVbHjqw0GRKg   5   1   60313716     16540955     19.7gb         10.2gb

Looking for the row whose health is red, we can pinpoint the problem index: sg_house_rent_info_prod.
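Alternatively, on reasonably recent versions (5.x/6.x) _cat/indices accepts a health filter, so that only red indices are returned:

GET /_cat/indices?v&health=red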

1.1.3. Check how many shards each node holds and how much disk space they use

The number of shards allocated to each node, and the disk space they occupy, can be seen with GET _cat/allocation?v:

GET _cat/allocation?v

The result looks something like this:

shards disk.indices disk.used disk.avail disk.total disk.percent host    ip            node
   284       88.2gb     218gb      2.7tb      2.9tb            7 hadoop1 xxx.xxx.xxx.xxx hadoop1
   284      104.7gb   248.9gb      2.7tb      2.9tb            8 hadoop3 xxx.xxx.xxx.xxx hadoop3
   283       96.6gb   234.6gb      2.8tb        3tb            7 hadoop2 xxx.xxx.xxx.xxx hadoop2
   4                                                                                     UNASSIGNED                                                                                                            

We can see that 4 shards are in the UNASSIGNED state.

Cluster health can also be checked with GET /_cat/health?v. When everything is healthy, the output looks like this:

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1622101651 07:47:31  elasticsearch green           3         3    851 426    0    0        0             0                  -                100.0%

1.1.4. How do we fix it?

First, pinpoint exactly which shards are unassigned:

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason

and look for the rows whose state is UNASSIGNED. (The "| grep UNASSIGNED" filter that is often appended to this command only works when the request is sent with curl from a shell; the Kibana console cannot pipe output to grep.)
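For reference, a shell version of the same check might look like the following, assuming Elasticsearch is reachable at localhost:9200 without authentication (adjust the host and credentials to your environment):

# print only the unassigned shards together with the reason they are unassigned
curl -s 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED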

Then the concrete reason can be inspected with:

GET _cluster/allocation/explain?pretty
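By default this API explains an arbitrary unassigned shard. If several shards are unassigned and you want to look at one in particular, a request body can be supplied; a minimal sketch using the index and shard from this incident:

GET _cluster/allocation/explain
{
  "index": "sg_house_rent_info_prod",
  "shard": 2,
  "primary": true
}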

In my case the result was:

{
  "index" : "sg_house_rent_info_prod",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-05-24T20:47:04.790Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [w__xIKWBT5KJZg1CEcmFGA]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[sg_house_rent_info_prod][2]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "BOIgtPqgQSyIfAICLDuEfQ",
      "node_name" : "hadoop1",
      "transport_address" : "xxx.xxx.xxx.xxx:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269924302848",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "AxRuU1gfS3yimqXUd7SoJw"
      }
    },
    {
      "node_id" : "eGv9Jjs_S8GcNLKzkCxzMA",
      "node_name" : "hadoop2",
      "transport_address" : "xxx.xxx.xxx.xxx:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269924302848",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "wjR9jkfjQ-28OBKl_xFi1A",
        "store_exception" : {
          "type" : "file_not_found_exception",
          "reason" : "no segments* file found in SimpleFSDirectory@/home/admin/es/esdata/nodes/0/indices/fAVmV5aqTROVbHjqw0GRKg/2/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@473b450: files: [write.lock]"
        }
      }
    },
    {
      "node_id" : "w__xIKWBT5KJZg1CEcmFGA",
      "node_name" : "hadoop3",
      "transport_address" : "xxx.xxx.xxx.xxx:9300",
      "node_attributes" : {
        "ml.machine_memory" : "269924302848",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "PQWOPdxnQDqVfQbRLgh32A",
        "store_exception" : {
          "type" : "shard_lock_obtain_failed_exception",
          "reason" : "[sg_house_rent_info_prod][2]: obtaining shard lock timed out after 5000ms",
          "index_uuid" : "fAVmV5aqTROVbHjqw0GRKg",
          "shard" : "2",
          "index" : "sg_house_rent_info_prod"
        }
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-05-24T20:47:04.790Z], failed_attempts[5], delayed=false, details[failed shard on node [w__xIKWBT5KJZg1CEcmFGA]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[sg_house_rent_info_prod][2]: obtaining shard lock timed out after 5000ms]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}

Searching the web for the error message "failed to create shard, failure IOException[failed to obtain in-memory shard lock]" leads to the following fix.

Solution:
Run the following command in Kibana:

POST /_cluster/reroute?retry_failed=true

retry_failed: (Optional, Boolean) If true, retries the allocation of shards that are blocked because too many subsequent allocation attempts have failed.
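From a shell, the same call might look like this (again assuming localhost:9200 and no authentication):

curl -s -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

Afterwards, re-run GET /_cluster/health?pretty (or GET /_cat/health?v); once recovery finishes, unassigned_shards should drop back to 0 and the cluster status should return to green.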

Appendix: common reasons why ES leaves shards unassigned (the command sketched after this list shows which reason applies to each shard):

1) INDEX_CREATED: unassigned as a result of the create-index API.
2) CLUSTER_RECOVERED: unassigned as a result of a full cluster recovery.
3) INDEX_REOPENED: unassigned as a result of opening or closing an index.
4) DANGLING_INDEX_IMPORTED: unassigned as a result of importing a dangling index.
5) NEW_INDEX_RESTORED: unassigned as a result of restoring a snapshot into a new index.
6) EXISTING_INDEX_RESTORED: unassigned as a result of restoring a snapshot into a closed index.
7) REPLICA_ADDED: unassigned as a result of explicitly adding a replica shard.
8) ALLOCATION_FAILED: unassigned because shard allocation failed.
9) NODE_LEFT: unassigned because the node hosting the shard left the cluster.
10) REINITIALIZED: unassigned because the shard moved from started back to initializing (for example, with shadow replicas).
11) REROUTE_CANCELLED: allocation was cancelled as a result of an explicit cancel-reroute command.
12) REALLOCATED_REPLICA: a better replica location was identified, so the existing replica allocation was cancelled and the shard became unassigned.
Another good post on this topic: "Elasticsearch 集群和索引健康状态及常见错误说明" (Elasticsearch cluster and index health status and common errors explained).
