Practical Guide | Elasticsearch Nested Array Size: All the Approaches in One Place

Posted 2022-03-10 铭毅天下

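1. A real-world Nested problem from production

How do I query all documents where lossStatus="ENABLE" inside objectList (a Nested field) and the objectList array has more than 2 elements?

Source of the question: the 死磕Elasticsearch 知识星球 (Knowledge Planet) community

2. Data model

The index mapping and the bulk-loaded sample data are shown below.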

PUT appweb
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "orderTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "objectList": {
        "type": "nested",
        "properties": {
          "addTime": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss"
          },
          "customerPersonId": {
            "type": "long"
          },
          "lossStatus": {
            "type": "text"
          }
        }
      }
    }
  }
}


POST appweb/_bulk
{"index":{"_id":1}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"}]}
{"index":{"_id":2}}
{"name":"222","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":201,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":202,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":203,"lossStatus":"3333"}]}
{"index":{"_id":3}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"}]}
{"index":{"_id":4}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":103,"lossStatus":"ENABLE"}]}

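Let's get started and work through the approaches one by one.

3. Breaking the problem down

Three core points are involved:

First: the search targets a Nested field.
Second: search condition 1 is lossStatus="ENABLE" under objectList (Nested type).

When writing this query, remember to specify path; otherwise the request fails with an error.

Third: search condition 2 is that the objectList array has more than 2 elements.

The problem therefore reduces to implementing condition 1 and condition 2 together.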
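
To make the target concrete, the combined condition can be checked against the four sample documents in plain Python. This is only a minimal sketch of the expected outcome, not how Elasticsearch evaluates the query: SAMPLE_DOCS and matching_ids are hypothetical names, each document is reduced to the lossStatus values of its objectList entries, and the exact string comparison stands in for the analyzed match_phrase on the text field.

```python
# Each sample document reduced to the lossStatus values of its objectList entries.
SAMPLE_DOCS = {
    1: ["ENABLE", "ENABLE"],
    2: ["2222", "2222", "3333"],
    3: ["ENABLE"],
    4: ["ENABLE", "ENABLE", "ENABLE"],
}


def matching_ids(docs, status="ENABLE", min_size=2):
    """Return ids with at least one entry equal to `status` and more than `min_size` entries."""
    hits = []
    for doc_id, statuses in sorted(docs.items()):
        has_status = any(s == status for s in statuses)  # condition 1
        big_enough = len(statuses) > min_size            # condition 2
        if has_status and big_enough:
            hits.append(doc_id)
    return hits


print(matching_ids(SAMPLE_DOCS))  # [4]
```

Only document 4 satisfies both conditions, so every approach below should return exactly that document.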

3.1 Implementing search condition 1

POST appweb/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "objectList",
            "query": {
              "match_phrase": {
                "objectList.lossStatus": "ENABLE"
              }
            }
          }
        }
      ]
    }
  }
}

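This is standard Nested query syntax and needs little explanation. The one point worth stressing is the use of path.

If you are not familiar with the Nested query syntax, see the official documentation: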

https://www.elastic.co/guide/en/elasticsearch/reference/8.0/query-dsl-nested-query.html

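3.2 Implementing search condition 2

In essence, we need the documents whose objectList array has more than 2 elements. Narrowing it further, we need the size of the objectList array.

So the question becomes: how do we get the size of a Nested array?

There is no ready-made feature for this, so I have put together the following approaches.

Approach 1: function_score query

This approach includes the condition 1 implementation from section 3.1. The full query is as follows.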

POST appweb/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "objectList",
            "query": {
              "match_phrase": {
                "objectList.lossStatus": "ENABLE"
              }
            }
          }
        },
        {
          "function_score": {
            "query": {
              "match_all": {}
            },
            "functions": [
              {
                "script_score": {
                  "script": {
                    "source": "params._source.containsKey('objectList') && params._source['objectList'] != null && params._source.objectList.size() > 2 ? 2 : 0"
                  }
                }
              }
            ],
            "min_score": 1
          }
        }
      ]
    }
  }
}

Note that the script_score script checks several conditions in sequence:

params._source.containsKey('objectList') 
params._source['objectList'] != null
params._source.objectList.size() > 2
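
The three checks and the way min_score uses their result can be mirrored in plain Python as a rough sketch. The script_value and passes_min_score functions are hypothetical stand-ins, and the sketch assumes function_score's default boost_mode of multiply and a match_all score of 1.0.

```python
def script_value(source):
    """Mirror of the script: 2 when objectList exists, is not null, and has more than 2 entries."""
    if (
        "objectList" in source
        and source["objectList"] is not None
        and len(source["objectList"]) > 2
    ):
        return 2
    return 0


def passes_min_score(source, query_score=1.0, min_score=1):
    """Assumes boost_mode multiply: final score = query score * script value."""
    return query_score * script_value(source) >= min_score


print(passes_min_score({"objectList": [{}, {}, {}]}))  # True
print(passes_min_score({"objectList": [{}, {}]}))      # False
print(passes_min_score({"objectList": None}))          # False
print(passes_min_score({"name": "111"}))               # False
```

The first two checks exist only to keep the size() call from failing on documents that lack the field or hold a null value.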

Official syntax references:

https://www.elastic.co/guide/en/elasticsearch/reference/8.0/query-dsl-function-score-query.html

https://www.elastic.co/guide/en/elasticsearch/painless/8.0/painless-score-context.html

Approach 2: function_score query, second variant

POST appweb/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "nested": {
                "path": "objectList",
                "query": {
                  "exists": {
                    "field": "objectList.customerPersonId"
                  }
                },
                "score_mode": "sum"
              }
            },
            {
              "nested": {
                "path": "objectList",
                "query": {
                  "match_phrase": {
                    "objectList.lossStatus": "ENABLE"
                  }
                }
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "_score >= 3 ? 1 : 0"
            }
          }
        }
      ],
      "boost_mode": "replace"
    }
  },
  "min_score": 1
}

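This approach is essentially a workaround: it relies on score_mode sum to add up the scores of the nested objects.

It works on the premise that every nested object containing the field "objectList.customerPersonId" adds to the score, so more elements yield a higher score. The trick is not easy to think of; you tend to stumble on it rather than derive it.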
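
The arithmetic behind this trick can be sketched in Python. The sketch assumes the exists clause scores each matching nested object as a constant 1.0, so that score_mode sum yields the element count, and that the bool query adds up the scores of its must clauses. The phrase_score argument is a hypothetical stand-in for the relevance score of the match_phrase clause; the values used are illustrative and not taken from a real run.

```python
def bool_score(num_objects, phrase_score):
    """Sum of the two must clauses: element count (exists, sum mode) plus the phrase clause score."""
    exists_sum = 1.0 * num_objects
    return exists_sum + phrase_score


def script_value(score):
    """Mirror of the script "_score >= 3 ? 1 : 0"."""
    return 1 if score >= 3 else 0


# With a small phrase score the threshold separates 2 elements from 3.
print(script_value(bool_score(2, 0.4)))  # 0
print(script_value(bool_score(3, 0.4)))  # 1
# A phrase score of 1.0 or more would let a 2-element document through.
print(script_value(bool_score(2, 1.2)))  # 1
```

Because the second must clause also feeds _score, the threshold is only reliable while that clause scores below 1. Moving it into a filter clause, which does not contribute to the score, is one way to remove that dependency.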

Approach 3: runtime_field

POST appweb/_search
{
  "runtime_mappings": {
    "objectList_tmp": {
      "type": "keyword",
      "script": """
        int genre = params['_source']['objectList'].size();
        emit(genre.toString());
      """
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "objectList",
            "query": {
              "match_phrase": {
                "objectList.lossStatus": "ENABLE"
              }
            }
          }
        },
        {
          "range": {
            "objectList_tmp": {
              "gte": 3
            }
          }
        }
      ]
    }
  }
}

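This is what I arrived at after combining an aggregation with a runtime_field; the documents returned match expectations.

In the end I found that the aggregation part was redundant, so I removed it.

Explanation:

First: a new runtime field, objectList_tmp, is added to obtain the size of the Nested array.
Second: it is combined with the existing nested query inside a bool query.

Overall, it is more concise than Approach 4 below. If you need a solution for a production environment without modifying the existing data, this is the approach I recommend.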
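
One detail is worth checking before relying on this query: objectList_tmp is declared as keyword, so the emitted sizes are strings, and a range over strings is compared lexicographically rather than numerically. The Python sketch below only illustrates that difference with plain string and integer comparisons; it assumes keyword range semantics follow string ordering and does not call Elasticsearch. The function names are hypothetical.

```python
def keyword_gte(value, bound):
    """Compare the way strings sort: character by character."""
    return str(value) >= str(bound)


def numeric_gte(value, bound):
    """Compare as numbers."""
    return int(value) >= int(bound)


for size in (2, 3, 9, 10, 12):
    print(size, keyword_gte(size, 3), numeric_gte(size, 3))
# 2 False False
# 3 True True
# 9 True True
# 10 False True
# 12 False True
```

For the sample data (sizes 1 to 3) both comparisons agree. For arrays with ten or more elements, a numeric runtime field type such as long would be the safer choice.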

Approach 4: aggregation

GET appweb/_search
{
  "size": 0,
  "query": {
    "nested": {
      "path": "objectList",
      "query": {
        "match_phrase": {
          "objectList.lossStatus": "ENABLE"
        }
      }
    }
  },
  "aggs": {
    "counts_aggs": {
      "terms": {
        "script": "params['_source']['objectList'].size()"
      },
      "aggs": {
        "top_hits_aggs": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}

Compared with Approach 3, Approach 4 is clumsy, verbose, and more complicated.

It also makes the appeal of runtime_field all the more apparent.
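
What the aggregation returns can be pictured as grouping the matching documents by array size. A minimal Python sketch follows, using the hypothetical names SAMPLE_DOCS and buckets_by_size, with each document reduced to the lossStatus values of its objectList entries:

```python
from collections import defaultdict

# Each sample document reduced to the lossStatus values of its objectList entries.
SAMPLE_DOCS = {
    1: ["ENABLE", "ENABLE"],
    2: ["2222", "2222", "3333"],
    3: ["ENABLE"],
    4: ["ENABLE", "ENABLE", "ENABLE"],
}


def buckets_by_size(docs, status="ENABLE"):
    """Group ids of documents containing `status` by their number of objectList entries."""
    buckets = defaultdict(list)
    for doc_id, statuses in sorted(docs.items()):
        if status in statuses:
            buckets[len(statuses)].append(doc_id)
    return dict(buckets)


print(buckets_by_size(SAMPLE_DOCS))  # {2: [1], 1: [3], 3: [4]}
```

The caller still has to pick out the buckets whose key is greater than 2, which is the extra step that makes the aggregation route heavier.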

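4. A different angle: travel light

What is the idea? An earlier article covered it: trade space for time.

The implementation is as follows:

4.1 Step 1: add a nested_size field during preprocessing.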

PUT _ingest/pipeline/add_nested_size_pipeline
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.nested_size = ctx.objectList.size();"
      }
    }
  ]
}
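
The effect of the pipeline on each incoming document can be sketched in Python. The add_nested_size function is a hypothetical stand-in for the script processor, not an Elasticsearch API. Unlike the one-line Painless script, it also treats a missing or null objectList as size 0, which is an added assumption about how such documents should be handled.

```python
def add_nested_size(doc):
    """Return a copy of `doc` with nested_size set to the length of objectList."""
    enriched = dict(doc)
    object_list = doc.get("objectList") or []  # assumption: missing or null counts as 0
    enriched["nested_size"] = len(object_list)
    return enriched


doc = {"name": "111", "objectList": [{"customerPersonId": 101}, {"customerPersonId": 102}]}
print(add_nested_size(doc)["nested_size"])            # 2
print(add_nested_size({"name": "x"})["nested_size"])  # 0
```

Once the count is stored as an ordinary numeric field at write time, the search side no longer needs any script.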

4.2 Step 2: create the index and load the data.

Create the index and set the pipeline from Step 1 as its default ingest pipeline.

PUT appweb_ext
{
  "settings": {
    "index": {
      "default_pipeline": "add_nested_size_pipeline"
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text"
      },
      "orderTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      "objectList": {
        "type": "nested",
        "properties": {
          "addTime": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss"
          },
          "customerPersonId": {
            "type": "long"
          },
          "lossStatus": {
            "type": "text"
          }
        }
      }
    }
  }
}



POST appweb_ext/_bulk
{"index":{"_id":1}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"}]}
{"index":{"_id":2}}
{"name":"222","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":201,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":202,"lossStatus":"2222"},{"addTime":"2022-02-02 02:02:02","customerPersonId":203,"lossStatus":"3333"}]}
{"index":{"_id":3}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"}]}
{"index":{"_id":4}}
{"name":"111","orderTime":"2022-02-02 02:02:02","objectList":[{"addTime":"2022-02-02 02:02:02","customerPersonId":101,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":102,"lossStatus":"ENABLE"},{"addTime":"2022-02-02 02:02:02","customerPersonId":103,"lossStatus":"ENABLE"}]}

4.3 Step 3: the complex script query becomes a simple query.

A bool query combining one nested query and one range query is all it takes.

POST appweb_ext/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "objectList",
            "query": {
              "match_phrase": {
                "objectList.lossStatus": "ENABLE"
              }
            }
          }
        },
        {
          "range": {
            "nested_size": {
              "gt": 2
            }
          }
        }
      ]
    }
  }
}

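This is the approach I strongly advocate. It asks us to work from the actual business requirements and to put the effort into the design and data modeling stages before data is written, rather than loading data quickly and leaving the rest to complex query scripts.

In real projects, many people say, "The deadline comes first; I can't worry about all that." The retrospective later shows that what looked faster was actually slower, and the conclusion is: "Preprocessing should not and must not be skipped."

5. Summary

These approaches look simple, but getting from a first attempt to this write-up took me more than 6 hours, mainly because Painless scripting has no fixed playbook and requires exploration and repeated verification.

An unexpected gain is Approach 3, a new approach derived from Approach 4 that is flexible and easy to use.

Still, I recommend the space-for-time approach more. Whatever can be handled in preprocessing should not be left to query time.

Feel free to leave a comment with your own approach and thoughts!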
