Query all unique values of a field with Elasticsearch

Posted 2023-02-23

[Published]: 2013-01-06 03:20:05

[Question]:

How do I search for all unique values of a given field with Elasticsearch?

I mean a query like select full_name from authors, so that I can show the list to users on a form.

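[Answer 1]:

You can create a terms facet on the "full_name" field. To do this properly, though, you need to make sure the field is not tokenized at index time; otherwise every entry in the facet will be a separate term taken from the field content. You will most likely need to configure the field as "not_analyzed" in your mapping. If you also search on it and still want it tokenized, you can use a multi field to index it in two different ways.

Also keep in mind that, depending on the number of unique terms in the full_name field, this operation can be expensive and require quite a lot of memory.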
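
As a rough illustration of the "index it two ways" idea, here is a minimal sketch of such a mapping written as a Python dict. It assumes the typeless mapping format of Elasticsearch 7 and later, where a `keyword` sub-field plays the role that a "not_analyzed" string played in the versions this answer was written for; the helper name `multi_field_mapping` is made up for this example.

```python
import json


def multi_field_mapping(field: str) -> dict:
    """Build a typeless (7.x-style) mapping: analyzed text plus an exact keyword sub-field."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",  # tokenized, used for full-text search
                    "fields": {
                        # stored verbatim, used for aggregations
                        "keyword": {"type": "keyword"}
                    },
                }
            }
        }
    }


mapping = multi_field_mapping("full_name")
print(json.dumps(mapping, indent=2))
```

With a mapping like this, searches would target full_name while aggregations would target full_name.keyword.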

[Answer 2]:

For Elasticsearch 1.0 and later, you can use a terms aggregation to do this.

Query DSL:

{
  "aggs": {
    "NAME": {
      "terms": {
        "field": "",
        "size": 10
      }
    }
  }
}

A real example:

{
  "aggs": {
    "full_name": {
      "terms": {
        "field": "authors",
        "size": 0
      }
    }
  }
}

You then get all unique values of the authors field. size=0 means the number of terms is not limited (this requires Elasticsearch 1.1.0 or later).

Response:

{
    ...

    "aggregations" : {
        "full_name" : {
            "buckets" : [
                {
                    "key" : "Ken",
                    "doc_count" : 10
                },
                {
                    "key" : "Jim Gray",
                    "doc_count" : 10
                }
            ]
        }
    }
}

See Elasticsearch terms aggregations.
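
To turn a response of that shape into a plain list for a form, something like the following sketch could work. It only assumes the response has been parsed into a Python dict; the function name `unique_terms` is invented for this example, and the sample data mirrors the response shown above.

```python
def unique_terms(response: dict, agg_name: str) -> list:
    """Return the bucket keys of a terms aggregation, in the order returned."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    return [bucket["key"] for bucket in buckets]


# Parsed form of the example response above.
sample_response = {
    "aggregations": {
        "full_name": {
            "buckets": [
                {"key": "Ken", "doc_count": 10},
                {"key": "Jim Gray", "doc_count": 10},
            ]
        }
    }
}

names = unique_terms(sample_response, "full_name")
print(names)  # ['Ken', 'Jim Gray']
```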

[Comments]:

What does full_name mean?

@neustart47 full_name is just the name of the aggregation.

[Answer 3]:

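The existing answers did not work for me in Elasticsearch 5.X, for the following reasons:

1. I needed to tokenize my input at index time.
2. "size": 0 failed to parse because "[size] must be greater than 0."
3. "Fielddata is disabled on text fields by default." This means that, by default, you cannot aggregate on a text field such as full_name. An unanalyzed keyword field, however, can be used for aggregations.

Solution 1: Use the Scroll API. It works by keeping a search context open and issuing multiple requests, each of which returns the next batch of results. If you are using Python, the elasticsearch module has a scan() helper function that handles the scrolling for you and returns all results.

Solution 2: Use the Search After API. It is similar to Scroll, but provides a live cursor instead of keeping a search context, so it is more efficient for real-time requests.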
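
Both solutions return documents rather than distinct values, so the de-duplication has to happen on the client. A minimal sketch of that step, assuming hits shaped like the dicts that `elasticsearch.helpers.scan()` yields (each with a `_source` key); the function name `distinct_field_values` and the stand-in data are made up for illustration, and no cluster is contacted here.

```python
from typing import Iterable, Iterator


def distinct_field_values(hits: Iterable[dict], field: str) -> Iterator[str]:
    """Yield each distinct value of `field` once, in first-seen order."""
    seen = set()
    for hit in hits:
        value = hit.get("_source", {}).get(field)
        if value is not None and value not in seen:
            seen.add(value)
            yield value


# Stand-in for the hits a scroll would yield. With a live cluster this could be
# elasticsearch.helpers.scan(client, index="authors", query={"query": {"match_all": {}}})
fake_hits = [
    {"_source": {"full_name": "Ken"}},
    {"_source": {"full_name": "Jim Gray"}},
    {"_source": {"full_name": "Ken"}},
]

distinct_names = list(distinct_field_values(fake_hits, "full_name"))
print(distinct_names)  # ['Ken', 'Jim Gray']
```

Note that the set of seen values lives in client memory, so this trades cluster-side cost for client-side cost.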

[Comments]:

I'm not sure this solves the "size": 0 problem, since the default I see in the docs is 10...

[Answer 4]:

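This works for Elasticsearch 5.2.2:

curl -XGET 'http://localhost:9200/articles/_search?pretty' -d '
{
    "aggs" : {
        "whatever" : {
            "terms" : { "field" : "yourfield", "size": 10000 }
        }
    },
    "size" : 0
}'

"size": 10000 inside the terms aggregation means that (at most) 10000 unique values are returned. Without it, only 10 values are returned even if you have more than 10 unique values.

The top-level "size": 0 means that "hits" in the result will not contain any documents. By default 10 documents are returned, and we do not need them.

Reference: bucket terms aggregation

Also note that, according to this page, facets were replaced by aggregations in Elasticsearch 1.0; aggregations are a superset of facets.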
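
For readers who prefer Python to curl, the same request could be assembled with the standard library roughly as follows. This is a sketch, not the answer's own method: it uses POST instead of GET (the _search endpoint accepts both) and adds a Content-Type header, which Elasticsearch 6.0 and later require for requests with a body. The helper name `build_unique_values_request` is hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_unique_values_request(
    base_url: str, index: str, field: str, max_terms: int = 10000
) -> urllib.request.Request:
    """Build (but do not send) a search request that only returns term buckets."""
    body = {
        "size": 0,  # no documents in "hits"
        "aggs": {
            "whatever": {"terms": {"field": field, "size": max_terms}}
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_unique_values_request("http://localhost:9200", "articles", "yourfield")
# To actually send it against a running cluster: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```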

[Answer 5]:

Intuition: in SQL terms,

Select distinct full_name from authors;

is equivalent to

Select full_name from authors group by full_name;

So we can use the grouping/aggregation syntax in Elasticsearch to find the distinct entries.
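
The SQL equivalence can be checked quickly with an in-memory SQLite database. This is only a sanity check of the intuition, with made-up rows; it says nothing about Elasticsearch itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authors (full_name TEXT)")
conn.executemany(
    "INSERT INTO authors (full_name) VALUES (?)",
    [("Brian Kernighan",), ("Charles Dickens",), ("Brian Kernighan",)],
)

distinct_rows = conn.execute("SELECT DISTINCT full_name FROM authors").fetchall()
grouped_rows = conn.execute("SELECT full_name FROM authors GROUP BY full_name").fetchall()

# Row order is not guaranteed without ORDER BY, so compare sorted lists.
same = sorted(distinct_rows) == sorted(grouped_rows)
print(same)  # True
conn.close()
```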

Suppose the documents stored in Elasticsearch have the following structure:

[
  {
    "author": "Brian Kernighan"
  },
  {
    "author": "Charles Dickens"
  }
]

What does not work: a plain aggregation

{
  "aggs": {
    "full_name": {
      "terms": {
        "field": "author"
      }
    }
  }
}

I got the following error:

{
  "error": {
    "root_cause": [
      {
        "reason": "Fielddata is disabled on text fields by default...",
        "type": "illegal_argument_exception"
      }
    ]
  }
}

What works like a charm: appending .keyword to the field

{
  "aggs": {
    "full_name": {
      "terms": {
        "field": "author.keyword"
      }
    }
  }
}

Sample output could be:

{
  "aggregations": {
    "full_name": {
      "buckets": [
        {
          "doc_count": 372,
          "key": "Charles Dickens"
        },
        {
          "doc_count": 283,
          "key": "Brian Kernighan"
        }
      ],
      "doc_count": 1000
    }
  }
}

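Whether .keyword exists depends on the mapping: with default dynamic mapping in Elasticsearch 5.x and later, strings are indexed as text with a keyword sub-field, but a custom mapping may differ. The sketch below picks an aggregatable field name from the "properties" part of a mapping; the function name `aggregatable_field` is hypothetical, and the sample mapping is what dynamic mapping would typically produce for the author field.

```python
def aggregatable_field(properties: dict, field: str) -> str:
    """Return a field name usable in a terms aggregation, or raise ValueError."""
    spec = properties[field]
    if spec.get("type") != "text":
        return field  # keyword, numeric, date, ... can be aggregated directly
    # For text fields, look for a keyword sub-field such as "<field>.keyword".
    for sub_name, sub_spec in spec.get("fields", {}).items():
        if sub_spec.get("type") == "keyword":
            return f"{field}.{sub_name}"
    raise ValueError(f"{field!r} is a text field without a keyword sub-field")


properties = {
    "author": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }
}

agg_field = aggregatable_field(properties, "author")
print(agg_field)  # author.keyword
```
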
Bonus tip:

Let us assume the field in question is nested as follows:

[
  {
    "authors": [
      {
        "details": [
          {
            "name": "Brian Kernighan"
          }
        ]
      }
    ]
  },
  {
    "authors": [
      {
        "details": [
          {
            "name": "Charles Dickens"
          }
        ]
      }
    ]
  }
]

Now the correct query becomes:

{
  "aggregations": {
    "full_name": {
      "aggregations": {
        "author_details": {
          "terms": {
            "field": "authors.details.name"
          }
        }
      },
      "nested": {
        "path": "authors.details"
      }
    }
  },
  "size": 0
}

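With a nested aggregation the buckets sit one level deeper in the response, under the inner aggregation's name. A small sketch of building the body above and reading the result follows; it assumes the relevant path is mapped with the nested type, the helper names are invented, and the sample response is illustrative data rather than real cluster output.

```python
def nested_terms_body(path: str, field: str,
                      outer: str = "full_name", inner: str = "author_details") -> dict:
    """Build a nested aggregation that wraps a terms aggregation on `field`."""
    return {
        "size": 0,  # no documents in "hits"
        "aggregations": {
            outer: {
                "nested": {"path": path},
                "aggregations": {inner: {"terms": {"field": field}}},
            }
        },
    }


def nested_terms_keys(response: dict,
                      outer: str = "full_name", inner: str = "author_details") -> list:
    """Extract bucket keys from the inner terms aggregation of a nested aggregation."""
    return [b["key"] for b in response["aggregations"][outer][inner]["buckets"]]


body = nested_terms_body("authors.details", "authors.details.name")

# Illustrative response shape only; the counts are invented.
sample_response = {
    "aggregations": {
        "full_name": {
            "doc_count": 2,
            "author_details": {
                "buckets": [
                    {"key": "Brian Kernighan", "doc_count": 1},
                    {"key": "Charles Dickens", "doc_count": 1},
                ]
            },
        }
    }
}

nested_names = nested_terms_keys(sample_response)
print(nested_names)  # ['Brian Kernighan', 'Charles Dickens']
```
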