ElasticSearch group by multiple fields

【Title】: ElasticSearch group by multiple fields
【Posted】: 2013-08-29 06:31:33
【Question】:

The only close thing I have found is: Multiple group-by in Elasticsearch

Basically, I'm trying to get the ES equivalent of the following MySQL query:

select gender, age_range, count(distinct profile_id) as count 
FROM TABLE group by age_range, gender

Age and gender by themselves were easy to get:

{
  "query": {
    "match_all": {}
  },
  "facets": {
    "ages": {
      "terms": {
        "field": "age_range",
        "size": 20
      }
    },
    "gender_by_age": {
      "terms": {
        "fields": [
          "age_range",
          "gender"
        ]
      }
    }
  },
  "size": 0
}

which gives:

{
  "ages": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      {
        "term": 0,
        "count": 162643
      },
      {
        "term": 3,
        "count": 10683
      },
      {
        "term": 4,
        "count": 8931
      },
      {
        "term": 5,
        "count": 4690
      },
      {
        "term": 6,
        "count": 3647
      },
      {
        "term": 2,
        "count": 3247
      },
      {
        "term": 1,
        "count": 120
      }
    ]
  },
  "total_gender": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      {
        "term": 1,
        "count": 94799
      },
      {
        "term": 2,
        "count": 62645
      },
      {
        "term": 0,
        "count": 36517
      }
    ]
  }
}

But now I need something that looks like this:

[breakdown_gender] => Array
    (
        [1] => Array
            (
                [0] => 264
                [1] => 1
                [2] => 6
                [3] => 67
                [4] => 72
                [5] => 40
                [6] => 23
            )

        [2] => Array
            (
                [0] => 153
                [2] => 2
                [3] => 21
                [4] => 35
                [5] => 22
                [6] => 11
            )

    )

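Note that 0,1,2,3,4,5,6 are "mappings" for the age ranges, so they actually mean something :) and are not just numbers. For example, gender[1] ("male") breaks down into age range [0] ("under 18") with a count of 246.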

【Comments】:

When I use curl, I get: { "error" : { "root_cause" : [ { "type" : "parsing_exception", "reason" : "Unknown key for a START_OBJECT in [facets].", "line" : 6, "col" : 13 } ], "type" : "parsing_exception", "reason" : "Unknown key for a START_OBJECT in [facets].", "line" : 6, "col" : 13 }, "status" : 400 }

【Answer 1】:

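Starting with version 1.0 of ElasticSearch, the new aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2, and field3:

{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }
        }
      }
    }
  }
}
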
Of course, this can go on for as many fields as you like.
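
As a rough sketch of how this nesting could be generated instead of written by hand, the Python below builds the same kind of request body from a list of fields. The helper name `nested_terms_aggs` and the aggregation name `distinct_count` are made up for illustration; `gender`, `age_range`, and `profile_id` are the field names from the question. The optional `cardinality` sub-aggregation is one possible way to approximate the `count(distinct profile_id)` from the SQL, assuming your Elasticsearch version supports it and that an approximate distinct count is acceptable.

```python
import json


def nested_terms_aggs(fields, distinct_field=None):
    """Build a request body with one terms aggregation nested per field.

    `fields` is ordered outermost first. Each aggregation is named after
    its field. If `distinct_field` is given, a cardinality aggregation
    (hypothetical name: "distinct_count") is placed at the innermost level.
    """
    aggs = {}
    if distinct_field is not None:
        aggs = {"distinct_count": {"cardinality": {"field": distinct_field}}}

    # Wrap from the innermost field outwards.
    for field in reversed(fields):
        level = {"terms": {"field": field}}
        if aggs:
            level["aggs"] = aggs
        aggs = {field: level}

    # "size": 0 suppresses the hits, as in the question's own query.
    return {"size": 0, "aggs": aggs}


body = nested_terms_aggs(["gender", "age_range"], distinct_field="profile_id")
print(json.dumps(body, indent=2))
```

Only the body is built here; sending it is left to whichever client you already use.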

Update: For completeness, here is what the output of the above query looks like. Further below is Python code that generates the aggregation query and flattens the result into a list of dictionaries.

{
  "aggregations": {
    "agg1": {
      "buckets": [
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": {
            "buckets": [
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    }, ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    }, ...
                  ]
                }
              }, ...
            ]
          }
        },
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": {
            "buckets": [
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    }, ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    },
                    {
                      "doc_count": <count>,
                      "key": <value of field3>
                    }, ...
                  ]
                }
              }, ...
            ]
          }
        }, ...
      ]
    }
  }
}

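The following Python code performs the group-by given a list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you don't need it if you have version 2.0 of Elasticsearch, thanks to this)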

def group_by(es, fields, include_missing):
    """Group by `fields` (outermost first) and return a flat list of dicts.

    `es` is an Elasticsearch client. Each aggregation is named after its field.
    """
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        # A sibling "missing" aggregation collects documents lacking the field.
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            # The same sub-aggregations also hang under the "missing" branch.
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    """Recursively flatten nested buckets into one dict per value combination."""
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        # The "missing" result has no 'key', so its value becomes None below.
        buckets.append(agg_result[(current_field + '_missing')])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result
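
If a nested table like the `breakdown_gender` array in the question is preferred over a flat list, the two-level response could be pivoted along these lines. This is only a sketch: the function name `pivot` is invented, the aggregations are assumed to be named after their fields (as `group_by` above does), and the sample response uses made-up numbers rather than real output.

```python
def pivot(aggregations, outer, inner):
    """Turn a two-level terms result into {outer_key: {inner_key: doc_count}}."""
    table = {}
    for outer_bucket in aggregations[outer]["buckets"]:
        row = {}
        for inner_bucket in outer_bucket[inner]["buckets"]:
            row[inner_bucket["key"]] = inner_bucket["doc_count"]
        table[outer_bucket["key"]] = row
    return table


# Hand-written fragment shaped like an "aggregations" section.
# The counts are invented purely for illustration.
sample = {
    "gender": {
        "buckets": [
            {
                "key": 1,
                "doc_count": 10,
                "age_range": {
                    "buckets": [
                        {"key": 0, "doc_count": 7},
                        {"key": 3, "doc_count": 3},
                    ]
                },
            },
            {
                "key": 2,
                "doc_count": 4,
                "age_range": {"buckets": [{"key": 0, "doc_count": 4}]},
            },
        ]
    }
}

breakdown = pivot(sample, "gender", "age_range")
print(breakdown)  # {1: {0: 7, 3: 3}, 2: {0: 4}}
```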

【Comments】:

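I am getting an error like Unrecognized token "my fields value". How can I fix this?

Using sub-aggregations on big data and then converting the response into a two-column table with simple code can take quite a long time. Is there any other way?

@HappyCoder - can you add more details about the problem you are running into? For example, what query are you using?

@MakanTayebi - may I ask which programming language you are using?

I'm coding in PHP. I could handle this particular task with a C module, but of course I'd prefer Elasticsearch to do it itself.

【Answer 2】: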

Since you only have 2 fields, a simple way is to run two queries with a single facet each. For male:

{
    "query" : {
      "term" : { "gender" : "Male" }
    },
    "facets" : {
        "age_range" : {
            "terms" : {
                "field" : "age_range"
            }
        }
    }
}

And for female:

{
    "query" : {
      "term" : { "gender" : "Female" }
    },
    "facets" : {
        "age_range" : {
            "terms" : {
                "field" : "age_range"
            }
        }
    }
}

Or you can do it in a single query with a facet filter (see this link for further information)

{
    "query" : {
       "match_all": {}
    },
    "facets" : {
        "age_range_male" : {
            "terms" : {
                "field" : "age_range"
            },
            "facet_filter": {
                "term": {
                    "gender": "Male"
                }
            }
        },
        "age_range_female" : {
            "terms" : {
                "field" : "age_range"
            },
            "facet_filter": {
                "term": {
                    "gender": "Female"
                }
            }
        }
    }
}

Update:

Since facets are about to be removed, here is the solution with aggregations:

{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "male": {
      "filter": {
        "term": {
          "gender": "Male"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    },
    "female": {
      "filter": {
        "term": {
          "gender": "Female"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    }
  }
}
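
With more than two values to filter on, writing one filter aggregation per value by hand gets repetitive. The sketch below generates the same request shape as the aggregation query above from a list of values. The function name `filter_aggs` and the choice to name each aggregation after the lowercased value are assumptions made for illustration only.

```python
import json


def filter_aggs(filter_field, values, group_field):
    """Build one filter aggregation per value, each with a terms sub-aggregation."""
    aggs = {}
    for value in values:
        # Aggregation name: the lowercased value, e.g. "Male" -> "male".
        aggs[str(value).lower()] = {
            "filter": {"term": {filter_field: value}},
            "aggs": {group_field: {"terms": {"field": group_field}}},
        }
    return {"query": {"match_all": {}}, "aggs": aggs}


body = filter_aggs("gender", ["Male", "Female"], "age_range")
print(json.dumps(body, indent=2))
```

For the two values in this answer, the generated body matches the hand-written query field for field.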
【Comments】:

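As of Wednesday, 28 October 2015, the official Elasticsearch website states: "Facets are deprecated and will be removed in a future release. You are encouraged to migrate to aggregations instead."

Can I use a date_histogram as one of the aggregations?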
