ElasticSearch group by multiple fields
【Posted】: 2013-08-29 06:31:33 【Question】: The only close thing I have found is: Multiple group-by in Elasticsearch
Basically, I am trying to get the ES equivalent of the following MySQL query:
select gender, age_range, count(distinct profile_id) as count
FROM TABLE group by age_range, gender
Age and gender on their own were easy to get:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "ages": {
      "terms": {
        "field": "age_range",
        "size": 20
      }
    },
    "gender_by_age": {
      "terms": {
        "fields": [
          "age_range",
          "gender"
        ]
      }
    }
  },
  "size": 0
}
which gives:
{
  "ages": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      { "term": 0, "count": 162643 },
      { "term": 3, "count": 10683 },
      { "term": 4, "count": 8931 },
      { "term": 5, "count": 4690 },
      { "term": 6, "count": 3647 },
      { "term": 2, "count": 3247 },
      { "term": 1, "count": 120 }
    ]
  },
  "total_gender": {
    "_type": "terms",
    "missing": 0,
    "total": 193961,
    "other": 0,
    "terms": [
      { "term": 1, "count": 94799 },
      { "term": 2, "count": 62645 },
      { "term": 0, "count": 36517 }
    ]
  }
}
But now I need something that looks like this:
[breakdown_gender] => Array
(
[1] => Array
(
[0] => 264
[1] => 1
[2] => 6
[3] => 67
[4] => 72
[5] => 40
[6] => 23
)
[2] => Array
(
[0] => 153
[2] => 2
[3] => 21
[4] => 35
[5] => 22
[6] => 11
)
)
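Note that 0,1,2,3,4,5,6 are "mappings" for the age ranges, so they actually mean something :) and are not just numbers. For example, gender[1] ("male") breaks down into age range [0] ("under 18") with a count of 246.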
【Comments】:
When I use curl I get: { "error" : { "root_cause" : [ { "type" : "parsing_exception", "reason" : "Unknown key for a START_OBJECT in [facets].", "line" : 6, "col" : 13 } ], "type" : "parsing_exception", "reason" : "Unknown key for a START_OBJECT in [facets].", "line" : 6, "col" : 13 }, "status" : 400 }
【Answer 1】: Starting with version 1.0 of ElasticSearch, the new aggregations API allows grouping by multiple fields using sub-aggregations. Suppose you want to group by the fields field1, field2 and field3:
{
  "aggs": {
    "agg1": {
      "terms": {
        "field": "field1"
      },
      "aggs": {
        "agg2": {
          "terms": {
            "field": "field2"
          },
          "aggs": {
            "agg3": {
              "terms": {
                "field": "field3"
              }
            }
          }
        }
      }
    }
  }
}
Of course, this nesting can go on for as many fields as you like.
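The SQL in the question asks for count(distinct profile_id) rather than a plain document count. One way to approximate that is to attach a cardinality aggregation at the innermost level of the nested terms aggregations. The sketch below is not part of the original answer: the helper name build_group_by_aggs and the generated aggregation name distinct_profile_id are made up for illustration, and the field names are taken from the question. Keep in mind that cardinality returns an approximate distinct count.

```python
import json


def build_group_by_aggs(fields, distinct_field=None):
    """Build a nested terms aggregation body, one level per field.

    Hypothetical helper: if distinct_field is given, a cardinality
    aggregation on that field is attached to the innermost level.
    """
    if not fields:
        raise ValueError("at least one field is required")

    # Start from the innermost level and wrap outwards.
    inner = None
    if distinct_field is not None:
        inner = {
            "distinct_" + distinct_field: {
                "cardinality": {"field": distinct_field}
            }
        }

    for field in reversed(fields):
        level = {"terms": {"field": field}}
        if inner is not None:
            level["aggs"] = inner
        inner = {field: level}

    # size 0 suppresses the hits; only the aggregations are of interest.
    return {"size": 0, "aggs": inner}


body = build_group_by_aggs(["gender", "age_range"], distinct_field="profile_id")
print(json.dumps(body, indent=2))
```

The resulting dictionary can be passed as the request body of a search call. Each innermost bucket then carries both its doc_count and the approximate number of distinct profile_id values.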
Update: For completeness, here is what the output of the above query looks like. Further below is Python code that generates the aggregation query and flattens the result into a list of dictionaries.
{
  "aggregations": {
    "agg1": {
      "buckets": [
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": {
            "buckets": [
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    { "doc_count": <count>, "key": <value of field3> },
                    { "doc_count": <count>, "key": <value of field3> },
                    ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    { "doc_count": <count>, "key": <value of field3> },
                    { "doc_count": <count>, "key": <value of field3> },
                    ...
                  ]
                }
              },
              ...
            ]
          }
        },
        {
          "doc_count": <count>,
          "key": <value of field1>,
          "agg2": {
            "buckets": [
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    { "doc_count": <count>, "key": <value of field3> },
                    { "doc_count": <count>, "key": <value of field3> },
                    ...
                  ]
                }
              },
              {
                "doc_count": <count>,
                "key": <value of field2>,
                "agg3": {
                  "buckets": [
                    { "doc_count": <count>, "key": <value of field3> },
                    { "doc_count": <count>, "key": <value of field3> },
                    ...
                  ]
                }
              },
              ...
            ]
          }
        },
        ...
      ]
    }
  }
}
The following Python code performs the group-by for a given list of fields. If you specify include_missing=True, it also includes combinations of values where some of the fields are missing (you do not need this if you have version 2.0 of Elasticsearch, thanks to this).
def group_by(es, fields, include_missing):
    # Build one nested "terms" aggregation per field, outermost first.
    current_level_terms = {'terms': {'field': fields[0]}}
    agg_spec = {fields[0]: current_level_terms}

    if include_missing:
        # A parallel "missing" aggregation catches documents without the field.
        current_level_missing = {'missing': {'field': fields[0]}}
        agg_spec[fields[0] + '_missing'] = current_level_missing

    for field in fields[1:]:
        next_level_terms = {'terms': {'field': field}}
        current_level_terms['aggs'] = {
            field: next_level_terms,
        }

        if include_missing:
            next_level_missing = {'missing': {'field': field}}
            current_level_terms['aggs'][field + '_missing'] = next_level_missing
            current_level_missing['aggs'] = {
                field: next_level_terms,
                field + '_missing': next_level_missing,
            }
            current_level_missing = next_level_missing

        current_level_terms = next_level_terms

    agg_result = es.search(body={'aggs': agg_spec})['aggregations']
    return get_docs_from_agg_result(agg_result, fields, include_missing)


def get_docs_from_agg_result(agg_result, fields, include_missing):
    # Recursively flatten the nested buckets into a list of dictionaries.
    current_field = fields[0]
    buckets = agg_result[current_field]['buckets']
    if include_missing:
        # The "missing" result has no 'key', so bucket.get('key') yields None.
        buckets.append(agg_result[(current_field + '_missing')])

    if len(fields) == 1:
        return [
            {
                current_field: bucket.get('key'),
                'doc_count': bucket['doc_count'],
            }
            for bucket in buckets if bucket['doc_count'] > 0
        ]

    result = []
    for bucket in buckets:
        records = get_docs_from_agg_result(bucket, fields[1:], include_missing)
        value = bucket.get('key')
        for record in records:
            record[current_field] = value
        result.extend(records)

    return result
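The flat list returned by group_by can be reshaped into the nested breakdown the question asks for. The following is a minimal sketch and not part of the original answer: pivot_counts is a hypothetical helper, and the sample records are hand-written in the shape that group_by(es, ['gender', 'age_range'], False) would return, with counts copied from the question's desired output.

```python
from collections import defaultdict


def pivot_counts(records, outer_field, inner_field, count_key="doc_count"):
    """Turn flat group-by records into {outer_value: {inner_value: count}}."""
    breakdown = defaultdict(dict)
    for record in records:
        breakdown[record[outer_field]][record[inner_field]] = record[count_key]
    return dict(breakdown)


# Hand-written sample records (illustrative subset of the question's numbers).
records = [
    {"age_range": 0, "doc_count": 264, "gender": 1},
    {"age_range": 3, "doc_count": 67, "gender": 1},
    {"age_range": 0, "doc_count": 153, "gender": 2},
]

breakdown_gender = pivot_counts(records, "gender", "age_range")
print(breakdown_gender)  # {1: {0: 264, 3: 67}, 2: {0: 153}}
```

With include_missing=True, records for documents that lack a field carry None as the value, so None can show up as a key in the result.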
【Comments】:
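I am getting an error like Unrecognized token "my fields value". How can I fix this?
Using sub-aggregations on large data and then reshaping the response into a two-column table in code can take quite a long time. Is there any other way?
@HappyCoder - can you add more details about the problem you are facing? For example, what query are you using?
@MakanTayebi - may I ask which programming language you are using?
I am coding in PHP. I could handle this particular task with a C module, but of course I would prefer that elasticsearch do it itself.
【Answer 2】: As you only have 2 fields, a simple way is to run two queries with a single facet each. For Male:
{
  "query": {
    "term": { "gender": "Male" }
  },
  "facets": {
    "age_range": {
      "terms": {
        "field": "age_range"
      }
    }
  }
}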
And for Female:
{
  "query": {
    "term": { "gender": "Female" }
  },
  "facets": {
    "age_range": {
      "terms": {
        "field": "age_range"
      }
    }
  }
}
Or you can do it in a single query with a facet filter (see this link for further information):
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "age_range_male": {
      "terms": {
        "field": "age_range"
      },
      "facet_filter": {
        "term": {
          "gender": "Male"
        }
      }
    },
    "age_range_female": {
      "terms": {
        "field": "age_range"
      },
      "facet_filter": {
        "term": {
          "gender": "Female"
        }
      }
    }
  }
}
Update:
As facets are about to be removed, here is the solution with aggregations:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "male": {
      "filter": {
        "term": {
          "gender": "Male"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    },
    "female": {
      "filter": {
        "term": {
          "gender": "Female"
        }
      },
      "aggs": {
        "age_range": {
          "terms": {
            "field": "age_range"
          }
        }
      }
    }
  }
}
【Comments】:
As of Wednesday, 28 October 2015, the elasticsearch website states: "Facets are deprecated and will be removed in a future release. You are encouraged to migrate to aggregations instead."
Can I use a date_histogram as one of the aggregations?