Elasticsearch Aggregation Analysis
Posted by liuhmmjj
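What Elasticsearch aggregations are
Aggregations provide summarized data based on a search query. They are built from simple building blocks, themselves called aggregations, which can be composed to produce complex summaries of the data.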
The basic syntax is as follows:
"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : { [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
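To make the nesting concrete, here is a small Python sketch that builds a request body following this structure and serializes it to JSON. The aggregation names (by_make, avg_price) and the meta content are illustrative assumptions; the field names are taken from the sample data used in this article.

```python
import json

# Hypothetical aggregation names; "make.keyword" and "price" are fields of the sample data.
body = {
    "size": 0,
    "aggs": {                                    # "aggs" is the short form of "aggregations"
        "by_make": {                             # <aggregation_name>
            "terms": {"field": "make.keyword"},  # <aggregation_type> : <aggregation_body>
            "meta": {"note": "illustrative"},    # optional metadata attached to this aggregation
            "aggs": {                            # optional sub-aggregations
                "avg_price": {"avg": {"field": "price"}}
            },
        }
    },
}

print(json.dumps(body, indent=2))
```

Each named aggregation holds exactly one aggregation type, and sub-aggregations sit next to that type, not inside it.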
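Elasticsearch aggregation categories
Elasticsearch groups aggregations into four main categories:
- Bucket: bucketing aggregations, similar to GROUP BY in SQL
- Metric: metric aggregations, such as computing the maximum, minimum, or average
- Pipeline: pipeline aggregations, which run further analysis on the results of other aggregations
- Matrix: matrix aggregations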
First, prepare some data:
POST /cars/transactions/_bulk
{ "index": {} }
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {} }
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {} }
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {} }
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {} }
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {} }
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {} }
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {} }
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
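Metric aggregations
Metric aggregations fall into two groups, single-value and multi-value:
- Single-value: outputs a single result
  - min, max, avg, sum
  - cardinality
- Multi-value: outputs several results
  - stats, extended stats
  - percentile, percentile rank
  - top hits
min, max, avg, sum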
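Example ("size": 0 suppresses the document list, so only the aggregation results are returned):
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "price_max": { "max": { "field": "price" } },
    "price_min": { "min": { "field": "price" } },
    "avg_price": { "avg": { "field": "price" } },
    "sum_price": { "sum": { "field": "price" } }
  }
}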
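cardinality
Cardinality is the size of a set, that is, the number of distinct values. It is similar to COUNT(DISTINCT ...) in SQL.
Example ("size": 0 suppresses the document list):
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "count_of_make": {
      "cardinality": { "field": "make.keyword" }
    }
  }
}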
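stats, extended stats
stats: returns a set of statistics for a numeric field: min, max, avg, sum, and count.
extended stats: an extension of stats that adds further statistics, such as variance and standard deviation.
Example:
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "stats_price": {
      "stats": { "field": "price" }
    }
  }
}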
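Percentile, Percentile Rank
Percentile: percentile statistics, that is, the value below which a given percentage of the observed values fall.
Percentile Rank: the inverse, that is, the percentage of observed values that fall below a given value.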
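The article gives no request for these two aggregations, so here is a hedged sketch. request_body shows how the percentiles and percentile_ranks aggregation types could be requested (the aggregation names are made up); the two helper functions compute exact values locally using the nearest-rank method and a "less than or equal" rank. Elasticsearch uses approximation algorithms and interpolation, so its numbers may differ from these.

```python
import math

prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

# Body for a _search request; "price_percentiles" and "price_ranks" are hypothetical names.
request_body = {
    "size": 0,
    "aggs": {
        "price_percentiles": {"percentiles": {"field": "price", "percents": [50, 95]}},
        "price_ranks": {"percentile_ranks": {"field": "price", "values": [20000]}},
    },
}


def percentile_nearest_rank(values, percent):
    """Smallest value such that at least `percent` percent of the values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]


def percentile_rank(values, threshold):
    """Percentage of values that are less than or equal to `threshold`."""
    return 100 * sum(v <= threshold for v in values) / len(values)


p50 = percentile_nearest_rank(prices, 50)
rank_20000 = percentile_rank(prices, 20000)
print(p50, rank_20000)
```

With the sample data, the nearest-rank median is 20000, and 62.5% of the prices are at or below 20000.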
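Top Hits
Top Hits is typically used after bucketing to fetch the top matching documents inside each bucket, that is, the detail records.
For example, group by car make and take the two highest-priced transactions in each group (the request names the bucket aggregation group_by_color, although it groups on make.keyword):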
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": { "field": "make.keyword" },
      "aggs": {
        "top_data": {
          "top_hits": {
            "size": 2,
            "_source": [ "price", "color", "make" ],
            "sort": [ { "price": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}
Result:
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "honda",
          "doc_count" : 3,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 3, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "js_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "red", "price" : 20000, "make" : "honda" },
                  "sort" : [ 20000 ]
                },
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "ks_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "red", "price" : 20000, "make" : "honda" },
                  "sort" : [ 20000 ]
                }
              ]
            }
          }
        },
        {
          "key" : "ford",
          "doc_count" : 2,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 2, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "j8_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "green", "price" : 30000, "make" : "ford" },
                  "sort" : [ 30000 ]
                },
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "lM_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "blue", "price" : 25000, "make" : "ford" },
                  "sort" : [ 25000 ]
                }
              ]
            }
          }
        },
        {
          "key" : "toyota",
          "doc_count" : 2,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 2, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "kM_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "blue", "price" : 15000, "make" : "toyota" },
                  "sort" : [ 15000 ]
                },
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "kc_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "green", "price" : 12000, "make" : "toyota" },
                  "sort" : [ 12000 ]
                }
              ]
            }
          }
        },
        {
          "key" : "bmw",
          "doc_count" : 1,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 1, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "k8_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "red", "price" : 80000, "make" : "bmw" },
                  "sort" : [ 80000 ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
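The top hits result above can be reproduced locally by grouping the sample rows by make and keeping the two most expensive rows per group. This is a plain-Python sketch of the semantics only; the function name top_hits_by_make is made up for illustration.

```python
from collections import defaultdict

# (make, color, price) for the eight sample transactions.
rows = [
    ("honda", "red", 10000), ("honda", "red", 20000), ("ford", "green", 30000),
    ("toyota", "blue", 15000), ("toyota", "green", 12000), ("honda", "red", 20000),
    ("bmw", "red", 80000), ("ford", "blue", 25000),
]


def top_hits_by_make(rows, size=2):
    """Group rows by make and keep the `size` highest-priced rows of each group."""
    groups = defaultdict(list)
    for make, color, price in rows:
        groups[make].append({"make": make, "color": color, "price": price})
    return {
        make: sorted(docs, key=lambda d: d["price"], reverse=True)[:size]
        for make, docs in groups.items()
    }


top_data = top_hits_by_make(rows)
for make, hits in top_data.items():
    print(make, [hit["price"] for hit in hits])
```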
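Bucketing aggregations
Bucket aggregations build logical groups of documents from the search results: documents that satisfy a given rule are placed in the same bucket, and each bucket is associated with a key.
This is comparable to GROUP BY in MySQL.
The simplest bucketing strategy is terms, which buckets documents directly by term. If the field is of type text, the buckets are built from the analyzed tokens.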
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": { "terms": { "field": "color.keyword" } },
    "group_by_make": { "terms": { "field": "make.keyword" } }
  }
}
Note: without the .keyword suffix, the request fails with an error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [color] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    ...
  }
}
Since Elasticsearch 5.x, a text field can carry a built-in keyword sub-field (for example color.keyword), which supports exact-match queries and aggregations.
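To see what the two terms aggregations return for the sample data, the following sketch counts the values locally. The ordering (document count descending, then key ascending for ties) is an assumption about the default terms ordering; it matches the bucket order seen in the responses in this article.

```python
from collections import Counter

# "color" and "make" values of the eight sample transactions, in insertion order.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]
makes = ["honda", "honda", "ford", "toyota", "toyota", "honda", "bmw", "ford"]


def terms_buckets(values):
    """Return (key, doc_count) pairs, largest count first, ties broken by key."""
    return sorted(Counter(values).items(), key=lambda kv: (-kv[1], kv[0]))


group_by_color = terms_buckets(colors)
group_by_make = terms_buckets(makes)
print(group_by_color)
print(group_by_make)
```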
Range, Date Range
Range: buckets are defined by specifying numeric ranges.
Date Range: buckets are defined by specifying date ranges.
Example:
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "range_price": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 20000 },
          { "from": 20000, "to": 30000 },
          { "from": 50000 }
        ]
      }
    }
  }
}
Result:
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "range_price" : {
      "buckets" : [
        { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 3 },
        { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 3 },
        { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 1 }
      ]
    }
  }
}
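The counts above follow from the rule that each range includes its from value and excludes its to value. The sketch below applies that rule to the sample prices; it also explains why the 30000 ford is counted nowhere, since it falls in the gap between 30000 and 50000 that the request leaves uncovered.

```python
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]

# Same ranges as in the request: a missing bound means "unbounded".
ranges = [{"to": 20000}, {"from": 20000, "to": 30000}, {"from": 50000}]


def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = []
    for r in ranges:
        low = r.get("from", float("-inf"))
        high = r.get("to", float("inf"))
        doc_count = sum(low <= v < high for v in values)
        buckets.append({"from": r.get("from"), "to": r.get("to"), "doc_count": doc_count})
    return buckets


range_price = range_buckets(prices, ranges)
print(range_price)
```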
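Histogram, Date Histogram
Histogram: splits the data into buckets of a fixed interval.
Date Histogram: a histogram (bar chart) over dates, and one of the most commonly used aggregations in time-series analysis.
Example: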
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "hist_price": {
      "histogram": {
        "field": "price",
        "interval": 20000,
        "extended_bounds": { "min": 10000, "max": 80000 }
      }
    }
  }
}
Result:
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "hist_price" : {
      "buckets" : [
        { "key" : 0.0, "doc_count" : 3 },
        { "key" : 20000.0, "doc_count" : 4 },
        { "key" : 40000.0, "doc_count" : 0 },
        { "key" : 60000.0, "doc_count" : 0 },
        { "key" : 80000.0, "doc_count" : 1 }
      ]
    }
  }
}
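Each value lands in the bucket whose key is the value rounded down to a multiple of the interval, which is why 10000 ends up under key 0 even though extended_bounds starts at 10000. The sketch below reproduces the bucket keys and counts for the sample prices; the function names are illustrative, and extended_bounds is modeled simply as pre-creating empty buckets across the requested range.

```python
import math

prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def bucket_key(value, interval):
    """Key of the fixed-width bucket that contains `value`."""
    return math.floor(value / interval) * interval


def histogram(values, interval, bounds_min, bounds_max):
    """Count values per bucket, forcing empty buckets between the given bounds."""
    counts = {}
    key = bucket_key(bounds_min, interval)
    while key <= bucket_key(bounds_max, interval):
        counts[key] = 0                      # models extended_bounds: bucket exists even if empty
        key += interval
    for value in values:
        k = bucket_key(value, interval)
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


hist_price = histogram(prices, 20000, 10000, 80000)
print(hist_price)
```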
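Bucket + Metric aggregations
A bucket aggregation can carry sub-aggregations that analyze each bucket further. A sub-aggregation can itself be either a bucket or a metric aggregation, which is what makes Elasticsearch aggregations so powerful.
(1) Bucketing within buckets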
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_make": {
      "terms": { "field": "make.keyword" },
      "aggs": {
        "range_price": {
          "range": {
            "field": "price",
            "ranges": [
              { "to": 20000 },
              { "from": 20000, "to": 30000 },
              { "from": 50000 }
            ]
          }
        }
      }
    }
  }
}
Result:
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_make" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "honda",
          "doc_count" : 3,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 1 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 2 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 0 }
            ]
          }
        },
        {
          "key" : "ford",
          "doc_count" : 2,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 0 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 1 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 0 }
            ]
          }
        },
        {
          "key" : "toyota",
          "doc_count" : 2,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 2 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 0 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 0 }
            ]
          }
        },
        {
          "key" : "bmw",
          "doc_count" : 1,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 0 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 0 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 1 }
            ]
          }
        }
      ]
    }
  }
}
(2) Computing metrics within buckets
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_make": {
      "terms": { "field": "make.keyword" },
      "aggs": {
        "stats_price": {
          "stats": { "field": "price" }
        }
      }
    }
  }
}
Result:
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_make" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "honda",
          "doc_count" : 3,
          "stats_price" : { "count" : 3, "min" : 10000.0, "max" : 20000.0, "avg" : 16666.666666666668, "sum" : 50000.0 }
        },
        {
          "key" : "ford",
          "doc_count" : 2,
          "stats_price" : { "count" : 2, "min" : 25000.0, "max" : 30000.0, "avg" : 27500.0, "sum" : 55000.0 }
        },
        {
          "key" : "toyota",
          "doc_count" : 2,
          "stats_price" : { "count" : 2, "min" : 12000.0, "max" : 15000.0, "avg" : 13500.0, "sum" : 27000.0 }
        },
        {
          "key" : "bmw",
          "doc_count" : 1,
          "stats_price" : { "count" : 1, "min" : 80000.0, "max" : 80000.0, "avg" : 80000.0, "sum" : 80000.0 }
        }
      ]
    }
  }
}
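The per-make statistics above are what you get by first grouping the sample transactions by make and then computing stats inside each group. A plain-Python sketch of that two-step logic, with an illustrative function name:

```python
from collections import defaultdict

# (make, price) for the eight sample transactions.
sales = [
    ("honda", 10000), ("honda", 20000), ("ford", 30000), ("toyota", 15000),
    ("toyota", 12000), ("honda", 20000), ("bmw", 80000), ("ford", 25000),
]


def stats_by_make(sales):
    """Bucket prices by make, then compute count/min/max/avg/sum per bucket."""
    grouped = defaultdict(list)
    for make, price in sales:
        grouped[make].append(price)
    return {
        make: {
            "count": len(p),
            "min": min(p),
            "max": max(p),
            "avg": sum(p) / len(p),
            "sum": sum(p),
        }
        for make, p in grouped.items()
    }


stats_price = stats_by_make(sales)
for make, stats in stats_price.items():
    print(make, stats)
```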
Sorting in aggregations
Group by make, then sort the buckets by average price in descending order:
GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_make": {
      "terms": {
        "field": "make.keyword",
        "order": { "avg_price": "desc" }
      },
      "aggs": {
        "avg_price": {
          "avg": { "field": "price" }
        }
      }
    }
  }
}
Result:
#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_make" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "bmw", "doc_count" : 1, "avg_price" : { "value" : 80000.0 } },
        { "key" : "ford", "doc_count" : 2, "avg_price" : { "value" : 27500.0 } },
        { "key" : "honda", "doc_count" : 3, "avg_price" : { "value" : 16666.666666666668 } },
        { "key" : "toyota", "doc_count" : 2, "avg_price" : { "value" : 13500.0 } }
      ]
    }
  }
}
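Ordering buckets by a sub-aggregation means the metric is computed per bucket first and the buckets are then sorted by that value. The sketch below mirrors this for the sample data: average price per make, sorted in descending order.

```python
from collections import defaultdict

# (make, price) for the eight sample transactions.
sales = [
    ("honda", 10000), ("honda", 20000), ("ford", 30000), ("toyota", 15000),
    ("toyota", 12000), ("honda", 20000), ("bmw", 80000), ("ford", 25000),
]

# Step 1: bucket by make.
grouped = defaultdict(list)
for make, price in sales:
    grouped[make].append(price)

# Step 2: compute the avg_price metric per bucket.
avg_price = {make: sum(p) / len(p) for make, p in grouped.items()}

# Step 3: order the buckets by that metric, highest first.
ordered = sorted(avg_price.items(), key=lambda kv: kv[1], reverse=True)

for make, avg in ordered:
    print(make, avg)
```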