Elasticsearch Aggregation Analysis

Posted liuhmmjj

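Elasticsearch Aggregation Definition

Aggregations return summarized data computed over the documents matched by a search query. They are built from simple building blocks, themselves called aggregations, which can be combined to produce complex summaries of the data.
The basic syntax is: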

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

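Elasticsearch Aggregation Categories

Elasticsearch divides aggregations into four main categories:

  • Bucket: groups documents into buckets, similar to GROUP BY in SQL
  • Metric: computes metrics such as the maximum, minimum, and average
  • Pipeline: runs a further analysis on the results of other aggregations
  • Matrix: matrix analysis over multiple fields

First, prepare some data: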

POST /cars/transactions/_bulk
{ "index": {} }
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {} }
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {} }
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {} }
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {} }
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {} }
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {} }
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {} }
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

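Metric Aggregations

Metric aggregations fall into two groups, single-value and multi-value:

  • Single-value: outputs one result
      min, max, avg, sum
      cardinality
  • Multi-value: outputs several results
      stats, extended stats
      percentile, percentile rank
      top hits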
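
To make the single-value metrics concrete, the following is a minimal local sketch in Python of what min, max, avg, sum and cardinality compute over the eight sample transactions. It does not talk to Elasticsearch; the function names single_value_metrics and cardinality are made up for illustration. Note that this cardinality is an exact distinct count, whereas the Elasticsearch cardinality aggregation is approximate on large data sets.

```python
# Local illustration only: prices and makes of the eight sample transactions.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]
makes = ["honda", "honda", "ford", "toyota", "toyota", "honda", "bmw", "ford"]


def single_value_metrics(values):
    """Return the four basic single-value metrics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def cardinality(values):
    """Exact distinct count (hypothetical helper, not the Elasticsearch algorithm)."""
    return len(set(values))


metrics = single_value_metrics(prices)
distinct_makes = cardinality(makes)
print(metrics)
print(distinct_makes)
```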

min, max, avg, sum

Example ("size": 0 suppresses the document list in the response):

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "price_max": {
      "max": {
        "field": "price"
      }
    },
    "price_min": {
      "min": {
        "field": "price"
      }
    },
    "avg_price": {
      "avg": {
        "field": "price"
      }
    },
    "sum_price": {
      "sum": {
        "field": "price"
      }
    }
  }
}

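cardinality

Cardinality: the cardinality of a set, that is, the number of distinct values, similar to COUNT(DISTINCT ...) in SQL.
Example ("size": 0 suppresses the document list in the response):

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "count_of_make": {
      "cardinality": {
        "field": "make.keyword"
      }
    }
  }
}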

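stats, extended stats

  • stats: returns a set of statistics for a numeric field: min, max, avg, sum, and count
  • extended stats: an extension of stats that adds further statistics such as variance and standard deviation

Example:

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "stats_price": {
      "stats": {
        "field": "price"
      }
    }
  }
}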
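
As a rough local sketch of what these two aggregations report, the Python below computes the stats fields and a few of the extended stats fields over the sample prices. The helper names are hypothetical, and the variance shown is the population variance, which is an assumption about which variant is meant.

```python
import statistics

# Prices of the eight sample transactions.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def stats(values):
    """The five values reported by a stats-style summary."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }


def extended_stats(values):
    """stats plus sum of squares, population variance and standard deviation."""
    result = stats(values)
    result["sum_of_squares"] = sum(v * v for v in values)
    result["variance"] = statistics.pvariance(values)
    result["std_deviation"] = statistics.pstdev(values)
    return result


basic = stats(prices)
extended = extended_stats(prices)
print(basic)
print(extended)
```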

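Percentile, Percentile Rank

  • Percentile: returns the value below which a given percentage of the observed values fall.
  • Percentile Rank: the inverse; returns the percentage of observed values that fall below a given value.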
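
The relationship between the two can be sketched locally in Python. This is only an illustration under stated assumptions: it uses the simple nearest-rank definition of a percentile and counts values at or below the threshold for the rank, whereas Elasticsearch computes percentiles with an approximation algorithm and may interpolate, so its numbers can differ. Both function names are hypothetical.

```python
import math

# Prices of the eight sample transactions.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def percentile_nearest_rank(values, p):
    """Smallest value such that at least p percent of values are <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def percentile_rank(values, threshold):
    """Percentage of values that are at or below the threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


median_price = percentile_nearest_rank(prices, 50)
p95_price = percentile_nearest_rank(prices, 95)
rank_of_20000 = percentile_rank(prices, 20000)
print(median_price, p95_price, rank_of_20000)
```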

Top Hits

Top Hits: typically used after bucketing to fetch the top matching documents within each bucket, that is, the detail records.
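
The idea can be sketched locally in Python: group the documents into buckets, sort each bucket, and keep the first few documents. The function top_hits_per_bucket is a hypothetical helper written for this illustration and does not call Elasticsearch.

```python
from collections import defaultdict

# The eight sample transactions (the "sold" field is omitted for brevity).
transactions = [
    {"price": 10000, "color": "red", "make": "honda"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue", "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 80000, "color": "red", "make": "bmw"},
    {"price": 25000, "color": "blue", "make": "ford"},
]


def top_hits_per_bucket(docs, bucket_field, sort_field, size):
    """Bucket docs by bucket_field, then keep the `size` highest docs by sort_field."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc)
    return {
        key: sorted(group, key=lambda d: d[sort_field], reverse=True)[:size]
        for key, group in buckets.items()
    }


top = top_hits_per_bucket(transactions, "make", "price", 2)
for make, hits in top.items():
    print(make, [hit["price"] for hit in hits])
```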

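For example, group by car make and fetch the two highest-priced transactions in each group (the aggregation is named group_by_color here, although it buckets on the make field):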

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "make.keyword"
      },
      "aggs": {
        "top_data": {
          "top_hits": {
            "size": 2,
            "_source": [
              "price",
              "color",
              "make"
            ],
            "sort": [
              {
                "price": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Result:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "honda",
          "doc_count" : 3,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 3, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "js_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "red", "price" : 20000, "make" : "honda" },
                  "sort" : [ 20000 ]
                },
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "ks_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "red", "price" : 20000, "make" : "honda" },
                  "sort" : [ 20000 ]
                }
              ]
            }
          }
        },
        {
          "key" : "ford",
          "doc_count" : 2,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 2, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "j8_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "green", "price" : 30000, "make" : "ford" },
                  "sort" : [ 30000 ]
                },
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "lM_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "blue", "price" : 25000, "make" : "ford" },
                  "sort" : [ 25000 ]
                }
              ]
            }
          }
        },
        {
          "key" : "toyota",
          "doc_count" : 2,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 2, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "kM_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "blue", "price" : 15000, "make" : "toyota" },
                  "sort" : [ 15000 ]
                },
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "kc_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "green", "price" : 12000, "make" : "toyota" },
                  "sort" : [ 12000 ]
                }
              ]
            }
          }
        },
        {
          "key" : "bmw",
          "doc_count" : 1,
          "top_data" : {
            "hits" : {
              "total" : { "value" : 1, "relation" : "eq" },
              "max_score" : null,
              "hits" : [
                {
                  "_index" : "cars", "_type" : "transactions", "_id" : "k8_K120B6sb1aJIMtJKa", "_score" : null,
                  "_source" : { "color" : "red", "price" : 80000, "make" : "bmw" },
                  "sort" : [ 80000 ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

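Bucketing Aggregations

A bucketing aggregation partitions the documents matched by the search into logical groups: every document that satisfies a given rule is placed in a bucket, and each bucket is associated with a key.

This is analogous to GROUP BY in MySQL.

The simplest bucketing strategy is the terms aggregation, which buckets directly by term. On a text field (which requires fielddata to be enabled), documents are bucketed by the tokens produced by analysis.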

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color.keyword"
      }
    },
    "group_by_make": {
      "terms": {
        "field": "make.keyword"
      }
    }
  }
}

Note: omitting .keyword causes an error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [color] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
...

Since Elasticsearch 5.x, dynamically mapped string fields are indexed as text with a keyword sub-field, and that sub-field supports exact-match queries and aggregations.
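
What bucketing by the exact (keyword) value amounts to can be sketched locally in Python with a counter keyed on the whole field value. This is an illustration only; terms_buckets is a hypothetical helper and no Elasticsearch call is involved.

```python
from collections import Counter

# Color and make of the eight sample transactions, in indexing order.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]
makes = ["honda", "honda", "ford", "toyota", "toyota", "honda", "bmw", "ford"]


def terms_buckets(values):
    """Count documents per exact value, largest bucket first."""
    return Counter(values).most_common()


color_buckets = terms_buckets(colors)
make_buckets = terms_buckets(makes)
print(color_buckets)
print(make_buckets)
```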

Range, Date Range

  • Range: defines buckets by specifying numeric ranges
  • Date Range: defines buckets by specifying date ranges
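
A detail worth seeing in code is how the boundaries are treated: in a range aggregation each bucket includes its "from" value and excludes its "to" value. The Python sketch below applies that rule locally to the sample prices; range_buckets is a hypothetical helper, and the three ranges are chosen for illustration.

```python
# Prices of the eight sample transactions.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def range_buckets(values, ranges):
    """Count values per range; 'from' is inclusive, 'to' is exclusive."""
    buckets = []
    for r in ranges:
        low = r.get("from")
        high = r.get("to")
        count = sum(
            1
            for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
        buckets.append({"from": low, "to": high, "doc_count": count})
    return buckets


ranges = [{"to": 20000}, {"from": 20000, "to": 30000}, {"from": 50000}]
buckets = range_buckets(prices, ranges)
counts = [b["doc_count"] for b in buckets]
print(counts)
```

With these ranges the counts add up to 7 rather than 8: the 30000 transaction is excluded from the middle bucket by the exclusive upper bound and is below the start of the last one, so it lands in no bucket.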

Example:

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "range_price": {
      "range": {
        "field": "price",
        "ranges": [
          {
            "to": 20000
          },
          {
            "from": 20000,
            "to": 30000
          },
          {
            "from": 50000
          }
        ]
      }
    }
  }
}

Result:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "range_price" : {
      "buckets" : [
        { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 3 },
        { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 3 },
        { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 1 }
      ]
    }
  }
}

Histogram, Date Histogram

  • Histogram: splits the data into buckets of a fixed interval
  • Date Histogram: a histogram over dates, one of the most commonly used aggregations in time-series analysis
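
The fixed-interval rule can be sketched locally in Python: each value is assigned to the bucket whose key is the value rounded down to a multiple of the interval, and empty buckets between the lowest and highest key are still reported. The histogram function and its min_bound and max_bound parameters are hypothetical names for this illustration; the bounds play the role of extended bounds that widen the reported key range.

```python
import math

# Prices of the eight sample transactions.
prices = [10000, 20000, 30000, 15000, 12000, 20000, 80000, 25000]


def histogram(values, interval, min_bound=None, max_bound=None):
    """Return (bucket_key, doc_count) pairs, including empty buckets in between."""
    counts = {}
    for v in values:
        key = math.floor(v / interval) * interval  # round down to the interval
        counts[key] = counts.get(key, 0) + 1
    low, high = min(counts), max(counts)
    if min_bound is not None:
        low = min(low, math.floor(min_bound / interval) * interval)
    if max_bound is not None:
        high = max(high, math.floor(max_bound / interval) * interval)
    return [(key, counts.get(key, 0)) for key in range(low, high + interval, interval)]


hist = histogram(prices, 20000, min_bound=10000, max_bound=80000)
print(hist)
```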

Example:

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "hist_price": {
      "histogram": {
        "field": "price",
        "interval": 20000,
        "extended_bounds": {
          "min": 10000,
          "max": 80000
        }
      }
    }
  }
}

Result:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "hist_price" : {
      "buckets" : [
        { "key" : 0.0, "doc_count" : 3 },
        { "key" : 20000.0, "doc_count" : 4 },
        { "key" : 40000.0, "doc_count" : 0 },
        { "key" : 60000.0, "doc_count" : 0 },
        { "key" : 80000.0, "doc_count" : 1 }
      ]
    }
  }
}

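Bucket + Metric Aggregations

A bucket aggregation can carry sub-aggregations that analyze each bucket further. A sub-aggregation can be either a bucket or a metric aggregation, which is what makes Elasticsearch aggregations so powerful.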
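
Conceptually, a sub-aggregation is just the inner analysis run once per bucket. A minimal local sketch in Python, with a metric (stats) computed inside each make bucket, might look like this; stats_by_bucket is a hypothetical helper and does not call Elasticsearch.

```python
from collections import defaultdict

# Make and price of the eight sample transactions.
transactions = [
    {"price": 10000, "make": "honda"},
    {"price": 20000, "make": "honda"},
    {"price": 30000, "make": "ford"},
    {"price": 15000, "make": "toyota"},
    {"price": 12000, "make": "toyota"},
    {"price": 20000, "make": "honda"},
    {"price": 80000, "make": "bmw"},
    {"price": 25000, "make": "ford"},
]


def stats_by_bucket(docs, bucket_field, metric_field):
    """Bucket docs by bucket_field, then compute stats on metric_field per bucket."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "sum": sum(values),
        }
        for key, values in grouped.items()
    }


per_make = stats_by_bucket(transactions, "make", "price")
for make, make_stats in per_make.items():
    print(make, make_stats)
```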

(1) Bucketing inside buckets

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_make": {
      "terms": {
        "field": "make.keyword"
      },
      "aggs": {
        "range_price": {
          "range": {
            "field": "price",
            "ranges": [
              {
                "to": 20000
              },
              {
                "from": 20000,
                "to": 30000
              },
              {
                "from": 50000
              }
            ]
          }
        }
      }
    }
  }
}

Result:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_make" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "honda",
          "doc_count" : 3,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 1 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 2 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 0 }
            ]
          }
        },
        {
          "key" : "ford",
          "doc_count" : 2,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 0 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 1 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 0 }
            ]
          }
        },
        {
          "key" : "toyota",
          "doc_count" : 2,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 2 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 0 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 0 }
            ]
          }
        },
        {
          "key" : "bmw",
          "doc_count" : 1,
          "range_price" : {
            "buckets" : [
              { "key" : "*-20000.0", "to" : 20000.0, "doc_count" : 0 },
              { "key" : "20000.0-30000.0", "from" : 20000.0, "to" : 30000.0, "doc_count" : 0 },
              { "key" : "50000.0-*", "from" : 50000.0, "doc_count" : 1 }
            ]
          }
        }
      ]
    }
  }
}

(2) Computing metrics inside buckets

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_make": {
      "terms": {
        "field": "make.keyword"
      },
      "aggs": {
        "stats_price": {
          "stats": {
            "field": "price"
          }
        }
      }
    }
  }
}

Result:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_make" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "honda",
          "doc_count" : 3,
          "stats_price" : { "count" : 3, "min" : 10000.0, "max" : 20000.0, "avg" : 16666.666666666668, "sum" : 50000.0 }
        },
        {
          "key" : "ford",
          "doc_count" : 2,
          "stats_price" : { "count" : 2, "min" : 25000.0, "max" : 30000.0, "avg" : 27500.0, "sum" : 55000.0 }
        },
        {
          "key" : "toyota",
          "doc_count" : 2,
          "stats_price" : { "count" : 2, "min" : 12000.0, "max" : 15000.0, "avg" : 13500.0, "sum" : 27000.0 }
        },
        {
          "key" : "bmw",
          "doc_count" : 1,
          "stats_price" : { "count" : 1, "min" : 80000.0, "max" : 80000.0, "avg" : 80000.0, "sum" : 80000.0 }
        }
      ]
    }
  }
}

Sorting in Aggregations

Group by make, then sort the buckets by average price in descending order:

GET /cars/transactions/_search
{
  "size": 0,
  "aggs": {
    "group_by_make": {
      "terms": {
        "field": "make.keyword",
        "order": {
          "avg_price": "desc"
        }
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

Result:

#! Deprecation: [types removal] Specifying types in search requests is deprecated.
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_make" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "bmw",
          "doc_count" : 1,
          "avg_price" : { "value" : 80000.0 }
        },
        {
          "key" : "ford",
          "doc_count" : 2,
          "avg_price" : { "value" : 27500.0 }
        },
        {
          "key" : "honda",
          "doc_count" : 3,
          "avg_price" : { "value" : 16666.666666666668 }
        },
        {
          "key" : "toyota",
          "doc_count" : 2,
          "avg_price" : { "value" : 13500.0 }
        }
      ]
    }
  }
}

 
