Elasticsearch SQL-like subquery aggregation
Posted: 2016-10-27 02:51:25

Question: I am evaluating ES to find out whether it can cover most of my scenarios. I am now at a point where I am trying to work out how to achieve a specific result that would be very simple in SQL.
Here is an example.
In Elasticsearch I have an index with these documents:
"Id": 1, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160101, "BestBeforeDate": 20160102, "BiteBy":"John"
"Id": 2, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160102, "BestBeforeDate": 20160104, "BiteBy":"Mat"
"Id": 3, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160103, "BestBeforeDate": 20160105, "BiteBy":"Mark"
"Id": 4, "Fruit": "Banana", "BoughtInStore"="Jungle", "BoughtDate"=20160104, "BestBeforeDate": 20160201, "BiteBy":"Simon"
"Id": 5, "Fruit": "Orange", "BoughtInStore"="Jungle", "BoughtDate"=20160112, "BestBeforeDate": 20160112, "BiteBy":"John"
"Id": 6, "Fruit": "Orange", "BoughtInStore"="Jungle", "BoughtDate"=20160114, "BestBeforeDate": 20160116, "BiteBy":"Mark"
"Id": 7, "Fruit": "Orange", "BoughtInStore"="Jungle", "BoughtDate"=20160120, "BestBeforeDate": 20160121, "BiteBy":"Simon"
"Id": 8, "Fruit": "Kiwi", "BoughtInStore"="Shop", "BoughtDate"=20160121, "BestBeforeDate": 20160121, "BiteBy":"Mark"
"Id": 8, "Fruit": "Kiwi", "BoughtInStore"="Jungle", "BoughtDate"=20160121, "BestBeforeDate": 20160121, "BiteBy":"Simon"
If I wanted to know how many fruits people bought in different stores within a specific date range, in SQL I would write something like this:
SELECT
    COUNT(DISTINCT kpi.Fruit) AS Fruits,
    kpi.BoughtInStore,
    kpi.BiteBy
FROM
(
    SELECT f1.Fruit, f1.BoughtInStore, f1.BiteBy
    FROM FruitsTable f1
    WHERE f1.BoughtDate = (
        SELECT MAX(f2.BoughtDate)
        FROM FruitsTable f2
        WHERE f1.Fruit = f2.Fruit
          AND f2.BoughtDate BETWEEN 20160101 AND 20160131
          AND f2.BestBeforeDate BETWEEN 20160101 AND 20160131
    )
) kpi
GROUP BY kpi.BoughtInStore, kpi.BiteBy
with a result like this:
"Fruits": 1, "BoughtInStore": "Jungle", "BiteBy"="Mark"
"Fruits": 1, "BoughtInStore": "Shop", "BiteBy"="Mark"
"Fruits": 2, "BoughtInStore": "Jungle", "BiteBy"="Simon"
Do you know how I could reach the same result in Elasticsearch with aggregations?

In short, the problems I am facing in Elasticsearch are:

- How to prepare a subset of the data before aggregating (in this example, the most recent row in the range for each fruit)
- How to group the results on multiple fields

Thanks
Answer 1: As far as I understand, there is no way to reference aggregation results in a filter of the same query. So you can solve only part of the puzzle with a single query:
GET /purchases/fruits/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "BoughtDate": {
            "gte": "2015-01-01",  // assuming you have the right mapping for dates
            "lte": "2016-03-01"
          }
        }
      }
    }
  },
  "sort": { "BoughtDate": { "order": "desc" } },
  "aggs": {
    "byBoughtDate": {
      "terms": {
        "field": "BoughtDate",
        "order": { "_term": "desc" }
      },
      "aggs": {
        "distinctCount": {
          "cardinality": {
            "field": "Fruit"
          }
        }
      }
    }
  }
}
This way you will have all the documents within the date range, and you will get the aggregation buckets sorted by term, so the maximum date will be at the top. The client can parse the first bucket (count and value) and then take the documents for that date value. For the distinct fruit count, you just use a nested cardinality aggregation.

Yes, the query returns much more information than you need, but that's life :)
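As a rough illustration of that client-side parsing step, here is a sketch in Python. The response fragment is hand-made for illustration (it follows the standard shape of a terms aggregation with a nested cardinality sub-aggregation); the function name is invented.

```python
# Sketch: pull the most recent date bucket and its distinct-fruit count
# out of an aggregation response like the query above would return.
# Buckets arrive sorted descending by term, so the first one is the max date.
sample_response = {
    "aggregations": {
        "byBoughtDate": {
            "buckets": [
                {"key_as_string": "2016-01-21", "doc_count": 2,
                 "distinctCount": {"value": 1}},
                {"key_as_string": "2016-01-20", "doc_count": 1,
                 "distinctCount": {"value": 1}},
            ]
        }
    }
}

def latest_bucket(response):
    """Return (date, doc_count, distinct_fruits) for the newest bucket."""
    bucket = response["aggregations"]["byBoughtDate"]["buckets"][0]
    return (bucket["key_as_string"],
            bucket["doc_count"],
            bucket["distinctCount"]["value"])

print(latest_bucket(sample_response))  # -> ('2016-01-21', 2, 1)
```

A second request can then fetch the documents whose BoughtDate equals the returned key.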
Answer 2: Sure, there is no direct route from SQL to the Elasticsearch DSL, but there are some pretty common correlations.
For starters, any GROUP BY / HAVING will come down to an aggregation. Normal query semantics can usually be covered (and then some) by the Query DSL.
How to prepare a subset of the data before aggregating (e.g., in this example, the most recent row in the range for each fruit)

So, you are kind of asking for two different things here.

How to prepare a subset of the data before aggregating

This is the query phase.

(e.g., in this example, the most recent row in the range for each fruit)

You are technically asking it to aggregate to get this answer for the example: this is not a normal query. In your example, you are doing a MAX to get it, which is effectively using a GROUP BY to get it.

How to group the results on multiple fields

It depends. Do you want them tiered (generally, yes) or do you want them together?

If you want them tiered, then you just use sub-aggregations to get what you want. If you want them combined, then you generally just use a filters aggregation for the different groupings.
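For the combined (non-tiered) case, a filters aggregation names each bucket explicitly. A minimal sketch as a Python dict, using the example's fields; the bucket names and the particular (store, person) combinations are invented for illustration:

```python
# Sketch: a `filters` aggregation that puts documents into explicit,
# non-hierarchical buckets -- one named bucket per grouping of interest --
# instead of tiered sub-aggregations. Bucket names are made up.
combined_groups = {
    "size": 0,
    "aggs": {
        "group_combined": {
            "filters": {
                "filters": {
                    "jungle_mark": {
                        "bool": {
                            "filter": [
                                {"term": {"BoughtInStore": "Jungle"}},
                                {"term": {"BiteBy": "Mark"}},
                            ]
                        }
                    },
                    "shop_mark": {
                        "bool": {
                            "filter": [
                                {"term": {"BoughtInStore": "Shop"}},
                                {"term": {"BiteBy": "Mark"}},
                            ]
                        }
                    },
                }
            }
        }
    }
}
```

Each named filter becomes one bucket in the response, with its own doc_count, so the client gets one row per grouping without nesting.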
Bringing it back together: you want the most recent purchase per fruit, given a certain filtered date range. The date ranges are just normal queries/filters:
"query":
"bool":
"filter": [
"range":
"BoughtDate":
"gte": "2016-01-01",
"lte": "2016-01-31"
,
"range":
"BestBeforeDate":
"gte": "2016-01-01",
"lte": "2016-01-31"
]
With this, no document is included in the request that is not within the date range for both fields (effectively an AND). Because I used a filter, it is unscored and cacheable.
Now, you need to start aggregating to get the rest of the information. Let's start by assuming the documents have already been filtered by the filter above, to simplify what we are looking at. We will combine everything at the end.
"size": 0,
"aggs":
"group_by_date":
"date_histogram":
"field": "BoughtDate",
"interval": "day",
"min_doc_count": 1
,
"aggs":
"group_by_store":
"terms":
"field": "BoughtInStore"
,
"aggs":
"group_by_person":
"terms":
"field": "BiteBy"
您希望"size" : 0
位于顶层,因为您实际上并不关心点击量。您只需要汇总结果。
您的第一个聚合实际上是按最近日期分组的。我对其进行了一些更改以使其更加真实(每天天),但实际上是相同的。您使用MAX
的方式,我们可以将terms
聚合与"size": 1
一起使用,但这更真实您希望在约会时(可能是时间! ) 参与。我还要求它忽略匹配文档中没有数据的日期(因为它从头到尾都在进行,我们实际上并不关心那些日子)。
如果您真的只想要最后一天,那么您可以使用管道聚合来删除除最大存储桶之外的所有内容,但此类请求的实际使用需要完整的日期范围。
然后,我们继续按商店分组,这正是您想要的。然后,我们按人分组 (BiteBy
)。这将隐含地为您提供计数。
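To turn those tiered date → store → person buckets into flat SQL-like rows on the client, one can walk the nested buckets. A sketch; the response fragment is hand-made but follows the standard nested terms-bucket shape, and the function name is invented:

```python
# Sketch: flatten nested date -> store -> person buckets into
# (date, store, person, count) rows, like the SQL result set.
sample_aggs = {
    "group_by_date": {
        "buckets": [
            {"key_as_string": "2016-01-21T00:00:00.000Z", "doc_count": 2,
             "group_by_store": {"buckets": [
                 {"key": "Jungle", "doc_count": 1,
                  "group_by_person": {"buckets": [
                      {"key": "Simon", "doc_count": 1}]}},
                 {"key": "Shop", "doc_count": 1,
                  "group_by_person": {"buckets": [
                      {"key": "Mark", "doc_count": 1}]}},
             ]}},
        ]
    }
}

def flatten(aggs):
    rows = []
    for day in aggs["group_by_date"]["buckets"]:
        for store in day["group_by_store"]["buckets"]:
            for person in store["group_by_person"]["buckets"]:
                rows.append((day["key_as_string"], store["key"],
                             person["key"], person["doc_count"]))
    return rows

for row in flatten(sample_aggs):
    print(row)
```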
Putting it all back together:
"size": 0,
"query":
"bool":
"filter": [
"range":
"BoughtDate":
"gte": "2016-01-01",
"lte": "2016-01-31"
,
"range":
"BestBeforeDate":
"gte": "2016-01-01",
"lte": "2016-01-31"
]
,
"aggs":
"group_by_date":
"date_histogram":
"field": "BoughtDate",
"interval": "day",
"min_doc_count": 1
,
"aggs":
"group_by_store":
"terms":
"field": "BoughtInStore"
,
"aggs":
"group_by_person":
"terms":
"field": "BiteBy"
Note: this is how I indexed the data.
PUT /grocery/store/_bulk
{"index":{"_id":"1"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-01","BestBeforeDate":"2016-01-02","BiteBy":"John"}
{"index":{"_id":"2"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-02","BestBeforeDate":"2016-01-04","BiteBy":"Mat"}
{"index":{"_id":"3"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-03","BestBeforeDate":"2016-01-05","BiteBy":"Mark"}
{"index":{"_id":"4"}}
{"Fruit":"Banana","BoughtInStore":"Jungle","BoughtDate":"2016-01-04","BestBeforeDate":"2016-02-01","BiteBy":"Simon"}
{"index":{"_id":"5"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-12","BestBeforeDate":"2016-01-12","BiteBy":"John"}
{"index":{"_id":"6"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-14","BestBeforeDate":"2016-01-16","BiteBy":"Mark"}
{"index":{"_id":"7"}}
{"Fruit":"Orange","BoughtInStore":"Jungle","BoughtDate":"2016-01-20","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}
{"index":{"_id":"8"}}
{"Fruit":"Kiwi","BoughtInStore":"Shop","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Mark"}
{"index":{"_id":"9"}}
{"Fruit":"Kiwi","BoughtInStore":"Jungle","BoughtDate":"2016-01-21","BestBeforeDate":"2016-01-21","BiteBy":"Simon"}
It is critical that the string values you want to aggregate on (store and person) are not_analyzed strings (keyword in ES 5.0)! Otherwise it will use what is called fielddata, and that is not a good thing.

The mappings look like this in ES 1.x / ES 2.x:
PUT /grocery
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "store": {
      "properties": {
        "Fruit": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BoughtInStore": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BiteBy": {
          "type": "string",
          "index": "not_analyzed"
        },
        "BestBeforeDate": {
          "type": "date"
        },
        "BoughtDate": {
          "type": "date"
        }
      }
    }
  }
}
With all of this together, the answer you get is:
"took": 8,
"timed_out": false,
"_shards":
"total": 1,
"successful": 1,
"failed": 0
,
"hits":
"total": 8,
"max_score": 0,
"hits": []
,
"aggregations":
"group_by_date":
"buckets": [
"key_as_string": "2016-01-01T00:00:00.000Z",
"key": 1451606400000,
"doc_count": 1,
"group_by_store":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Jungle",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "John",
"doc_count": 1
]
]
,
"key_as_string": "2016-01-02T00:00:00.000Z",
"key": 1451692800000,
"doc_count": 1,
"group_by_store":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Jungle",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Mat",
"doc_count": 1
]
]
,
"key_as_string": "2016-01-03T00:00:00.000Z",
"key": 1451779200000,
"doc_count": 1,
"group_by_store":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Jungle",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Mark",
"doc_count": 1
]
]
,
"key_as_string": "2016-01-12T00:00:00.000Z",
"key": 1452556800000,
"doc_count": 1,
"group_by_store":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Jungle",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "John",
"doc_count": 1
]
]
,
"key_as_string": "2016-01-14T00:00:00.000Z",
"key": 1452729600000,
"doc_count": 1,
"group_by_store":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Jungle",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Mark",
"doc_count": 1
]
]
,
"key_as_string": "2016-01-20T00:00:00.000Z",
"key": 1453248000000,
"doc_count": 1,
"group_by_store":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Jungle",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Simon",
"doc_count": 1
]
]
,
"key_as_string": "2016-01-21T00:00:00.000Z",
"key": 1453334400000,
"doc_count": 2,
"group_by_store":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Jungle",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Simon",
"doc_count": 1
]
,
"key": "Shop",
"doc_count": 1,
"group_by_person":
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
"key": "Mark",
"doc_count": 1
]
]
]
Comment: The workaround I noted above, of limiting the bucket aggregation to the max date, does not work with a date_histogram. Ironically, it works if I keep the values as numbers, as you originally showed.