ElasticSearch terms and cardinality performance with high cardinality fields

Posted 2023-04-15

【Title】: ElasticSearch terms and cardinality performance with high cardinality fields 【Posted】: 2017-05-28 07:58:23 【Question】:

TL;DR

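My ElasticSearch query takes a very long time compared with the same query on SQL Server. Am I doing something wrong? Is there any way to improve the performance of my query? Is this simply one of those things an RDBMS does better than NoSQL?

Premise

Let's say I have a company that takes orders and delivers the requested items.

I would like to know the average number of unique items per order. My order data is arranged per item ordered - each order has one or more records containing the order ID, item ID and more.

- I have a single-node setup for development purposes.
- The results (performance-wise) are the same whether I have 4 GB of heap space (on a 12 GB machine) or 16 GB of heap space (on a 32 GB machine).
- The index has billions of records, but the query filters it down to about 300,000 records.
- The order and item IDs are of type keyword (textual by nature) and I can't change that.
- In this particular case, the average unique item count is 1.65 - many orders contain only one unique item, others contain 2, and a few contain up to 25 unique items.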

The Problem

Using ElasticSearch, I have to use a Terms Aggregation to group documents by order ID, a Cardinality Aggregation to get the unique item count, and an Average Bucket aggregation to get the average item count per order.
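To make the three-step aggregation concrete, here is a minimal pure-Python sketch of what it computes. The function name and the sample records are hypothetical and only illustrate the logic. Unlike the cardinality aggregation, which is approximate, this version counts distinct items exactly.

```python
from collections import defaultdict


def average_unique_items(records):
    """Return the mean number of distinct itemIDs per orderID, or None if empty."""
    # Counterpart of the terms aggregation: one bucket (a set) per order ID.
    items_by_order = defaultdict(set)
    for record in records:
        # Counterpart of the cardinality sub-aggregation, but exact.
        items_by_order[record["orderID"]].add(record["itemID"])
    if not items_by_order:
        return None
    # Counterpart of avg_bucket: average the per-order distinct counts.
    unique_counts = [len(items) for items in items_by_order.values()]
    return sum(unique_counts) / len(unique_counts)


# Hypothetical records, one per ordered item (not taken from the real index).
sample_records = [
    {"orderID": "A1", "itemID": "X"},
    {"orderID": "A1", "itemID": "X"},  # same item ordered twice
    {"orderID": "A1", "itemID": "Y"},
    {"orderID": "B2", "itemID": "Z"},
]

print(average_unique_items(sample_records))
```

Order A1 has two distinct items and order B2 has one, so the per-order counts are averaged over two buckets.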

This takes about 23 seconds on both of my setups. The same query on the same dataset takes less than 2 seconds on SQL Server.

Additional Information

ElasticSearch Query


{
   "size":0,
   "query":{
      "bool":{
         "filter":[
            {
               ...
            }
         ]
      }
   },
   "aggs":{
      "OrdersBucket":{
         "terms":{
            "field":"orderID",
            "execution_hint":"global_ordinals_hash",
            "size":10000000
         },
         "aggs":{
            "UniqueItems":{
               "cardinality":{
                  "field":"itemID"
               }
            }
         }
      },
      "AverageItemCount":{
         "avg_bucket":{
            "buckets_path":"OrdersBucket>UniqueItems"
         }
      }
   }
}

At first, my query generated an OutOfMemoryException which brought my server down. Issuing the same request on my higher-RAM setup tripped the following circuit breaker:

[request] Data too large, data for [<reused_arrays>] would be
[14383258184/13.3gb], which is larger than the limit of
[10287002419/9.5gb]

ElasticSearch's GitHub has several (currently) open issues on this matter:

Cardinality aggregation should not reserve a fixed amount of memory per bucket #15892

global_ordinals execution mode for the terms aggregation has an adversarially impact on children aggregations that expect dense buckets #24788

Heap Explosion on even small cardinality queries in ES 5.3.1 / Kibana 5.3.1 #24359

All of this led me to use the execution hint "global_ordinals_hash", which allowed the query to complete successfully (albeit taking its time..)
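The circuit-breaker message and issue #15892 both point at memory reserved per bucket by the cardinality aggregation. As a rough sketch only: the Elasticsearch documentation gives a rule of thumb of about c * 8 bytes for a precision threshold of c, so a back-of-the-envelope estimate could look like the following. The function name, the bucket count, and the threshold values are assumptions chosen for illustration; the effective default threshold on the cluster in question is not known.

```python
def estimate_cardinality_memory_bytes(bucket_count, precision_threshold):
    """Rough estimate: about precision_threshold * 8 bytes per bucket."""
    return bucket_count * precision_threshold * 8


# Hypothetical inputs: one million order buckets, two candidate thresholds.
for threshold in (3000, 100):
    estimated = estimate_cardinality_memory_bytes(1_000_000, threshold)
    print(f"precision_threshold={threshold}: ~{estimated / 1024**3:.2f} GiB")
```

The estimate grows linearly with both the number of buckets and the threshold, which is why a very large terms "size" combined with a nested cardinality aggregation can exhaust the request breaker.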

Analogous SQL Query

SELECT AVG(CAST(uniqueCount.amount AS FLOAT)) FROM 
(   SELECT o.OrderID, COUNT(DISTINCT o.ItemID) AS amount 
    FROM Orders o
    WHERE ...
    GROUP BY o.OrderID 
) uniqueCount

As I said, this is very, very fast.

orderID field mapping


{
   "orderID":{
      "full_name":"orderID",
      "mapping":{
         "orderID":{
            "type":"keyword",
            "boost":1,
            "index":true,
            "store":false,
            "doc_values":true,
            "term_vector":"no",
            "norms":false,
            "index_options":"docs",
            "eager_global_ordinals":true,
            "similarity":"BM25",
            "fields":{
               "autocomplete":{
                  "type":"text",
                  "boost":1,
                  "index":true,
                  "store":false,
                  "doc_values":false,
                  "term_vector":"no",
                  "norms":true,
                  "index_options":"positions",
                  "eager_global_ordinals":false,
                  "similarity":"BM25",
                  "analyzer":"autocomplete",
                  "search_analyzer":"standard",
                  "search_quote_analyzer":"standard",
                  "include_in_all":true,
                  "position_increment_gap":-1,
                  "fielddata":false
               }
            },
            "null_value":null,
            "include_in_all":true,
            "ignore_above":2147483647,
            "normalizer":null
         }
      }
   }
}

I have set eager_global_ordinals in an attempt to boost performance, but to no avail.

Sample Document


{
   "_index": "81cec0acbca6423aa3c2feed5dbccd98",
   "_type": "order",
   "_id": "AVwpLZ7GK9DJVcpvrzss",
   "_score": 0,
   "_source": {
      ...
      "orderID": "904044A",
      "itemID": "23KN",
      ...
   }
}

Irrelevant fields were removed for brevity and because their content cannot be disclosed.

Sample Output


{
   "OrdersBucket":{
      "doc_count_error_upper_bound":0,
      "sum_other_doc_count":0,
      "buckets":[
         {
            "key":"910117A",
            "doc_count":16,
            "UniqueItems":{
               "value":16
            }
         },
         {
            "key":"910966A",
            "doc_count":16,
            "UniqueItems":{
               "value":16
            }
         },
         ...
         {
            "key":"912815A",
            "doc_count":1,
            "UniqueItems":{
               "value":1
            }
         },
         {
            "key":"912816A",
            "doc_count":1,
            "UniqueItems":{
               "value":1
            }
         }
      ]
   },
   "AverageItemCount":{
      "value":1.3975020363833832
   }
}

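Any help would be much appreciated :)

【Comments on the question】:

Can you please share a sample document and sample output? That would be helpful.

Edited these into the question. Though I fail to see how much they help (other than making the question more TL;DR worthy ;) )

Is it an option for you to change the structure of your index? It is generally recommended to index your data in a way that is easy to query. And since ES is a NoSQL database, we had better keep the data in denormalized form.

Hi @Richa, this is actually in a denormalized form... How would you go about denormalizing it further? Besides, this question can be generalized very easily - the pattern of a cardinality aggregation inside a terms aggregation followed by a pipeline aggregation isn't specific to my structure (which is, as far as I can tell, as denormalized as it can be). I'm currently testing the effect of making the field numeric, but the question remains - is this something for an RDBMS only, or is there anything I can do to keep using ES for this type of work?

The fields removed for brevity include date, customer ID, etc. In normalized form they would not be present at the item level, only at the order level.

【Answer 1】: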

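Apparently SQL Server does a good job of caching those results. Further investigation showed that the initial query took the same time as on ElasticSearch.

I will look into why those results aren't cached properly by ElasticSearch.

I also managed to convert the order ID to an integer, which dramatically improved performance (though SQL Server saw the same gain).

Also, as advised by Mark Harwood on the Elastic Forum, specifying precision_threshold on the cardinality aggregation lowered memory consumption a lot!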
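As a sketch of that last change, the request body from the question can be built with precision_threshold set on the cardinality aggregation. The helper name build_query and the value 100 are assumptions for illustration; the answer does not state which threshold was actually used. A small value is plausible here because the question says orders contain at most about 25 unique items. The filter clauses are elided in the question, so an empty list stands in for them.

```python
import json


def build_query(filters, precision_threshold=100):
    """Build the request body from the question, adding precision_threshold."""
    return {
        "size": 0,
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "OrdersBucket": {
                "terms": {
                    "field": "orderID",
                    "execution_hint": "global_ordinals_hash",
                    "size": 10000000,
                },
                "aggs": {
                    "UniqueItems": {
                        "cardinality": {
                            "field": "itemID",
                            # Counts below this threshold are expected to be
                            # close to accurate; lower values use less memory.
                            "precision_threshold": precision_threshold,
                        }
                    }
                },
            },
            "AverageItemCount": {
                "avg_bucket": {"buckets_path": "OrdersBucket>UniqueItems"}
            },
        },
    }


# Placeholder filter list; the real clauses are not shown in the question.
body = build_query(filters=[])
print(json.dumps(body, indent=3))
```

The printed JSON can be sent as the body of a search request against the index in question.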

So the answer is that, for this particular kind of query, ES performs at least as well as SQL Server.
