What is the effect of distributed_group_by_no_merge

Posted 2023-03-25

【Asked】: 2020-05-12 03:16:00
【Question】:

I know that with distributed_group_by_no_merge the distributed node does not merge the intermediate results coming from the shards.

Consider the following SQL:

select sum(xxxxx),xxxxx from (
    select sum(xxxx),xxxx 
    from (
        select count(xxx),xxx 
        from distributed_table group by xxx )  
    group by xxxx SETTINGS distributed_group_by_no_merge = 1
) group by xxxxx

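I would like to know which part of the SQL is sent to the MergeTree nodes for execution when distributed_group_by_no_merge is used. Is it this part? select count(xxx),xxx from distributed_table group by xxx ) group by xxxx SETTINGS distributed_group_by_no_merge = 1

How does the distributed_group_by_no_merge setting change the behavior of a distributed query? Which part of the SQL is executed on the MergeTree nodes, and which part is executed on the distributed node?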

【Answer 1】:

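The distributed_group_by_no_merge setting controls how the initiator node (the node that runs the distributed query) forms the final result of a distributed query:

either by merging the aggregated intermediate states from the shards itself, which requires copying the complete intermediate aggregation states from the shards to the initiator node [distributed_group_by_no_merge = 0 (default mode)]

or by receiving already final results from the shards, where each shard merges its intermediate aggregation states locally and sends only the final result to the initiator node. This significantly improves performance and reduces resource consumption, but it requires a correctly chosen sharding key [distributed_group_by_no_merge = 1]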
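To make the two modes concrete, here is a toy Python model, not ClickHouse code. It assumes a `count()` grouped by a single key, for which the intermediate state happens to equal the final value. The function names `shard_final`, `mode_0`, `mode_1` and both data placements are hypothetical and exist only for illustration.

```python
from collections import Counter

def shard_final(rows):
    """What one shard computes for a `count() ... GROUP BY key` over its local rows."""
    return Counter(rows)

def mode_0(shards):
    """Toy model of distributed_group_by_no_merge = 0: the initiator merges per-shard states."""
    merged = Counter()
    for rows in shards:
        merged.update(shard_final(rows))  # Counter.update adds the counts per key
    return sorted(merged.items())

def mode_1(shards):
    """Toy model of distributed_group_by_no_merge = 1: the initiator only concatenates rows."""
    out = []
    for rows in shards:
        out.extend(sorted(shard_final(rows).items()))
    return out

# Hypothetical placement where the same key lives on several shards.
scattered = [["a", "a", "b"], ["b", "c"], ["a", "c", "c"]]
# Hypothetical placement where the sharding key matches the GROUP BY key.
by_key = [["a", "a", "a"], ["b", "b"], ["c", "c", "c"]]

merged_rows = mode_0(scattered)     # one row per key
scattered_rows = mode_1(scattered)  # keys repeat, one row per (shard, key)
by_key_rows = mode_1(by_key)        # one row per key again

print(merged_rows)
print(scattered_rows)
print(by_key_rows)
```

In this model the no-merge mode returns the same rows as the merging mode only when every group key is stored on exactly one shard, which is the reason the sharding key has to be chosen to match the grouping.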

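I would put distributed_group_by_no_merge at the same level as the subquery that reads from the distributed table, to state your intent explicitly and to avoid confusion when there are several distributed subqueries.

Let's look at how to check the difference between the two modes (using the _shard_num virtual column):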

distributed_group_by_no_merge=0

SELECT
    groupUniqArray(_shard_num) AS shards,
    ..
FROM table
WHERE ..
GROUP BY ..
SETTINGS distributed_group_by_no_merge = 0

/* Aggregated states were merged into ONE result set on initiator-node.
┌─shards────┬─ ..
│ [2, 1, 3] │  ..
└───────────┴─ ..
*/

distributed_group_by_no_merge=1

SELECT
    groupUniqArray(_shard_num) AS shards,
    ..
FROM table
WHERE ..
GROUP BY ..
SETTINGS distributed_group_by_no_merge = 1

/* Get a set of final results (not aggregated states) from each shard. They should be unioned manually.
┌─shards─┬─ ..
│ [2]    │  ..
│ [1]    │  ..
│ [3]    │  ..
└────────┴─ ..
*/
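The "unioned manually" step can be sketched as an outer aggregation over the per-shard rows. The Python below is only an illustration under assumptions: the tuples in `per_shard_rows` are made-up data shaped like (shard number, group key, per-shard count), and `outer_group_by_sum` is a hypothetical helper standing in for an outer `GROUP BY` with `sum` on the initiator node.

```python
from collections import defaultdict

# Hypothetical rows returned with distributed_group_by_no_merge = 1:
# (shard number, group key, final count computed on that shard).
per_shard_rows = [
    (2, "a", 2), (2, "b", 1),
    (1, "b", 1), (1, "c", 1),
    (3, "a", 1), (3, "c", 2),
]

def outer_group_by_sum(rows):
    """Re-aggregate per-shard final rows, like an outer GROUP BY key with sum(count)."""
    totals = defaultdict(int)
    shards_seen = defaultdict(set)
    for shard, key, cnt in rows:
        totals[key] += cnt            # counts are additive, so summing is safe
        shards_seen[key].add(shard)   # remember which shards contributed
    return {key: (totals[key], sorted(shards_seen[key])) for key in totals}

combined = outer_group_by_sum(per_shard_rows)
print(combined)
```

This only works for results that can be recombined from final values, such as counts and sums. A per-shard average or distinct count cannot be summed this way, which is where merging intermediate states (the default mode) or a sharding key aligned with the grouping becomes necessary.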

See also: How to avoid merging high cardinality sub-select aggregations on distributed tables

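【Comments】:

1. When distributed_group_by_no_merge is not used, only the innermost part of the subquery, the one that queries distributed_table, is executed on the shards, and all other parts () are executed on the initiator node. Why is array join not executed on the MergeTree nodes?
2. When the subquery at the same level uses distributed_group_by_no_merge=1, that subquery is executed on the shards and the outer aggregation query is executed on the distributed node. Is that right?

1) I updated my answer, please take a look.
2) Yes, the "outer aggregation" runs on the initiator node regardless of the value of distributed_group_by_no_merge. The key difference is where the final result of the distributed query is computed: on the initiator node, or separately on each shard (understanding intermediate aggregation states and the merging of intermediate aggregation states helps clarify this topic).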
