The difference between countDistinct and distinct.count

Posted 2023-04-17

Asked: 2018-04-18 15:25:23

Question:

Why is the output of ..agg(countDistinct("member_id") as "count") different from that of ..distinct.count? Is it the same as the difference between select count(distinct member_id) and select distinct count(member_id)?

Answer 1:

"Why is the output of ..agg(countDistinct("member_id") as "count") different from that of ..distinct.count?"

Because .distinct.count is equivalent to:

SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)

while ..agg(countDistinct("member_id") as "count") is equivalent to:

SELECT COUNT(DISTINCT member_id) FROM table

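and COUNT(*) uses different rules than COUNT(column) when NULLs are encountered: COUNT(*) counts every row, including the single NULL row that DISTINCT keeps, whereas COUNT(DISTINCT member_id) ignores NULLs.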
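The NULL behaviour described above can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL (an assumption; the thread itself only shows the SQL), and the table name members and its sample values are hypothetical.

```python
import sqlite3

# SQLite stands in for Spark SQL here (assumption); "members" is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id INTEGER)")
conn.executemany("INSERT INTO members VALUES (?)", [(1,), (1,), (2,), (None,)])

# Counterpart of .distinct.count: DISTINCT keeps one NULL row, and COUNT(*) counts it.
distinct_then_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM members)"
).fetchone()[0]

# Counterpart of countDistinct("member_id"): COUNT(column) skips NULLs.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT member_id) FROM members"
).fetchone()[0]

print(distinct_then_count, count_distinct)  # prints "3 2"
conn.close()
```

With no NULLs in member_id, and member_id as the only column, both queries would return the same number.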

Answer 2:

df.agg(countDistinct("member_id") as "count")

returns the number of distinct values in the member_id column, ignoring all other columns, whereas

df.distinct.count

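counts the number of distinct records in the DataFrame, where two records are duplicates only if the values in all of their columns are identical.

For example, take this DataFrame: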

+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+

It has only one distinct member_id value but two distinct records, so the agg option returns 1 while the latter returns 2.
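The same three rows can be reproduced as a rough check. This sketch again uses Python's sqlite3 module in place of Spark (an assumption), with a hypothetical table named df that mirrors the DataFrame above.

```python
import sqlite3

# SQLite stands in for Spark here (assumption); table "df" mirrors the example DataFrame.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (member_name TEXT, member_id INTEGER)")
conn.executemany("INSERT INTO df VALUES (?, ?)", [("a", 1), ("b", 1), ("b", 1)])

# Counterpart of df.agg(countDistinct("member_id")): only member_id is considered.
distinct_ids = conn.execute(
    "SELECT COUNT(DISTINCT member_id) FROM df"
).fetchone()[0]

# Counterpart of df.distinct.count: whole rows are deduplicated, then counted.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT member_name, member_id FROM df)"
).fetchone()[0]

print(distinct_ids, distinct_rows)  # prints "1 2"
conn.close()
```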

Comments:

Then please state that in the question... or at least the DataFrame's schema.

Answer 3:

The first command:

DF.agg(countDistinct("member_id") as "count")

returns the same result as select count(distinct member_id) from DF.

The second command:

DF.distinct.count

actually takes the distinct records, that is, removes all duplicate rows from DF, and then counts what remains.

Comments:

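So, technically, should the output be the same?

No. The first is specific to the member ID column. The second works on all columns. More specifically, the second is a group by on all columns, while the first is a group by on one column.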
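The group-by reading in the last comment can be sketched as follows. SQLite (via Python's sqlite3) is used as a stand-in for Spark, and the table name df and its rows are hypothetical, reusing the sample data from Answer 2.

```python
import sqlite3

# SQLite stands in for Spark here (assumption); "df" is a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (member_name TEXT, member_id INTEGER)")
conn.executemany("INSERT INTO df VALUES (?, ?)", [("a", 1), ("b", 1), ("b", 1)])

# Grouping on every column and counting the groups matches distinct.count.
groups_all_columns = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM df GROUP BY member_name, member_id)"
).fetchone()[0]

# Grouping on member_id alone matches countDistinct here, since no NULLs are present.
groups_one_column = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 1 FROM df GROUP BY member_id)"
).fetchone()[0]

print(groups_all_columns, groups_one_column)  # prints "2 1"
conn.close()
```

The analogy is approximate: if member_id contained NULLs, the one-column GROUP BY would count a NULL group, which COUNT(DISTINCT member_id) does not.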