SQL: Jaccard similarity
【Title】: SQL - Jaccard similarity
【Posted】: 2016-04-18 22:09:32
【Question】: My table looks like this:
author | group
daniel | group1,group2,group3,group4,group5,group8,group10
adam | group2,group5,group11,group12
harry | group1,group10,group15,group13,group15,group18
...
...
I would like my output to look like:
author1 | author2 | intersection | union
daniel | adam | 2 | 9
daniel | harry | 2 | 11
adam | harry | 0 | 10
Thank you.
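【Comments】:
You should not store the list of groups as a string; that makes this very difficult. Use a proper junction table instead.
【Answer 1】: Try the query below (for BigQuery Legacy SQL):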
SELECT
a.author AS author1,
b.author AS author2,
SUM(a.item=b.item) AS intersection,
EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author
GROUP BY 1,2
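The query above relies on the inclusion-exclusion identity |A ∪ B| = |A| + |B| - |A ∩ B|. As a quick cross-check outside BigQuery, here is a minimal Python sketch of the same arithmetic on the question's sample rows. The `rows` dictionary and the `pair_counts` helper are illustrative names, not part of the answer.

```python
from itertools import combinations

# Sample rows from the question (harry's list repeats group15).
rows = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group15,group13,group15,group18",
}


def pair_counts(rows):
    """Return (author1, author2, intersection, union) for every author pair."""
    items = {author: grp.split(",") for author, grp in rows.items()}
    result = []
    for a, b in combinations(sorted(items), 2):  # sorted pairs, so a < b
        # Count matching item pairs, like SUM(a.item = b.item) over the cross join.
        intersection = sum(x == y for x in items[a] for y in items[b])
        # Distinct counts, like EXACT_COUNT_DISTINCT, then inclusion-exclusion.
        union = len(set(items[a])) + len(set(items[b])) - intersection
        result.append((a, b, intersection, union))
    return result


for row in pair_counts(rows):
    print(row)
```

Counting matching item pairs equals the set intersection only when no shared item is duplicated in either list. That holds here, because the repeated group15 belongs to harry alone.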
Added a solution for BigQuery Standard SQL:
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, SPLIT(grp) AS grp
FROM YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
(SELECT COUNT(1) FROM a.grp) AS count1,
(SELECT COUNT(1) FROM b.grp) AS count2,
(SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
(SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author
I like this one better: the code is simpler and friendlier, and it needs no CROSS JOIN or extra GROUP BY.
If you try it, make sure to uncheck the Use Legacy SQL checkbox under Show Options.
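【Comments】:
【Answer 2】: Inspired by Mikhail Berlyant's second answer, here is essentially the same approach reformatted for Presto (as another example in a different SQL flavor). Thanks again, Mikhail.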
WITH
YourTable AS (
SELECT
'daniel' AS author,
'group1,group2,group3,group4,group5,group8,group10' AS grp
UNION ALL
SELECT
'adam' AS author,
'group2,group5,group11,group12' AS grp
UNION ALL
SELECT
'harry' AS author,
'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT
author,
SPLIT(grp, ',') AS grp
FROM
YourTable
)
SELECT
a.author AS author1,
b.author AS author2,
CARDINALITY(a.grp) AS count1,
CARDINALITY(b.grp) AS count2,
CARDINALITY(ARRAY_INTERSECT(a.grp, b.grp)) AS intersection_count,
CARDINALITY(ARRAY_UNION(a.grp, b.grp)) AS union_count
FROM tempTable a
JOIN tempTable b ON a.author < b.author
;
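Note that the counts for harry and for union_count differ slightly from the question's expected output, because only unique entries are counted: harry has two group15 values in the question, but only one is counted here: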
author1 | author2 | count1 | count2 | intersection_count | union_count
---------+---------+--------+--------+--------------------+-------------
daniel | harry | 7 | 5 | 2 | 10
adam | harry | 4 | 5 | 0 | 9
adam | daniel | 4 | 7 | 2 | 9
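To see how duplicates affect these columns, here is a rough Python analogue of the three Presto array functions. It assumes (worth confirming against the Presto documentation) that CARDINALITY counts every array element, while ARRAY_INTERSECT and ARRAY_UNION return de-duplicated results. It uses the question's original harry row, which repeats group15; the helper name `presto_like_counts` is made up for illustration.

```python
def presto_like_counts(grp_a, grp_b):
    """Mimic count1, count2, intersection_count and union_count for two CSV lists."""
    a = grp_a.split(",")  # like SPLIT(grp, ',')
    b = grp_b.split(",")
    return (
        len(a),                # like CARDINALITY(a.grp): duplicates included
        len(b),                # like CARDINALITY(b.grp): duplicates included
        len(set(a) & set(b)),  # like CARDINALITY(ARRAY_INTERSECT(...)): de-duplicated
        len(set(a) | set(b)),  # like CARDINALITY(ARRAY_UNION(...)): de-duplicated
    )


daniel = "group1,group2,group3,group4,group5,group8,group10"
harry_with_duplicate = "group1,group10,group15,group13,group15,group18"

print(presto_like_counts(daniel, harry_with_duplicate))  # (7, 6, 2, 10)
```

With the duplicate present, count2 becomes 6 while union_count stays at 10. Computing the union as count1 + count2 - intersection_count would give 11 instead, which is the figure in the question's expected output.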
【Comments】:
【Answer 3】: I suggest this option, which scales better:
WITH YourTable AS (
SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
SELECT author, grp
FROM YourTable, UNNEST(SPLIT(grp)) as grp
),
intersection AS (
SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
FROM tempTable a
JOIN tempTable b
USING (grp)
WHERE a.author > b.author
GROUP BY a.author, b.author
),
count_distinct_groups AS (
SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
FROM tempTable
GROUP BY author
),
join_it AS (
SELECT
intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
FROM
intersection
JOIN
count_distinct_groups cg1
ON
intersection.author1 = cg1.author
JOIN
count_distinct_groups cg2
ON
intersection.author2 = cg2.author
)
SELECT
*,
count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
join_it
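On big data (tens of thousands x millions), the full cross join fails because of too much shuffling, and the second proposal takes hours to execute. This one takes minutes.
A consequence of this approach is that pairs with no intersection do not appear in the output, so the process that consumes the result is responsible for handling them, for example with IFNULL.
One last detail: the union of Daniel and Harry is 10 rather than 11, because group15 is repeated in the original example.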
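The idea of joining on the group instead of cross-joining authors can be sketched in plain Python as an inverted index from group to authors. This is only an illustration of the technique under that reading of the query; `jaccard_by_shared_group` and the `rows` dictionary are hypothetical names, and the sketch says nothing about BigQuery's actual execution plan.

```python
from collections import defaultdict
from itertools import combinations

rows = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group13,group15,group18",
}


def jaccard_by_shared_group(rows):
    """Return {(author1, author2): jaccard} for pairs sharing at least one group."""
    groups_of = {author: set(grp.split(",")) for author, grp in rows.items()}

    # Inverted index: group -> authors in it (the role of JOIN ... USING (grp)).
    authors_in = defaultdict(set)
    for author, groups in groups_of.items():
        for group in groups:
            authors_in[group].add(author)

    # Count shared groups per pair; pairs with nothing in common never show up.
    intersection = defaultdict(int)
    for authors in authors_in.values():
        for a, b in combinations(sorted(authors), 2):
            intersection[(b, a)] += 1  # keep author1 > author2, as in the query

    result = {}
    for (a, b), shared in intersection.items():
        union = len(groups_of[a]) + len(groups_of[b]) - shared
        result[(a, b)] = shared / union
    return result


scores = jaccard_by_shared_group(rows)
print(scores)
# Missing pairs need a default, which is the job IFNULL would do in SQL.
print(scores.get(("harry", "adam"), 0.0))
```

Only pairs that share a group are ever generated, so the work grows with the number of co-memberships rather than with the number of author pairs.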
【Comments】:
【Answer 4】: 2021 update
Try the jaccard() function in the official public bqutil project:
Example:
SELECT bqutil.fn.jaccard('thanks', 'thanxs')
Output:
0.71
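Note that this call compares two strings rather than two lists of groups. The output is consistent with treating each string as a set of characters; that is an assumption about the function's behavior inferred from this one example, not something taken from its source. A small Python sketch of that reading (the `char_jaccard` name is hypothetical):

```python
def char_jaccard(s1, s2):
    """Jaccard similarity of the character sets of two strings, rounded to 2 places."""
    a, b = set(s1), set(s2)
    if not a and not b:
        return 0.0  # arbitrary choice for two empty strings
    return round(len(a & b) / len(a | b), 2)


print(char_jaccard("thanks", "thanxs"))  # 0.71
```

The two words share five distinct characters (t, h, a, n, s) out of seven in total, and 5 / 7 rounds to 0.71.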
【Comments】: