SQL: Jaccard similarity

Posted 2023-03-24

【Title】: SQL - Jaccard similarity 【Posted】: 2016-04-18 22:09:32 【Question】:

My table looks like this:

author | group 

daniel | group1,group2,group3,group4,group5,group8,group10
adam   | group2,group5,group11,group12
harry  | group1,group10,group15,group13,group15,group18
...
...

I would like my output to look like this:

author1 | author2 | intersection | union

daniel | adam | 2 | 9
daniel | harry| 2 | 11
adam   | harry| 0 | 10

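Thank you.

【Comments】:

You should not store lists of groups as a string. That is going to make this very hard. You should use a proper junction table.

【Answer 1】: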

Try the query below (for BigQuery legacy SQL):

SELECT
  a.author AS author1, 
  b.author AS author2, 
  SUM(a.item=b.item) AS intersection, 
  EXACT_COUNT_DISTINCT(a.item) + EXACT_COUNT_DISTINCT(b.item) - intersection AS [union]
FROM FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS a
CROSS JOIN FLATTEN((
  SELECT author, SPLIT([group]) AS item FROM YourTable
), item) AS b
WHERE a.author < b.author 
GROUP BY 1,2

Added a solution for BigQuery Standard SQL:

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
  SELECT author, SPLIT(grp) AS grp
  FROM YourTable
)
SELECT 
  a.author AS author1, 
  b.author  AS author2,
  (SELECT COUNT(1) FROM a.grp) AS count1,
  (SELECT COUNT(1) FROM b.grp) AS count2,
  (SELECT COUNT(1) FROM UNNEST(a.grp) AS agrp JOIN UNNEST(b.grp) AS bgrp ON agrp = bgrp) AS intersection_count,
  (SELECT COUNT(1) FROM (SELECT * FROM UNNEST(a.grp) UNION DISTINCT SELECT * FROM UNNEST(b.grp))) AS union_count
FROM tempTable a
JOIN tempTable b
ON a.author < b.author

I like this one more: the code is simpler and friendlier, and it needs no CROSS JOIN or extra GROUP BY.

If you try it, make sure to uncheck the Use Legacy SQL checkbox under Show Options.

【Answer 2】:

Inspired by Mikhail Berlyant's second answer, here is essentially the same approach reformatted for Presto (as another example in a different SQL dialect). Thanks again, Mikhail.

WITH
YourTable AS (
    SELECT
        'daniel' AS author,
        'group1,group2,group3,group4,group5,group8,group10' AS grp
    UNION ALL
    SELECT
        'adam' AS author,
        'group2,group5,group11,group12' AS grp
    UNION ALL
    SELECT
        'harry' AS author,
        'group1,group10,group13,group15,group18' AS grp
),
tempTable AS (
    SELECT
        author,
        SPLIT(grp, ',') AS grp
    FROM
        YourTable
)
SELECT
  a.author AS author1,
  b.author  AS author2,
  CARDINALITY(a.grp) AS count1,
  CARDINALITY(b.grp) AS count2,
  CARDINALITY(ARRAY_INTERSECT(a.grp, b.grp)) AS intersection_count,
  CARDINALITY(ARRAY_UNION(a.grp, b.grp)) AS union_count
FROM tempTable a
JOIN tempTable b ON a.author < b.author
;

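Note that the count for harry and the union_count values differ slightly from the question's expected output, because only unique entries are counted: harry lists group15 twice in the question, but it is counted only once here: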

 author1 | author2 | count1 | count2 | intersection_count | union_count
---------+---------+--------+--------+--------------------+-------------
 daniel  | harry   |      7 |      5 |                  2 |          10
 adam    | harry   |      4 |      5 |                  0 |           9
 adam    | daniel  |      4 |      7 |                  2 |           9
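
The gap between these numbers and the question's expected output comes down to list versus set semantics. Here is a minimal Python sketch, assuming the question's original rows (including harry's repeated group15). The names `rows` and `pair_counts` are illustrative and not part of any query in this thread:

```python
from itertools import combinations

# Sample rows from the question; harry lists group15 twice.
rows = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group15,group13,group15,group18",
}

def pair_counts(rows):
    """Return {(author1, author2): (intersection, union)} using set semantics."""
    # Converting each list to a set drops duplicate group names.
    groups = {author: set(csv.split(",")) for author, csv in rows.items()}
    result = {}
    # sorted() + combinations() yields each pair once, with author1 < author2.
    for a, b in combinations(sorted(groups), 2):
        result[(a, b)] = (len(groups[a] & groups[b]), len(groups[a] | groups[b]))
    return result

for (a, b), (inter, union) in pair_counts(rows).items():
    print(a, b, inter, union)
```

With set semantics the daniel/harry union is 10 and the adam/harry union is 9. The 11 and 10 in the question match what you get when harry's six list entries are counted as they are (7 + 6 - 2 and 4 + 6 - 0).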

【Answer 3】:

I suggest this option, which scales better:

WITH YourTable AS (
  SELECT 'daniel' AS author, 'group1,group2,group3,group4,group5,group8,group10' AS grp UNION ALL
  SELECT 'adam' AS author, 'group2,group5,group11,group12' AS grp UNION ALL
  SELECT 'harry' AS author, 'group1,group10,group13,group15,group18' AS grp
),

tempTable AS (
  SELECT author, grp
  FROM YourTable, UNNEST(SPLIT(grp)) as grp
),

intersection AS (
  SELECT a.author AS author1, b.author AS author2, COUNT(1) as intersection
  FROM tempTable a 
  JOIN tempTable b
  USING (grp)
  WHERE a.author > b.author
  GROUP BY a.author, b.author
),

count_distinct_groups AS (
  SELECT author, COUNT(DISTINCT grp) as count_distinct_groups
  FROM tempTable
  GROUP BY author
),

join_it AS (
  SELECT
    intersection.*, cg1.count_distinct_groups AS count_distinct_groups1, cg2.count_distinct_groups AS count_distinct_groups2
  FROM
    intersection
  JOIN
    count_distinct_groups cg1
  ON
    intersection.author1 = cg1.author
  JOIN
    count_distinct_groups cg2
  ON
    intersection.author2 = cg2.author
)

SELECT
  *,
  count_distinct_groups1 + count_distinct_groups2 - intersection AS unionn,
  intersection / (count_distinct_groups1 + count_distinct_groups2 - intersection) AS jaccard
FROM
  join_it

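With big data (tens of thousands x millions), the full cross join fails because of too much shuffling, and the second proposal takes hours to execute. This one takes minutes.

A consequence of this approach is that pairs with no intersection do not appear in the result, so the process that consumes it is responsible for handling them (for example with IFNULL).

One last detail: the union of Daniel and Harry is 10 and not 11, because group15 is repeated in the initial example.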
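
For readers who want to try the idea without BigQuery, here is a rough sketch of the same join-on-group approach against a normalized table, as the comment on the question suggests. It runs in SQLite through Python's `sqlite3` module. The table name `author_group` and the CTE names are hypothetical. Unlike the query above, it pairs authors up front and uses a LEFT JOIN with COALESCE so that pairs with no intersection still appear. That pairing step is an assumption that only suits a small number of authors:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical junction table: one row per (author, grp) membership.
conn.execute(
    "CREATE TABLE author_group (author TEXT, grp TEXT, PRIMARY KEY (author, grp))"
)

rows = {
    "daniel": "group1,group2,group3,group4,group5,group8,group10",
    "adam": "group2,group5,group11,group12",
    "harry": "group1,group10,group13,group15,group18",
}
for author, csv in rows.items():
    # OR IGNORE silently skips duplicate (author, grp) pairs.
    conn.executemany(
        "INSERT OR IGNORE INTO author_group VALUES (?, ?)",
        [(author, g) for g in csv.split(",")],
    )

query = """
WITH sizes AS (
  SELECT author, COUNT(*) AS n FROM author_group GROUP BY author
),
pairs AS (
  SELECT a.author AS author1, b.author AS author2, a.n AS n1, b.n AS n2
  FROM sizes a JOIN sizes b ON a.author < b.author
),
overlap AS (
  SELECT a.author AS author1, b.author AS author2, COUNT(*) AS intersection
  FROM author_group a
  JOIN author_group b ON a.grp = b.grp AND a.author < b.author
  GROUP BY a.author, b.author
)
SELECT p.author1, p.author2,
       COALESCE(o.intersection, 0) AS intersection,
       p.n1 + p.n2 - COALESCE(o.intersection, 0) AS union_count,
       1.0 * COALESCE(o.intersection, 0)
           / (p.n1 + p.n2 - COALESCE(o.intersection, 0)) AS jaccard
FROM pairs p
LEFT JOIN overlap o ON o.author1 = p.author1 AND o.author2 = p.author2
ORDER BY p.author1, p.author2
"""
results = conn.execute(query).fetchall()
for row in results:
    print(row)
```

The union is derived by inclusion-exclusion (size1 + size2 - intersection), as in the query above, so no per-pair set union is ever materialized.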

【Answer 4】:

2021 update

Try the jaccard() function in the official public bqutil project:

Example:

SELECT bqutil.fn.jaccard('thanks', 'thanxs')

Output:

0.71
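
Note that this call compares two plain strings, not two comma-separated group lists. The 0.71 is consistent with a Jaccard index over the distinct characters of each string (5 shared characters out of 7 distinct ones). The Python sketch below reproduces that arithmetic. It assumes that character-set interpretation, which the answer itself does not confirm, and `char_jaccard` is an illustrative name:

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard index over the sets of distinct characters in two strings."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # arbitrary choice for two empty strings in this sketch
    return len(sa & sb) / len(sa | sb)

score = char_jaccard("thanks", "thanxs")
print(round(score, 2))
```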
