In a many-to-many join table, how can I count the number of entries shared by two "owners"?

Posted 2023-02-24

【Title】: In a many-to-many join table, how can I count the number of entries shared by two "owners"? 【Posted】: 2021-04-08 00:13:21 【Question】:

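I have a list of movies and a list of tropes. To calculate the similarity between two movies, I use cosine similarity. If all the weights are equal, it simplifies nicely:

similarity =

(number of shared tropes between both movies)
/
(SQRT(number of tropes from movie 1) * SQRT(number of tropes from movie 2))

For example, if movie 1 has tropes 1, 3, and 4, and movie 2 has tropes 1, 4, 6, and 7, then they share two tropes, and the similarity is

2 / (SQRT(3) * SQRT(4)) = 2 / 3.46... = 0.58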
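
As a quick check of the arithmetic, here is a minimal Python sketch. It assumes the standard definition of cosine similarity over 0/1 vectors, where the shared count is divided by the product of the two square roots. The function name `trope_similarity` is hypothetical and not part of the original question.

```python
import math


def trope_similarity(tropes_a, tropes_b):
    """Cosine similarity of two trope sets treated as 0/1 vectors (hypothetical helper)."""
    a, b = set(tropes_a), set(tropes_b)
    if not a or not b:
        # An empty set has zero length as a vector; treat similarity as 0.
        return 0.0
    shared = len(a & b)
    return shared / (math.sqrt(len(a)) * math.sqrt(len(b)))


# Movie 1 has tropes 1, 3, 4; movie 2 has tropes 1, 4, 6, 7.
similarity = trope_similarity({1, 3, 4}, {1, 4, 6, 7})
print(round(similarity, 2))  # 0.58
```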

My MySQL tables are pretty standard:

movies:
- id
- ...

tropes:
- id
- ...

movie_tropes:
- movie_id
- trope_id

I can easily count the number of tropes for a single movie:

SELECT count(distinct trope_id) from movie_tropes where movie_id = 1;
SELECT count(distinct trope_id) from movie_tropes where movie_id = 2;

I'm a bit rusty with SQL. Is there a simple join-y way to count the number of trope_ids that appear for both movie 1 and movie 2 in this join table?

【Comments】:

【Answer 1】:

Is there a simple way to count the number of trope_ids that appear for both movie 1 and movie 2?

You can self-join:

select count(distinct t1.trope_id)  -- qualified: both t1 and t2 expose trope_id
from movie_tropes t1
inner join movie_tropes t2 on t2.trope_id = t1.trope_id
where t1.movie_id = 1 and t2.movie_id = 2
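
To try the self-join without a MySQL server, the following sketch runs it against an in-memory SQLite database from Python. SQLite is only a stand-in here (an assumption, not the asker's setup), and the sample rows mirror the example from the question. The counted column is written as `t1.trope_id` because both aliases expose a `trope_id` column.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_tropes (movie_id INTEGER, trope_id INTEGER)")

# Movie 1 has tropes 1, 3, 4; movie 2 has tropes 1, 4, 6, 7.
rows = [(1, 1), (1, 3), (1, 4), (2, 1), (2, 4), (2, 6), (2, 7)]
conn.executemany("INSERT INTO movie_tropes VALUES (?, ?)", rows)

# Self-join: pair each trope of the first movie with the same trope of the second.
shared = conn.execute(
    """
    SELECT COUNT(DISTINCT t1.trope_id)
    FROM movie_tropes t1
    INNER JOIN movie_tropes t2 ON t2.trope_id = t1.trope_id
    WHERE t1.movie_id = ? AND t2.movie_id = ?
    """,
    (1, 2),
).fetchone()[0]

print(shared)  # 2
```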

More generally, though, you can compute all three cardinalities at once using two levels of aggregation. I would recommend:

select 
    sum(has_1) as cnt_1,            -- count of distinct tropes for movie 1
    sum(has_2) as cnt_2,            -- count of distinct tropes for movie 2
    sum(has_1 and has_2) as cnt_both  -- count of distinct tropes for both movies
from (
    select max(movie_id = 1) has_1, max(movie_id = 2) as has_2
    from movie_tropes t
    where movie_id in (1, 2)
    group by trope_id
) t
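
The two-level aggregation can be exercised the same way. This is a sketch under the assumption that SQLite evaluates `movie_id = 1` to 0 or 1 as MySQL does; the sample data and the alias `per_trope` are illustrative.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_tropes (movie_id INTEGER, trope_id INTEGER)")
rows = [(1, 1), (1, 3), (1, 4), (2, 1), (2, 4), (2, 6), (2, 7)]
conn.executemany("INSERT INTO movie_tropes VALUES (?, ?)", rows)

# Inner query: one row per trope with 0/1 flags for each movie.
# Outer query: sum the flags to get the three cardinalities in one pass.
cnt_1, cnt_2, cnt_both = conn.execute(
    """
    SELECT SUM(has_1), SUM(has_2), SUM(has_1 AND has_2)
    FROM (
        SELECT MAX(movie_id = 1) AS has_1, MAX(movie_id = 2) AS has_2
        FROM movie_tropes
        WHERE movie_id IN (1, 2)
        GROUP BY trope_id
    ) AS per_trope
    """
).fetchone()

print(cnt_1, cnt_2, cnt_both)  # 3 4 2
```

The three numbers are exactly the inputs the similarity formula in the question needs, so a single query replaces the two separate counts plus the join.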

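【Comments】:

I understand how the sums work once you have columns of 0s and 1s indicating whether each trope is included in a movie (or not). How does the inner select work, and how does it pick 1 or 0 for each trope?

Here, the inner query groups by trope_id. If a trope_id appears with movie_id = 1, then max(movie_id = 1) returns 1; otherwise it returns 0. The same goes for movie_id = 2. The outer query then uses sum over those binary 1s and 0s.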
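
To make the inner query's behaviour visible, this sketch prints its per-trope rows for the example data. As before, SQLite stands in for MySQL (an assumption), and `trope_id` is added to the select list purely for readability.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_tropes (movie_id INTEGER, trope_id INTEGER)")
rows = [(1, 1), (1, 3), (1, 4), (2, 1), (2, 4), (2, 6), (2, 7)]
conn.executemany("INSERT INTO movie_tropes VALUES (?, ?)", rows)

# Within each trope_id group, (movie_id = 1) is 1 for rows of movie 1 and 0
# otherwise, so MAX(...) is 1 exactly when movie 1 has that trope.
per_trope = conn.execute(
    """
    SELECT trope_id, MAX(movie_id = 1) AS has_1, MAX(movie_id = 2) AS has_2
    FROM movie_tropes
    WHERE movie_id IN (1, 2)
    GROUP BY trope_id
    ORDER BY trope_id
    """
).fetchall()

for trope_id, has_1, has_2 in per_trope:
    print(trope_id, has_1, has_2)
# 1 1 1
# 3 1 0
# 4 1 1
# 6 0 1
# 7 0 1
```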
