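Creating mutually exclusive groupings in SQL (tables with pairs)
【Posted】: 2020-10-02 17:51:27
【Question】: I'm looking for some help structuring a query. I have a table whose rows contain a link timestamp, user_id, linked_id, and type_of_link. The link types are, for example, "email" vs. "phone number", so in the example below you can see that user 1 is not directly connected to user 3, but is connected through user 2. Another complication is that every "linked account" also appears in the user id column, which means there are several "duplicate" rows (in the example: rows 1+2 and rows 3+4).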
For example:
Link time          user id   linked_id   link type
---------------------------------------------------
link_occurred_at   user 1    user 2      link a
link_occurred_at   user 2    user 1      link a
link_occurred_at   user 2    user 3      link b
link_occurred_at   user 3    user 2      link b
link_occurred_at   user 4    user 5      link a
link_occurred_at   user 5    user 4      link a
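What functions could I use to get the first user id, a count of all (direct + indirect) linked accounts, and possibly an array of the linked account ids?
For example, the output I'd want is:
initial user   count of linked accounts   array of linked accounts
--------------------------------------------------------------
user 1         2                          [user 2, user 3]
user 4         1                          [user 5]
This would give me mutually exclusive groupings of all the linked account networks.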
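【Comments】:
It can be done with a recursive CTE. But this type of problem is not a good fit for SQL. We don't know where to start, so we have to keep track of all rows that are already part of a network. That doesn't scale well to large tables...
【Answer 1】: I hadn't heard of recursive CTEs until Erwin Brandstetter mentioned them in the comment above. The concept is just what it sounds like: a CTE that references itself and has a base case so that the recursion terminates. For your problem, a recursive CTE solution might look something like: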
WITH RECURSIVE accumulate_users (user_id, linked_accounts) AS (
    -- Base case: the direct links from a user_id.
    SELECT
        user_id,
        ARRAY_AGG(DISTINCT linked_id) AS linked_accounts
    FROM your_table
    GROUP BY user_id
    UNION ALL
    -- Recursive case: transitively linked accounts.
    -- Both branches must return the same columns, so user_id is carried along.
    SELECT
        accumulate_users.user_id,
        ARRAY_UNION(
            accumulate_users.linked_accounts,
            ARRAY_AGG(DISTINCT your_table.linked_id)
        ) AS linked_accounts
    FROM accumulate_users
    JOIN your_table ON CONTAINS(accumulate_users.linked_accounts, your_table.user_id)
    GROUP BY accumulate_users.user_id, accumulate_users.linked_accounts
    -- But there is no enforced termination condition: with UNION ALL each
    -- pass keeps emitting rows even after the arrays stop growing, so a
    -- depth limit or an "array grew" filter would still be needed. This is
    -- part of why implementing recursive CTEs is challenging, I think.
)
SELECT
    user_id,
    CARDINALITY(linked_accounts) AS count_linked_accounts,
    linked_accounts
FROM accumulate_users
However, I couldn't test this query because, as detailed in another Stack Overflow Q&A, Presto does not support recursive CTEs.
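For engines that do support recursive CTEs, the idea can be tried out locally. The following is a minimal sketch, not the query above: it assumes SQLite (through Python's built-in sqlite3 module) instead of Presto, loads the sample rows from the question, and replaces the array accumulation with plain (root, member) rows combined with UNION rather than UNION ALL, so already-seen rows are discarded and the recursion stops. The names reachable, labelled, build_demo_connection, and linked_groups are made up for illustration, and treating the smallest id in a network as its "initial user" is an assumption.

```python
import sqlite3

# Sample rows from the question: (user_id, linked_id), stored in both directions.
SAMPLE_LINKS = [
    ("user 1", "user 2"), ("user 2", "user 1"),
    ("user 2", "user 3"), ("user 3", "user 2"),
    ("user 4", "user 5"), ("user 5", "user 4"),
]

GROUPS_QUERY = """
WITH RECURSIVE reachable(root, member) AS (
    -- Base case: every user reaches itself.
    SELECT user_id, user_id FROM your_table
    UNION
    -- Recursive case: follow one more link. UNION (not UNION ALL) drops
    -- rows that were already produced, so the recursion terminates.
    SELECT r.root, t.linked_id
    FROM reachable AS r
    JOIN your_table AS t ON t.user_id = r.member
),
labelled AS (
    -- Assumption: the smallest id in a network is its "initial user".
    -- This relies on every link being stored in both directions.
    SELECT member, MIN(root) AS initial_user
    FROM reachable
    GROUP BY member
)
SELECT initial_user, member
FROM labelled
WHERE member <> initial_user
ORDER BY initial_user, member
"""


def build_demo_connection():
    """Create an in-memory database holding the question's sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE your_table (user_id TEXT, linked_id TEXT)")
    conn.executemany("INSERT INTO your_table VALUES (?, ?)", SAMPLE_LINKS)
    return conn


def linked_groups(conn):
    """Map each initial user to the sorted list of accounts linked to it."""
    groups = {}
    for initial_user, member in conn.execute(GROUPS_QUERY):
        groups.setdefault(initial_user, []).append(member)
    return groups


if __name__ == "__main__":
    for initial_user, members in linked_groups(build_demo_connection()).items():
        print(initial_user, len(members), members)
```

The row-per-pair shape is what makes UNION usable as the termination condition. The comment's warning still applies: the reachable set holds one row per (root, member) pair, which grows quickly on large networks.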
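You can traverse an arbitrary but finite number of links by repeatedly joining back to the table you have. Something like this, where I've included first_, second_, and third_degree_links for clarity:
-- Column names follow the question (user_id, linked_id).
-- Because every link is stored in both directions, the arrays can also
-- contain yt1.user_id itself (for example user 1 -> user 2 -> user 1).
SELECT
    yt1.user_id,
    ARRAY_AGG(DISTINCT yt2.user_id) AS first_degree_links,
    ARRAY_AGG(DISTINCT yt3.user_id) AS second_degree_links,
    ARRAY_AGG(DISTINCT yt3.linked_id) AS third_degree_links,
    ARRAY_UNION(
        ARRAY_AGG(DISTINCT yt2.user_id),
        ARRAY_UNION(ARRAY_AGG(DISTINCT yt3.user_id), ARRAY_AGG(DISTINCT yt3.linked_id))
    ) AS up_to_third_degree_links
FROM your_table AS yt1
JOIN your_table AS yt2 ON yt1.linked_id = yt2.user_id
JOIN your_table AS yt3 ON yt2.linked_id = yt3.user_id
GROUP BY yt1.user_id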
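The limit of the fixed-join approach is that any chain longer than the number of joins is cut off. Here is a small sketch of that behaviour, written in Python rather than SQL purely for illustration. The function name links_within and the five-user chain are hypothetical, and the function removes the starting user from its result, which the join query does not do.

```python
from collections import defaultdict


def links_within(pairs, start, max_depth):
    """Return the accounts reachable from `start` in at most `max_depth` hops."""
    neighbours = defaultdict(set)
    for user_id, linked_id in pairs:
        neighbours[user_id].add(linked_id)

    seen = {start}
    frontier = {start}
    for _ in range(max_depth):
        # One more hop: everything adjacent to the frontier that is new.
        frontier = {n for u in frontier for n in neighbours[u]} - seen
        if not frontier:
            break
        seen |= frontier
    return seen - {start}


# Hypothetical chain 1 - 2 - 3 - 4 - 5, stored in both directions.
chain = []
for a, b in [(1, 2), (2, 3), (3, 4), (4, 5)]:
    chain += [(a, b), (b, a)]

if __name__ == "__main__":
    print(links_within(chain, 1, 3))  # three hops: user 5 is missed
    print(links_within(chain, 1, 4))  # four hops: the whole chain
```

With three hops, user 5 is missing from user 1's network, so the depth has to be chosen at least as large as the longest chain expected in the data.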
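I've been working with a similar set of data, although I have the raw identifiers as part of the original dataset; in other words, the "email" and "phone number" from your example. I found it helpful to create a table that groups user ids by these connecting identifiers: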
CREATE TABLE email_connections AS
SELECT
    email,
    ARRAY_AGG(DISTINCT user_id) AS users
FROM source_table
GROUP BY email
The same arbitrary-but-finite-depth set of links can then be computed by finding intersections between the user arrays:
-- FLATTEN concatenates the per-row arrays, so ARRAY_DISTINCT is applied to
-- drop users that appear in more than one of them before counting.
SELECT
    3764350 AS user_id,
    ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users)))))) AS all_users,
    CARDINALITY(ARRAY_DISTINCT(FLATTEN(ARRAY_AGG(ARRAY_UNION(emails1.users, ARRAY_UNION(emails2.users, ARRAY_UNION(emails3.users, emails4.users))))))) AS count_all_users
FROM email_connections AS emails1
JOIN email_connections AS emails2 ON CARDINALITY(ARRAY_INTERSECT(emails1.users, emails2.users)) > 0
JOIN email_connections AS emails3 ON CARDINALITY(ARRAY_INTERSECT(emails2.users, emails3.users)) > 0
JOIN email_connections AS emails4 ON CARDINALITY(ARRAY_INTERSECT(emails3.users, emails4.users)) > 0
WHERE CONTAINS(emails1.users, 3764350)
GROUP BY 1
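Computing links to arbitrary depth is a good use case for graph database technologies such as Neo4j or JanusGraph. That is how I'm tackling the "user linking" problem now.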
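If the pairs can be exported from the database, arbitrary-depth grouping can also be done in application code. The following is a minimal sketch using a union-find (disjoint-set) structure. This technique is swapped in here for illustration and is not something the answer uses. The function name group_accounts is hypothetical, and choosing the smallest id as each network's representative is an assumption made to match the output requested in the question.

```python
def group_accounts(pairs):
    """Group accounts into mutually exclusive networks.

    `pairs` is an iterable of (user_id, linked_id) tuples. The result maps
    the smallest id of each network to the sorted list of its other members.
    """
    parent = {}

    def find(x):
        # Register unseen accounts as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative of the merged network.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    networks = {}
    for account in parent:
        networks.setdefault(find(account), []).append(account)
    return {
        root: sorted(m for m in members if m != root)
        for root, members in networks.items()
    }


# Sample rows from the question, as (user_id, linked_id).
sample_pairs = [
    ("user 1", "user 2"), ("user 2", "user 1"),
    ("user 2", "user 3"), ("user 3", "user 2"),
    ("user 4", "user 5"), ("user 5", "user 4"),
]

if __name__ == "__main__":
    for initial_user, linked in group_accounts(sample_pairs).items():
        print(initial_user, len(linked), linked)
```

Each pair is processed once and the depth of a chain does not matter, so the reciprocal "duplicate" rows are harmless here. The trade-off is that the whole set of pairs has to fit in the memory of the process doing the grouping.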