SQL Find Reciprocal Relationship
【Title】: SQL Find Reciprocal Relationship
【Posted】: 2019-03-04 19:49:00
【Question】: I am trying to use the Stack Exchange Data Explorer (SEDE) to find cases where two different users on Stack Overflow have each accepted an answer from the other. For example:
Post A Id: 1, OwnerUserId: "user1", AcceptedAnswerId: "user2"
and
Post B Id: 2, OwnerUserId: "user2", AcceptedAnswerId: "user1"
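I currently have a query that finds pairs of users who have collaborated as asker and answerer on more than one question, but it does not determine whether the relationship is reciprocal: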
SELECT user1.Id AS User_1, user2.Id AS User_2
FROM Posts p
INNER JOIN Users user1 ON p.OwnerUserId = user1.Id
INNER JOIN Posts p2 ON p.AcceptedAnswerId = p2.Id
INNER JOIN Users user2 ON p2.OwnerUserId = user2.Id
WHERE p.OwnerUserId <> p2.OwnerUserId
AND p.OwnerUserId IS NOT NULL
AND p2.OwnerUserId IS NOT NULL
AND user1.Id <> user2.Id
GROUP BY user1.Id, user2.Id HAVING COUNT(*) > 1;
For anyone unfamiliar with the schema, the two tables look like this:
Posts
--------------------------------------
Id int
PostTypeId tinyint
AcceptedAnswerId int
ParentId int
CreationDate datetime
DeletionDate datetime
Score int
ViewCount int
Body nvarchar (max)
OwnerUserId int
OwnerDisplayName nvarchar (40)
LastEditorUserId int
LastEditorDisplayName nvarchar (40)
LastEditDate datetime
LastActivityDate datetime
Title nvarchar (250)
Tags nvarchar (250)
AnswerCount int
CommentCount int
FavoriteCount int
ClosedDate datetime
CommunityOwnedDate datetime
and
Users
--------------------------------------
Id int
Reputation int
CreationDate datetime
DisplayName nvarchar (40)
LastAccessDate datetime
WebsiteUrl nvarchar (200)
Location nvarchar (100)
AboutMe nvarchar (max)
Views int
UpVotes int
DownVotes int
ProfileImageUrl nvarchar (200)
EmailHash varchar (32)
AccountId int
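【Comments】:
Someone has systematically gone through my profile and downvoted every question and answer I have contributed on ***. The moderators also seem content to let it happen. So I would ask any subsequent visitors to this post to please vote in good faith and leave a comment explaining why you did or did not find the post useful. I don't want to get drawn into some pathetic tit-for-tat; I'm just trying to help the next person. Thanks!
【Answer 1】: The query in its simplest form (so that it does not time out when querying 16 million questions) is: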
WITH accepter_acceptee(a, b) AS (
SELECT q.OwnerUserId, a.OwnerUserId
FROM Posts AS q
INNER JOIN Posts AS a ON q.AcceptedAnswerId = a.Id
WHERE q.PostTypeId = 1 AND q.OwnerUserId <> a.OwnerUserId
), collaborations(a, b, type) AS (
SELECT a, b, 'a accepter b' FROM accepter_acceptee
UNION ALL
SELECT b, a, 'a acceptee b' FROM accepter_acceptee
)
SELECT a, b, COUNT(*) AS [collaboration count]
FROM collaborations
GROUP BY a, b
HAVING COUNT(DISTINCT type) = 2
ORDER BY a, b
Results: [screenshot of the query output]
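To see why HAVING COUNT(DISTINCT type) = 2 isolates reciprocal pairs, here is a minimal sketch on made-up data. The table variable @pairs and its four rows are hypothetical stand-ins for the accepter_acceptee CTE; nothing here comes from the real SEDE data.

```sql
-- Hypothetical toy data: each row is (asker, accepted answerer).
DECLARE @pairs TABLE (a int, b int);
INSERT INTO @pairs VALUES
  (1, 2),  -- user 1 accepted an answer from user 2
  (2, 1),  -- user 2 accepted an answer from user 1 (reciprocal)
  (2, 3),  -- user 2 accepted an answer from user 3 (one-way)
  (1, 2);  -- user 1 accepted another answer from user 2

WITH collaborations(a, b, type) AS (
  -- Every pair as seen from the asker's side...
  SELECT a, b, 'a accepter b' FROM @pairs
  UNION ALL
  -- ...and the same pair flipped, as seen from the answerer's side.
  SELECT b, a, 'a acceptee b' FROM @pairs
)
SELECT a, b, COUNT(*) AS [collaboration count]
FROM collaborations
GROUP BY a, b
-- Both labels appear in a group only if acceptance went in both directions.
HAVING COUNT(DISTINCT type) = 2
ORDER BY a, b;
```

On this toy data the query returns two rows, (1, 2, 3) and (2, 1, 3); the one-way pair (2, 3) is dropped. Each reciprocal pair shows up once per ordering, so adding WHERE a < b before the GROUP BY would be one way to keep a single row per pair.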
Original Revision
【Comments】:
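【Answer 2】: This uses the technique from Salman A's answer, with improved sorting and a few more useful columns.
Combined with the query in my other answer, it shows some interesting relationships.
See it in SEDE.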
WITH QandA_users AS (
SELECT q.OwnerUserId AS userQ
, a.OwnerUserId AS userA
FROM Posts q
INNER JOIN Posts a ON q.AcceptedAnswerId = a.Id
WHERE q.PostTypeId = 1
),
pairsUnion (user1, user2, whoAnswered) AS (
SELECT userQ, userA, 'usr 2 answered'
FROM QandA_users
WHERE userQ <> userA
UNION ALL
SELECT userA, userQ, 'usr 1 answered'
FROM QandA_users
WHERE userQ <> userA
),
collaborators AS (
SELECT user1, user2, COUNT(*) AS [Reciprocations]
FROM pairsUnion
GROUP BY user1, user2
HAVING COUNT (DISTINCT whoAnswered) > 1
)
SELECT
'site://u/' + CAST(c.user1 AS NVARCHAR) + '|Usr ' + u1.DisplayName AS [User 1]
, 'site://u/' + CAST(c.user2 AS NVARCHAR) + '|Usr ' + u2.DisplayName AS [User 2]
, c.Reciprocations AS [Reciprocal Accptd posts]
, (SELECT COUNT(*) FROM QandA_users qau WHERE qau.userQ = c.user1) AS [Usr 1 Qstns wt Accptd]
, (SELECT COUNT(*) FROM QandA_users qau WHERE qau.userQ = c.user1 AND qau.userA = c.user2) AS [Accptd Ansr by Usr 2]
, (SELECT COUNT(*) FROM QandA_users qau WHERE qau.userA = c.user2) AS [Usr 2 Ttl Accptd Answrs]
FROM collaborators c
INNER JOIN Users u1 ON u1.Id = c.user1
INNER JOIN Users u2 ON u2.Id = c.user2
ORDER BY c.Reciprocations DESC
, u1.DisplayName
, u2.DisplayName
The results look like this: [screenshot of the query output]
【Comments】:
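【Answer 3】: Edited to add: Oops. I misread the question; the OP wants accepted answers, whereas the query below finds any reciprocal answers. (It is easy to modify, but I was more interested in the latter.)
Because the dataset is very large (and the query must not time out SEDE), I chose to restrict the set as much as possible and build from there.
So this query:
- Returns rows only if a reciprocal relationship exists.
- Returns all such question/answer pairs.
- Excludes self-answers.
- Uses SEDE's query parameters and magic columns for usability.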
See it live in SEDE.
-- UserA: Enter ID of user A
-- UserB: Enter ID of user B
WITH possibleAnswers AS (
SELECT
a.Id AS aId
, a.ParentId AS qId
, a.OwnerUserId
, a.CreationDate
FROM Posts a
WHERE a.PostTypeId = 2 -- answers
AND a.OwnerUserId IN (##UserA:INT##, ##UserB:INT##)
),
possibleQuestions AS (
SELECT
q.Id AS qId
, q.OwnerUserId
, q.Tags
FROM Posts q
INNER JOIN possibleAnswers pa ON q.Id = pa.qId
WHERE q.PostTypeId = 1 -- questions
AND q.OwnerUserId IN (##UserA:INT##, ##UserB:INT##)
AND q.OwnerUserId != pa.OwnerUserId -- No self answers
)
SELECT
pa.OwnerUserId AS [User Link]
, 'answers' AS [Action]
, pq.OwnerUserId AS [User Link]
, pa.CreationDate AS [at]
, pq.qId AS [Post Link]
, pq.Tags
FROM possibleQuestions pq
INNER JOIN possibleAnswers pa ON pq.qId = pa.qId
WHERE pq.OwnerUserId = ##UserB:INT##
AND EXISTS (SELECT * FROM possibleQuestions pq2 WHERE pq2.OwnerUserId = ##UserA:INT##)
UNION ALL SELECT
pa.OwnerUserId AS [User Link]
, 'answers' AS [Action]
, pq.OwnerUserId AS [User Link]
, pa.CreationDate AS [at]
, pq.qId AS [Post Link]
, pq.Tags
FROM possibleQuestions pq
INNER JOIN possibleAnswers pa ON pq.qId = pa.qId
WHERE pq.OwnerUserId = ##UserA:INT##
AND EXISTS (SELECT * FROM possibleQuestions pq2 WHERE pq2.OwnerUserId = ##UserB:INT##)
ORDER BY pa.CreationDate
It produces results like this: [screenshot of the query output]
For a list of all such user pairs, see this SEDE query.
【Comments】:
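The note at the top of this answer says that restricting the query to accepted answers is an easy modification. The sketch below shows one way that might look. It is not the answerer's code: the @Posts table variable is a hypothetical miniature of the SEDE Posts table, and @UserA / @UserB are ordinary variables standing in for the ##UserA:INT## and ##UserB:INT## parameters so that the block runs outside SEDE.

```sql
-- Hypothetical miniature of Posts, holding only the columns used here.
DECLARE @Posts TABLE (Id int, PostTypeId tinyint, ParentId int,
                      AcceptedAnswerId int, OwnerUserId int);
INSERT INTO @Posts VALUES
  (10, 1, NULL, 11,   1),  -- question by user 1, accepted answer 11
  (11, 2, 10,   NULL, 2),  -- answer by user 2
  (20, 1, NULL, 21,   2),  -- question by user 2, accepted answer 21
  (21, 2, 20,   NULL, 1),  -- answer by user 1
  (30, 1, NULL, NULL, 2),  -- question by user 2, nothing accepted
  (31, 2, 30,   NULL, 1);  -- answer by user 1, not accepted

DECLARE @UserA int = 1, @UserB int = 2;  -- stand-ins for the SEDE parameters

WITH acceptedPairs AS (
  SELECT q.Id AS qId, q.OwnerUserId AS asker, a.OwnerUserId AS answerer
  FROM @Posts q
  INNER JOIN @Posts a ON a.Id = q.AcceptedAnswerId  -- accepted answers only
  WHERE q.PostTypeId = 1
    AND q.OwnerUserId IN (@UserA, @UserB)
    AND a.OwnerUserId IN (@UserA, @UserB)
    AND q.OwnerUserId <> a.OwnerUserId              -- no self-answers
)
SELECT answerer, 'answers' AS [Action], asker, qId
FROM acceptedPairs
-- Return rows only when acceptance happened in both directions.
WHERE EXISTS (SELECT * FROM acceptedPairs WHERE asker = @UserA)
  AND EXISTS (SELECT * FROM acceptedPairs WHERE asker = @UserB)
ORDER BY qId;
```

With this toy data the query returns questions 10 and 20 and skips question 30, whose answer was never accepted. The join on AcceptedAnswerId replaces the ParentId match of the original, which is the only substantive change.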
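【Answer 4】: One CTE and a simple inner join will do the job. It does not need as much code as I have seen in the other answers. Note the many comments in my code.
Link to the StackExchange Data Explorer query with saved sample results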
with questions as ( -- this is needed so that we have ids of users asking and answering
select
p1.owneruserid as question_userid
, p2.owneruserid as answer_userid
--, p1.id -- to view sample ids
from posts p1
inner join posts p2 on -- to fetch answer post
p1.acceptedanswerid = p2.id
)
select distinct -- unique pairs
q1.question_userid as userid1
, q1.answer_userid as userid2
--, q1.id, q2.id -- to view sample ids
from questions q1
inner join questions q2 on
q1.question_userid = q2.answer_userid -- accepted answer from someone
and q1.answer_userid = q2.question_userid -- who also accepted our answer
and q1.question_userid <> q1.answer_userid -- and we aren't self-accepting
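Here is a pair of posts as an example:
Can I run rubygems in ironruby?, asked by Xian, who accepted an answer from Orion Edwards
Will the Garbage Collector call IDisposable.Dispose for me?, asked by Orion Edwards, who accepted an answer from Xian
However, because of the size of the dataset and the distinct part, StackExchange may time out on you. If you want to see some data, remove distinct and add top N at the beginning: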
with questions as (
...
)
select top 3 ...
【Comments】:
【Answer 5】: Here is how I would do it. Some simplified data to start with:
if object_id('tempdb.dbo.#Posts') is not null drop table #Posts
create table #Posts
(
PostId char(1),
OwnerUserId int,
AcceptedAnswerUserId int
)
insert into #Posts
values
('A', 1, 2),
('B', 2, 1),
('C', 2, 3),
('D', 2, 4),
('E', 3, 1),
('F', 4, 1)
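For our purposes we don't really care about PostId; our starting point is the set of ordered pairs of post owner (OwnerUserId) and accepted answerer (AcceptedAnswerUserId).
(Although it isn't necessary, you can visualize that set like this.)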
select distinct OwnerUserId, AcceptedAnswerUserId
from #Posts
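Now we want to find all entries in that set for which the two fields appear reversed, i.e. where the owner of one post is the accepted answerer of another and vice versa. So if one pair is (1, 2), we want to find (2, 1).
I do this with a left join so that you can see the rows it leaves out, but changing it to an inner join restricts the result to the set you described. You can harvest the information however you like (by picking either table's columns, or by returning both columns from one table if you want them on one row).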
select
u1.OwnerUserId,
u1.AcceptedAnswerUserId,
u2.OwnerUserId,
u2.AcceptedAnswerUserId
from #Posts u1
left outer join #Posts u2
on u1.AcceptedAnswerUserId = u2.OwnerUserId
and u1.OwnerUserId = u2.AcceptedAnswerUserId
Edit: If you want to exclude self-answers, just add and u1.AcceptedAnswerUserId != u1.OwnerUserId to the ON clause.
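Putting the inner-join variant and the self-answer exclusion together on the sample data might look like the sketch below. It uses a table variable instead of #Posts so it runs on its own; row G (a self-accepted post) is a hypothetical addition, and the < comparison is a swapped-in variation on the != condition from the edit, chosen because it also lists each pair only once.

```sql
-- Same sample rows as above, plus one hypothetical self-accepted post (G).
DECLARE @Posts TABLE (PostId char(1), OwnerUserId int, AcceptedAnswerUserId int);
INSERT INTO @Posts VALUES
  ('A', 1, 2), ('B', 2, 1), ('C', 2, 3),
  ('D', 2, 4), ('E', 3, 1), ('F', 4, 1),
  ('G', 5, 5);

SELECT DISTINCT
  u1.OwnerUserId          AS user1,
  u1.AcceptedAnswerUserId AS user2
FROM @Posts u1
INNER JOIN @Posts u2
   ON u1.AcceptedAnswerUserId = u2.OwnerUserId       -- the answerer asked a post...
  AND u1.OwnerUserId = u2.AcceptedAnswerUserId       -- ...whose accepted answer is ours
  AND u1.OwnerUserId < u1.AcceptedAnswerUserId;      -- drops self-answers and mirrored duplicates
```

On this data the only row returned is (1, 2). With != in place of <, the mirrored row (2, 1) would come back as well.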
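Personally, I have always found it interesting that SQL and relational algebra are so deeply rooted in set theory, yet set-based operations like this tend to feel very clunky in SQL. That is mainly because, to preserve the absence of order, you have to represent set members in a single column, but to compare set members in SQL you need to represent them as separate columns.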
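For what it is worth, the set operators in T-SQL can express the same idea fairly directly for the two-user case. The sketch below is an alternative formulation rather than part of this answer's method: it intersects the set of (owner, answerer) pairs with the set of reversed pairs. The table variable simply repeats the sample rows so the block is self-contained.

```sql
-- Same six sample rows as in the answer, in a table variable.
DECLARE @Posts TABLE (PostId char(1), OwnerUserId int, AcceptedAnswerUserId int);
INSERT INTO @Posts VALUES
  ('A', 1, 2), ('B', 2, 1), ('C', 2, 3),
  ('D', 2, 4), ('E', 3, 1), ('F', 4, 1);

-- Pairs that also exist with their two fields swapped.
SELECT OwnerUserId, AcceptedAnswerUserId FROM @Posts
INTERSECT
SELECT AcceptedAnswerUserId, OwnerUserId FROM @Posts
ORDER BY OwnerUserId;
```

This returns (1, 2) and (2, 1). INTERSECT already removes duplicates, but a self-accepted post would pass through, so a WHERE OwnerUserId <> AcceptedAnswerUserId filter on each side would be needed on real data.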
Now, as food for thought: how would you extend this to triads of users who commented on the same post?
【Comments】: