How to write a SQL query for multiple Inner Join?

【Title】: How to write a SQL query for multiple Inner Join? 【Posted】: 2019-12-05 17:08:17 【Question】:

Sample record:

    Row(user_id='KxGeqg5ccByhaZfQRI4Nnw', gender='male', year='2015', month='September', day='20',
        hour='16', weekday='Sunday', reviewClass='place love back', business_id='S75Lf-Q3bCCckQ3w7mSN2g',
        business_name='Notorious Burgers', city='Scottsdale',
        categories='Nightlife, American (New), Burgers, Comfort Food, Cocktail Bars, Restaurants, Food, Bars, American (Traditional)',
        user_funny='1', review_sentiment='Positive', friend_id='my4q3Sy6Ei45V58N2l8VGw')

This table has more than 100 million records. My SQL query is supposed to do the following:

Select the most frequent review_sentiment and the most frequent gender among the friends (friend_id) of a particular user who visited a specific business.

Every friend_id is itself a user_id.

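Example scenario:

- A user has visited 4 businesses.
- The user has 10 friends.
- 5 of those friends visited businesses 1 and 2, the other 5 visited only business 3, and nobody visited business 4.
- For business 1, the 5 friends' sentiment is more positive than negative; for business 2 it is more negative than positive; for business 3 it is all negative.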

I want the following output:

**user_id | business_id | friend_common_sentiment | mostCommonGender | .... otherCols**

user_id_1 | business_id_1 | positive | male | .... otherCols
user_id_1 | business_id_2 | negative | female | .... otherCols
user_id_1 | business_id_3 | negative | female | .... otherCols

Here is a simple query I wrote for this in pyspark:

    SELECT user_id, gender, year, month, day, hour, weekday, reviewClass, business_id, business_name, city,
    categories, user_funny, review_sentiment FROM events1 GROUP BY user_id, friend_id, business_id ORDER BY
    COUNT(review_sentiment) DESC LIMIT 1

This query doesn't give the expected result, and I'm not sure how to work an INNER JOIN into it.

【Comments】:

【Answer 1】:

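This data structure really does make things hard, but let's break it down into a few steps:

    1. Self-join the table to get each friend's data.
    2. Once you have the friends' data, run aggregate functions to count each possible value, grouped by user and business.
    3. Wrap the above in a subquery so you can choose between the values based on the counts.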

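I'm just going to call your table "tags", so the join looks like the one below. Sadly, just like in real life, we can't assume that everyone has friends, and since you didn't say to exclude the forever-alone crowd, we need a left join to keep the users who have no friends.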

From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
    and friends.business_id = user.business_id
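To see what the left join buys you, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Spark SQL. The three-column tags table and its rows are made up for illustration, and I shortened the aliases to u and f because user is a reserved word in some SQL dialects.

```python
import sqlite3

# In-memory toy table; the real table has many more columns and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (user_id TEXT, business_id TEXT, friend_id TEXT)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?)",
    [
        ("u1", "b1", "f1"),  # u1 reviewed b1 and lists f1 as a friend
        ("f1", "b1", None),  # f1 also reviewed b1 (no friends listed)
        ("u2", "b1", None),  # u2 reviewed b1 but has no friends at all
    ],
)

# Self-join: match each row's friend_id to that friend's own row
# for the same business. LEFT OUTER JOIN keeps unmatched users.
rows = conn.execute(
    """
    SELECT u.user_id, u.business_id, f.user_id AS friend_user_id
    FROM tags AS u
    LEFT OUTER JOIN tags AS f
      ON u.friend_id = f.user_id
     AND f.business_id = u.business_id
    ORDER BY u.user_id
    """
).fetchall()

for row in rows:
    print(row)
```

Users without a matching friend still show up, with NULL (None in Python) in the friend column; an inner join would have dropped them.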

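Next you have to figure out what the most common gender/review sentiment is for a given user and business combination. This is where the data structure really bites us. We could do it in one step with some clever window functions, but I want this answer to be easy to follow, so I'll use a subquery and case statements. For simplicity I'm assuming binary gender, but depending on what your application needs to support you can follow the same pattern for additional values.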

-- the sample record stores gender in lower case ('male'), so compare case-insensitively
select user.user_id, user.business_id
, sum(case when lower(friends.gender) = 'male' then 1 else 0 end) as MaleFriends
, sum(case when lower(friends.gender) = 'female' then 1 else 0 end) as FemaleFriends
, sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
, sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
From tags as user
left outer join tags as friends on user.friend_id = friends.user_id
  and friends.business_id = user.business_id
where user.business_id = <<your business id here>>
group by user.user_id, user.business_id
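If you want to try the counting step on something small first, the sketch below runs the same pattern against an in-memory SQLite table from Python. The rows, the business id "b1", and the aliases u and f are assumptions for illustration only; the column names follow the sample record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (user_id TEXT, gender TEXT, business_id TEXT, "
    "review_sentiment TEXT, friend_id TEXT)"
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?, ?, ?)",
    [
        # u1 reviewed b1 and has three friends: one row per friend
        ("u1", "male", "b1", "Positive", "f1"),
        ("u1", "male", "b1", "Positive", "f2"),
        ("u1", "male", "b1", "Positive", "f3"),
        # the friends' own reviews of b1
        ("f1", "female", "b1", "Positive", None),
        ("f2", "female", "b1", "Negative", None),
        ("f3", "male", "b1", "Positive", None),
    ],
)

# Conditional aggregation: each SUM(CASE ...) counts one possible value.
query = """
SELECT u.user_id, u.business_id,
       SUM(CASE WHEN LOWER(f.gender) = 'male' THEN 1 ELSE 0 END) AS MaleFriends,
       SUM(CASE WHEN LOWER(f.gender) = 'female' THEN 1 ELSE 0 END) AS FemaleFriends,
       SUM(CASE WHEN f.review_sentiment = 'Positive' THEN 1 ELSE 0 END) AS FriendsPositive,
       SUM(CASE WHEN f.review_sentiment = 'Negative' THEN 1 ELSE 0 END) AS FriendsNegative
FROM tags AS u
LEFT OUTER JOIN tags AS f
  ON u.friend_id = f.user_id
 AND f.business_id = u.business_id
WHERE u.business_id = ?
GROUP BY u.user_id, u.business_id
"""

# Map user_id -> (MaleFriends, FemaleFriends, FriendsPositive, FriendsNegative)
counts = {row[0]: row[2:] for row in conn.execute(query, ("b1",))}
print(counts["u1"])
```

For u1 this gives one male and two female friends, with two positive and one negative review. The friends themselves, who list no friends here, come back with all-zero counts thanks to the left join.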

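Now we just need to take the data from the subquery and make some decisions. You may want to add a few extra branches, for example for the case where there are no friends, or where gender/sentiment is split evenly among the friends. Those follow the same pattern as below, just with more values to choose from.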

select user_id
, business_id
, case when MaleFriends > FemaleFriends then 'Male' else 'Female' end as MostCommonGender
, case when FriendsPositive > FriendsNegative then 'Positive' else 'Negative' end as MostCommonSentiment
from (
    -- the sample record stores gender in lower case ('male'), so compare case-insensitively
    select user.user_id, user.business_id
    , sum(case when lower(friends.gender) = 'male' then 1 else 0 end) as MaleFriends
    , sum(case when lower(friends.gender) = 'female' then 1 else 0 end) as FemaleFriends
    , sum(case when friends.review_sentiment = 'Positive' then 1 else 0 end) as FriendsPositive
    , sum(case when friends.review_sentiment = 'Negative' then 1 else 0 end) as FriendsNegative
    From tags as user
    left outer join tags as friends on user.friend_id = friends.user_id
      and friends.business_id = user.business_id
    where user.business_id = <<your business id here>>
    group by user.user_id, user.business_id) as a
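Here is a rough end-to-end sketch of the whole thing with the extra branches for "no data" and ties, again using Python's sqlite3 on made-up rows. Two things in it are my own assumptions and not part of the query above: the labels 'Unknown' and 'Tie', and the SELECT DISTINCT on the friends side. If, as the sample record suggests, the table holds one row per (review, friend) pair, a friend who has several friends of their own shows up several times and would be counted several times; joining against a de-duplicated view of the friends avoids that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (user_id TEXT, gender TEXT, business_id TEXT, "
    "review_sentiment TEXT, friend_id TEXT)"
)
conn.executemany(
    "INSERT INTO tags VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", "male", "b1", "Positive", "f1"),
        ("u1", "male", "b1", "Positive", "f2"),
        ("u1", "male", "b1", "Positive", "f3"),
        ("u1", "male", "b2", "Negative", "f1"),
        ("u1", "male", "b2", "Negative", "f2"),
        ("u1", "male", "b2", "Negative", "f3"),
        ("f1", "female", "b1", "Positive", "u1"),
        ("f2", "female", "b1", "Positive", "u1"),
        # f3 has three friends, so the same review appears three times
        ("f3", "male", "b1", "Negative", "u1"),
        ("f3", "male", "b1", "Negative", "x1"),
        ("f3", "male", "b1", "Negative", "x2"),
        ("f1", "female", "b2", "Positive", "u1"),
        ("f3", "male", "b2", "Negative", "u1"),
        ("u3", "female", "b1", "Positive", None),  # no friends
    ],
)

query = """
SELECT user_id, business_id,
       CASE WHEN MaleFriends + FemaleFriends = 0 THEN 'Unknown'
            WHEN MaleFriends > FemaleFriends THEN 'Male'
            WHEN FemaleFriends > MaleFriends THEN 'Female'
            ELSE 'Tie' END AS MostCommonGender,
       CASE WHEN FriendsPositive + FriendsNegative = 0 THEN 'Unknown'
            WHEN FriendsPositive > FriendsNegative THEN 'Positive'
            WHEN FriendsNegative > FriendsPositive THEN 'Negative'
            ELSE 'Tie' END AS MostCommonSentiment
FROM (
    SELECT u.user_id, u.business_id,
           SUM(CASE WHEN LOWER(f.gender) = 'male' THEN 1 ELSE 0 END) AS MaleFriends,
           SUM(CASE WHEN LOWER(f.gender) = 'female' THEN 1 ELSE 0 END) AS FemaleFriends,
           SUM(CASE WHEN f.review_sentiment = 'Positive' THEN 1 ELSE 0 END) AS FriendsPositive,
           SUM(CASE WHEN f.review_sentiment = 'Negative' THEN 1 ELSE 0 END) AS FriendsNegative
    FROM tags AS u
    LEFT OUTER JOIN (
        -- one row per friend review, however many friends that friend has
        SELECT DISTINCT user_id, business_id, gender, review_sentiment
        FROM tags
    ) AS f
      ON u.friend_id = f.user_id
     AND f.business_id = u.business_id
    GROUP BY u.user_id, u.business_id
) AS a
"""

# Map (user_id, business_id) -> (MostCommonGender, MostCommonSentiment)
result = {(r[0], r[1]): (r[2], r[3]) for r in conn.execute(query)}
for key in sorted(result):
    print(key, result[key])
```

In this toy data, joining against the raw table instead of the DISTINCT subquery would count f3 three times for business b1 and flip u1's answer to 'Male' / 'Negative', so it is worth checking how your real rows are laid out before trusting the counts.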

That gives you the steps to follow and, hopefully, a clear explanation of how they work. Good luck!

【Discussion】:
