Printing the list of new Redditors for a given month from the BigQuery reddit corpus
【Posted】: 2016-09-20 02:19:39
【Question】: For each month of 2010, I want to print the list of redditors who had neither posted NOR commented in the preceding 12 months. For this, I am using the reddit comments/posts corpus on BigQuery.
This is what I did to get the new users for January 2010:
SELECT author FROM [fh-bigquery:reddit_comments.2010]
WHERE created_utc <= 1264982399
AND author NOT IN (SELECT author FROM [fh-bigquery:reddit_comments.2009])
AND author NOT IN (SELECT author FROM [fh-bigquery:reddit_posts.full_corpus_201512] WHERE created_utc < 1420070400)
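When I run this on BigQuery, I get a "Resources exceeded" error. If I remove the SELECT statement on [fh-bigquery:reddit_posts.full_corpus_201512], the query runs fine.
So, at the moment, I can only get the users who did not comment in the preceding 12 months, but they may still have posted during that time. What I want is the list of users whose "first" activity on reddit was commenting.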
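The two numeric literals in the query are Unix timestamps, and decoding them helps when adapting the query to other months. The following is a minimal sketch using only the Python standard library; the helper name `month_bounds_utc` is made up for illustration. It shows that 1264982399 is the last second of January 2010 (UTC), while 1420070400 decodes to 2015-01-01, so the posts filter as written covers everything before 2015. If the intent was "posts before 2010", the corresponding cutoff would be 1262304000.

```python
from datetime import datetime, timezone


def month_bounds_utc(year, month):
    """Return (first_second, last_second) of a month as Unix timestamps (UTC)."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # First instant of the following month (roll over to January after December).
    if month == 12:
        next_start = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
    else:
        next_start = datetime(year, month + 1, 1, tzinfo=timezone.utc)
    return int(start.timestamp()), int(next_start.timestamp()) - 1


jan_start, jan_end = month_bounds_utc(2010, 1)
print(jan_start, jan_end)  # 1262304000 1264982399

# Decode the cutoff used in the posts subquery of the question.
print(datetime.fromtimestamp(1420070400, tz=timezone.utc))  # 2015-01-01 00:00:00+00:00
```

Calling `month_bounds_utc(2010, m)` for `m` from 1 to 12 gives the `created_utc` bounds for each month of 2010.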
【Answer 1】: It works if you remove the duplicates:
SELECT author
FROM [fh-bigquery:reddit_comments.2010]
WHERE created_utc <= 1264982399
AND author NOT IN (
SELECT author
FROM [fh-bigquery:reddit_comments.2009]
GROUP BY 1
)
AND author NOT IN (
SELECT author
FROM [fh-bigquery:reddit_posts.full_corpus_201512]
WHERE created_utc < 1420070400
GROUP BY 1
)
GROUP BY 1
【Comments】:
Exactly what I needed! Thank you very much!
【Answer 2】: (Piggybacking on Felipe's answer) The equivalent query is a bit simpler in standard SQL, using SELECT DISTINCT:
SELECT DISTINCT author
FROM `fh-bigquery.reddit_comments.2010`
WHERE created_utc <= 1264982399
AND author NOT IN (
SELECT DISTINCT author
FROM `fh-bigquery.reddit_comments.2009`
)
AND author NOT IN (
SELECT DISTINCT author
FROM `fh-bigquery.reddit_posts.full_corpus_201512`
WHERE created_utc < 1420070400
);
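To check that de-duplicating the subqueries does not change which authors are returned, the logic can be tried on toy data. This is only a sketch: it uses Python's built-in `sqlite3` module rather than BigQuery, and the table names and rows are invented for illustration. It says nothing about BigQuery's resource limits; it only shows that the `NOT IN` filter gives the same authors with or without `DISTINCT` in the subqueries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments_2010 (author TEXT, created_utc INTEGER);
    CREATE TABLE comments_2009 (author TEXT);
    CREATE TABLE posts (author TEXT, created_utc INTEGER);

    -- alice: first seen in January 2010 (two comments, so she needs de-duplicating)
    -- bob:   also commented in 2009
    -- carol: posted in 2009
    -- dave:  commented only after January 2010
    INSERT INTO comments_2010 VALUES
        ('alice', 1262304100), ('alice', 1262304200),
        ('bob',   1262305000), ('carol', 1262306000),
        ('dave',  1265000000);
    INSERT INTO comments_2009 VALUES ('bob'), ('bob');
    INSERT INTO posts VALUES ('carol', 1250000000), ('carol', 1251000000);
""")

# Same shape as the standard SQL query above; {d} is either DISTINCT or empty.
QUERY = """
    SELECT DISTINCT author
    FROM comments_2010
    WHERE created_utc <= 1264982399
      AND author NOT IN (SELECT {d} author FROM comments_2009)
      AND author NOT IN (
        SELECT {d} author FROM posts WHERE created_utc < 1420070400
      )
    ORDER BY author
"""

with_dedup = [row[0] for row in conn.execute(QUERY.format(d="DISTINCT"))]
without_dedup = [row[0] for row in conn.execute(QUERY.format(d=""))]
print(with_dedup, without_dedup)  # ['alice'] ['alice']
```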