YouTube Data API to crawl all comments and replies
【Posted】2021-11-05 17:08:55
【Question】I have been desperately looking for a solution to crawl all the comments and their corresponding replies for my research. I am having a hard time creating a data frame that holds the comment data in the correct, corresponding order.
I will share my code here so that you professionals can take a look and give me some insight:
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment': [comment], 'Date': [comment2], 'ID': [comment3], 'Likes': [comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a', header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break
    return comments
When I do this, my crawler collects comments, but not all of the replies under certain comments.
How can I make it collect the comments together with their corresponding replies and put them into a single data frame?
UPDATE
So, somehow I managed to extract the information I wanted in the output section of my Jupyter Notebook. All I have to do now is to append the results to a data frame.
Here is my updated code:
def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)
            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)
            print('==============================')
            comments.append(comment)

        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break
    return comments
The result is:
(screenshot of the notebook output omitted)
As you can see, the comments grouped under each ============================== separator are a comment followed by its corresponding replies.
What would be a good way to append these results to a data frame?
【Discussion】:
It collects some of the replies, but not all of them. For example, if a comment has 3 replies, it doesn't add them to the CSV. What am I doing wrong?!
Welcome to SO! You can edit your question; you don't have to add context in the comments.
@Aumazing_DaNub Could you please mention which Python YouTube API wrapper you are using?
@Aumazing_DaNub You are also overwriting the data variable with data = {'Reply ID': ...} instead of appending to it. You should learn how to append the comment results to your pandas DataFrame object (I would use plain Python with a list of dicts here, not pandas); see the sketch right after this discussion.
@Timus Thanks! I'll keep that in mind :)
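To make the list-of-dicts suggestion from the last comment concrete, here is a minimal sketch (not the asker's original code; the column names are illustrative choices) that accumulates one row dict per comment and per reply, then builds the DataFrame once at the end instead of calling to_csv inside the loop. Note that it still only sees the replies that commentThreads().list returns inline; the answers below deal with fetching the complete reply sets:

import pandas as pd

def get_video_comments_df(service, **kwargs):
    # Accumulate plain dicts first; build the DataFrame once at the end.
    rows = []
    results = service.commentThreads().list(**kwargs).execute()
    while results:
        for item in results['items']:
            top = item['snippet']['topLevelComment']['snippet']
            rows.append({'Type': 'comment', 'ID': top['authorDisplayName'],
                         'Comment': top['textDisplay'], 'Date': top['publishedAt'],
                         'Likes': top['likeCount']})
            # Replies delivered inline with the thread (possibly incomplete):
            for reply in item.get('replies', {}).get('comments', []):
                r = reply['snippet']
                rows.append({'Type': 'reply', 'ID': r['authorDisplayName'],
                             'Comment': r['textDisplay'], 'Date': r['publishedAt'],
                             'Likes': r['likeCount']})
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break
    return pd.DataFrame(rows)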
【Solution 1】:
According to the official documentation, the property replies.comments[] of the CommentThreads resource has the following specification:
replies.comments[] (list): A list of one or more replies to the top-level comment. Each item in the list is a comment resource. The list contains a limited number of replies; unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.
Therefore, if you want to obtain all the reply entries associated with a given top-level comment, you will have to use the appropriately queried Comments.list API endpoint.
I recommend that you read my answer to a very much related question; it is split into three sections:
Top-level comments and associated replies,
The property nextPageToken and the parameter pageToken, and
API limitations imposed by design.
From the get-go, you will have to acknowledge that the API (as currently implemented) does not allow obtaining all the top-level comments associated with a given video when the number of those comments exceeds a certain (unspecified) upper bound.
For a Python implementation, I would recommend structuring the code as follows:
def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 100
    )

    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' property set to an array of 'Comments Resource'

            # Do fill in the 'comments' data structure
            # to be provided by this function:
            ...

        request = service.commentThreads().list_next(
            request, response)

    return comments
def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100
    )

    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies
Do note that the ellipsis above -- ... -- has to be replaced with actual code that fills in the array of structures that get_video_comments returns to its caller.
The simplest way (useful for quick testing) would be to replace ... with comments.append(comment), and to then have the caller of get_video_comments simply print out (using json.dump) the object obtained from that function.
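For instance, a minimal caller along those lines might look like the sketch below; note that 'VIDEO_ID' is a placeholder, and get_authenticated_service is assumed to be the usual OAuth bootstrap from the official API samples (one concrete version appears in the other answer):

import json

# Build an authorized YouTube Data API v3 client (assumed helper):
service = get_authenticated_service()

# Fetch all top-level comments along with their complete reply lists:
comments = get_video_comments(service, 'VIDEO_ID')

# Pretty-print the raw resources for quick inspection; json.dumps
# returns a string, whereas json.dump would write to a file object:
print(json.dumps(comments, indent = 2))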
【Discussion】:
Do ask for more details w.r.t. Python, if you feel the need for it.
Thanks for the quick response! I'll take a look and let you know how it goes!
Hi! Thanks for the quick response! So if I try to use the Comments.list endpoint instead of the CommentThreads.list endpoint, do I have to write entirely new code? I know I'm a bit of a noob and would like to understand more before doing these things, but due to some unavoidable circumstances I have to get this done as soon as possible :( I wonder if you could help me understand the steps better! Thank you!
You don't have to give up using CommentThreads.list; rather, you have to use it together with Comments.list. That is, for a given video, you iterate over all of its top-level comments via CommentThreads.list (using pagination), and for each such top-level comment obtained, you iterate (again with pagination) over all of its attached reply comments by means of Comments.list.
【Solution 2】:
Based on stvar's answer and the original publication here, I built the following code:
import os
import pickle
import csv
import json
import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

CLIENT_SECRETS_FILE = "client_secret.json" # for more information on creating your credentials json please visit https://python.gotrained.com/youtube-api-extracting-comments/
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    # Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)
def get_video_comments(service, **kwargs):
    request = service.commentThreads().list(**kwargs)

    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
                replies['comments'] = get_comment_replies(
                    service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' property set to an array of 'Comments Resource'
            comments.append(comment)

        request = service.commentThreads().list_next(
            request, response)

    return comments
def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100 # the API caps maxResults at 100
    )

    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies
if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    videoId = input('Enter Video id : ') # video id here (the video id of https://www.youtube.com/watch?v=vedLpKXzZqE -> is vedLpKXzZqE)
    comments = get_video_comments(service, videoId=videoId, part='id,snippet,replies', maxResults=100) # the API caps maxResults at 100
    with open('youtube_comments', 'w', encoding='UTF8') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for row in comments:
            # write each comment resource (a dict) as a single-column row
            writer.writerow([row])
It produces a file called youtube_comments with the following format:
"'kind': 'youtube#commentThread', 'etag': 'gvhv4hkH0H2OqQAHQKxzfA-K_tA', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'topLevelComment': 'kind': 'youtube#comment', 'etag': 'qpuKZcuD4FKf6BHgRlMunersEeU', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is a comment', 'textOriginal': 'This is a comment', 'authorDisplayName': 'Gabriell Magana', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLRGBvo2ZncDP1xGjlX6anfUufNYi9b3w9kYZFDl=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCKAa4FYftXsN7VKaPSlCivg', 'authorChannelId': 'value': 'UCKAa4FYftXsN7VKaPSlCivg', 'canRate': True, 'viewerRating': 'none', 'likeCount': 8, 'publishedAt': '2019-05-22T12:38:34Z', 'updatedAt': '2019-05-22T12:38:34Z', 'canReply': True, 'totalReplyCount': 0, 'isPublic': True"
"'kind': 'youtube#commentThread', 'etag': 'DsgDziMk7mB7xN4OoX7cmqlbDYE', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'topLevelComment': 'kind': 'youtube#comment', 'etag': 'NYjvYM9W_umBafAfQkdg1P9apgg', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is another comment', 'textOriginal': 'This is another comment', 'authorDisplayName': 'Mary Montes', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTg1b1yw8BX8Af0PoTR_t5OOwP9Cfl9_qL-o1iikw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC_GP_8HxDPsqJjJ3Fju_UeA', 'authorChannelId': 'value': 'UC_GP_8HxDPsqJjJ3Fju_UeA', 'canRate': True, 'viewerRating': 'none', 'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z', 'updatedAt': '2019-05-15T05:10:49Z', 'canReply': True, 'totalReplyCount': 3, 'isPublic': True, 'replies': 'comments': ['kind': 'youtube#comment', 'etag': 'Tu41ENCZYNJ2KBpYeYz4qgre0H8', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF79DbfJ9zMKxM', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'this is first reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'JULIO EMPRESARIO', 'authorProfileImageUrl': 'https://yt3.ggpht.com/eYP4MBcZ4bON_pHtdbtVsyWnsKbpNKye2wTPhgkffkMYk3ZbN0FL6Aa1o22YlFjn2RVUAkSQYw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCrpB9oZZZfmBv1aQsxrk66w', 'authorChannelId': 'value': 'UCrpB9oZZZfmBv1aQsxrk66w', 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z', 'updatedAt': '2020-09-15T04:06:50Z', 'kind': 'youtube#comment', 'etag': 'OrpbnJddwzlzwGArCgtuuBsYr94', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF795E1w8RV1DJ', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'the second replay', 'textOriginal': 'the second replay', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Anatolio27 Diaz', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLR1hOySIxEkvRCySExHjo3T6zGBNkvuKpPkqA=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC04N8BM5aUwDJf-PNFxKI-g', 'authorChannelId': 'value': 'UC04N8BM5aUwDJf-PNFxKI-g', 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-02-19T18:21:06Z', 'updatedAt': '2020-02-19T18:21:06Z', 'kind': 'youtube#comment', 'etag': 'sPmIwerh3DTZshLiDVwOXn_fJx0', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF78wwH6Aabh4y', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'A third reply', 'textOriginal': 'A third reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Voy detrás de mi pasión', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTgzZ3ZFvkmmAlMzA77ApM-2uGFfvOBnzxegYEX=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCvv6QMokO7KcJCDpK6qZg3Q', 'authorChannelId': 'value': 'UCvv6QMokO7KcJCDpK6qZg3Q', 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2019-07-03T18:45:34Z', 'updatedAt': '2019-07-03T18:45:34Z']"
Now a second step is necessary in order to extract the desired information. For that, I used a set of shell tools such as cut, awk and sed:
cut -d ":" -f 10- youtube_comments | sed -e "s/', '/\n/g" -e "s/'//g" | awk '/replies/print "------------------------****---------::: Replies: "$6" :::---------******--------------------------------"!/replies/print' |sed '/^textOriginal:/,/^authorDisplayName://^authorDisplayName/!d' |sed '/^authorProfileImageUrl:\|^authorChannelUrl:\|^authorChannelId:\|^etag:\|^updatedAt:\|^parentId:\|^id:/d' |sed 's/<[^>]*>//g' | sed 's/textDisplay/\ntextDisplay/' |sed '/^snippet:/d' | awk -F":" '(NF==1)print "========================================COMMENT==========================================="(NF>1)a=0; print $0' | sed 's/textDisplay: //g' | sed 's/authorDisplayName/User/g' | sed 's/T[0-9]\2\:[0-9]\2\:[0-9]\2\Z//g' | sed 's/likeCount: /Likes:/g' | sed 's/publishedAt: //g' > output_file
The final result is a file called output_file with the following format:
========================================COMMENT===========================================
This is a comment
User: Robert Everest
Likes:8, 2019-05-22
========================================COMMENT===========================================
This is another comment
User: Anna Davis
Likes:9, 2019-05-15
------------------------****---------::: Replies: 3, :::---------******--------------------------------
this is first reply
User: John Doe
Likes:2, 2020-09-15
the second replay
User: Caraqueno
Likes:2, 2020-02-19
A third reply
User: Rebeca
Likes:2, 2019-07-03
The Python script needs the token.pickle file in order to work; it is generated the first time the script runs, and when it expires it has to be deleted and regenerated.
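As an alternative to the shell post-processing, the flattening could also be done directly in Python before anything is written to disk. The sketch below is one possible way to do it (the column names are my own choice, not API fields); it reuses the comments list returned by get_video_comments above, where each thread carries its fully populated replies:

import csv

def write_flat_csv(comments, path):
    # One CSV row per top-level comment and per reply; replies are read
    # from the fully populated 'replies.comments' array.
    fields = ['type', 'author', 'text', 'likes', 'publishedAt']
    with open(path, 'w', encoding='UTF8', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for thread in comments:
            top = thread['snippet']['topLevelComment']['snippet']
            writer.writerow({'type': 'comment', 'author': top['authorDisplayName'],
                             'text': top['textDisplay'], 'likes': top['likeCount'],
                             'publishedAt': top['publishedAt']})
            for reply in thread.get('replies', {}).get('comments', []):
                r = reply['snippet']
                writer.writerow({'type': 'reply', 'author': r['authorDisplayName'],
                                 'text': r['textDisplay'], 'likes': r['likeCount'],
                                 'publishedAt': r['publishedAt']})

From there, pandas.read_csv (or building a pandas.DataFrame from the same row dicts) yields the data frame the question asked for.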
【Discussion】:
Since you have passed over my answer below, and since you are explicitly using GNU+Linux/Unix power tools, I suggest you check out my tool: Json-Type: JSON Push Parsing and Type Checking. Json-Type is able to parse and type-check JSON files/streams very efficiently, and it is also able to extract information from those sources very efficiently. See the README file, in particular subsection 4.g, Extracting (tabular) data from JSON input.
@stvar, thanks for the suggestion!