YouTube Data API to crawl all comments and replies

Posted: 2021-11-05 17:08:55

Question:

I have been desperately looking for a solution to crawl all comments and their corresponding replies for my research. I am having a hard time creating a data frame that contains the comment data in the right, corresponding order.

I will share my code here so that you professionals can take a look and give me some insight.

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    data = {'Reply ID': [rauthor], 'Reply Time': [rtime], 'Reply Comments': [rtext], 'Reply Likes': [rlike]}
                    print(rauthor)
                    print(rtext)
            data = {'Comment': [comment], 'Date': [comment2], 'ID': [comment3], 'Likes': [comment4]}
            result = pd.DataFrame(data)
            result.to_csv('youtube.csv', mode='a',header=False)
            print(comment)
            print(comment2)
            print(comment3)
            print(comment4)
            print('==============================')
            comments.append(comment)
                
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

When I do this, my crawler collects comments, but it does not collect some of the replies under certain comments.

How can I make it collect the comments together with their corresponding replies and put them into a single data frame?

UPDATE

So, somehow I managed to extract the information I wanted in the output section of my Jupyter Notebook. All I have to do now is append the results to a data frame.

Here is my updated code:

def get_video_comments(service, **kwargs):
    comments = []
    results = service.commentThreads().list(**kwargs).execute()

    while results:
        for item in results['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comment2 = item['snippet']['topLevelComment']['snippet']['publishedAt']
            comment3 = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
            comment4 = item['snippet']['topLevelComment']['snippet']['likeCount']
            if 'replies' in item.keys():
                for reply in item['replies']['comments']:
                    rauthor = reply['snippet']['authorDisplayName']
                    rtext = reply['snippet']['textDisplay']
                    rtime = reply['snippet']['publishedAt']
                    rlike = reply['snippet']['likeCount']
                    print(rtext)
                    print(rtime)
                    print(rauthor)
                    print('Likes: ', rlike)
                    
            print(comment)
            print(comment2)
            print(comment3)
            print("Likes: ", comment4)

            print('==============================')
            comments.append(comment)
                
        # Check if another page exists
        if 'nextPageToken' in results:
            kwargs['pageToken'] = results['nextPageToken']
            results = service.commentThreads().list(**kwargs).execute()
        else:
            break

    return comments

The result is:

As you can see, the items grouped between the ======== lines are a comment followed by its corresponding replies.

What would be a good way to append these results to a data frame?

Question comments:

- It collects some of the replies, but not all of them. For example, if a comment has 3 replies, they don't get added to the CSV. What am I doing wrong?!
- Welcome to SO! You can edit your question; you don't have to add context in the comments.
- @Aumazing_DaNub Could you please mention which Python YouTube API wrapper you are using?
- @Aumazing_DaNub You are also overwriting the data variable with data = {'Reply ID': ...} instead of appending to it. You should learn how to append the comment results to your pandas DataFrame object (I would use plain Python with a list of dictionaries here, rather than pandas).
- @Timus Thank you! I will keep that in mind :)

Answer 1:

According to the official documentation, the property replies.comments[] of the CommentThreads resource has the following specification:

replies.comments[] (list): A list of one or more replies to the top-level comment. Each item in the list is a comment resource.

The list contains a limited number of replies, and unless the number of items in the list equals the value of the snippet.totalReplyCount property, the list of replies is only a subset of the total number of replies available for the top-level comment. To retrieve all of the replies for the top-level comment, you need to call the Comments.list method and use the parentId request parameter to identify the comment for which you want to retrieve replies.

Therefore, if you want to obtain all reply entries associated with a given top-level comment, you will have to use the Comments.list API endpoint, appropriately queried.

I recommend reading my answer to a very much related question; it is split into three sections:

- Top-level comments and associated replies
- The property nextPageToken and the parameter pageToken
- API limitations imposed by design

From the outset, you will have to acknowledge that the API (as currently implemented) does not allow fetching all top-level comments associated with a given video once the number of those comments exceeds a certain (unspecified) upper bound.


For a Python implementation, I suggest you structure your code as follows:

def get_video_comments(service, video_id):
    request = service.commentThreads().list(
        videoId = video_id,
        part = 'id,snippet,replies',
        maxResults = 100
    )
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' property as an array of 'Comments Resource'

            # Do fill in the 'comments' data structure 
            # to be provided by this function:
            ...

        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies

Note that the ellipsis above -- ... -- has to be replaced with the actual code that fills in the comments data structure that get_video_comments returns to its caller.

The simplest way (for quick testing) is to replace ... with comments.append(comment) and then have the caller of get_video_comments simply print out (e.g. with json.dump) the object obtained from that function.
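
For instance, once the ellipsis is replaced with comments.append(comment) as described, a minimal test caller could look like the sketch below. This is my own illustration, not part of the original answer; building the client with a plain API key via googleapiclient.discovery.build and the placeholder values YOUR_API_KEY / VIDEO_ID are assumptions on my part (an OAuth-authorized client would work just as well):

# A hedged sketch of a test driver: collect the comment threads and dump them as JSON.
import json
from googleapiclient.discovery import build

if __name__ == '__main__':
    # An API key is assumed here for simplicity; substitute your own credentials.
    service = build('youtube', 'v3', developerKey='YOUR_API_KEY')  # placeholder key
    comments = get_video_comments(service, 'VIDEO_ID')             # placeholder video id
    print(json.dumps(comments, indent=2))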

Comments:

- Please do ask for more details w.r.t. Python, if you feel the need.
- Thank you for the quick response! I will take a look and let you know how it goes!
- Hi! Thanks for the quick response! So, if I try to use the Comments.list endpoint instead of the CommentThreads.list endpoint, do I have to write entirely new code? I know I'm a bit of a noob and would like to understand things better before doing them, but due to some unavoidable circumstances I have to get this done as soon as possible :( I was wondering if you could help me understand the steps better! Thanks!
- You don't have to give up using CommentThreads.list, but you have to use it together with Comments.list. That is, for a given video, you iterate over all top-level comments via CommentThreads.list (with pagination), and for each such top-level comment obtained, you iterate (again with pagination) over all of its attached reply comments by means of Comments.list.

Answer 2:

Based on stvar's answer and the original publication here, I built the following code:

import os
import pickle
import csv
import json
import google.oauth2.credentials
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request

CLIENT_SECRETS_FILE = "client_secret.json" # for more information on how to create your credentials JSON, please visit https://python.gotrained.com/youtube-api-extracting-comments/
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'

def get_authenticated_service():
    credentials = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            credentials = pickle.load(token)
    #  Check if the credentials are invalid or do not exist
    if not credentials or not credentials.valid:
        # Check if the credentials have expired
        if credentials and credentials.expired and credentials.refresh_token:
            credentials.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                CLIENT_SECRETS_FILE, SCOPES)
            credentials = flow.run_console()

        # Save the credentials for the next run
        with open('token.pickle', 'wb') as token:
            pickle.dump(credentials, token)

    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

def get_video_comments(service, **kwargs):
    request = service.commentThreads().list(**kwargs)
    comments = []

    while request:
        response = request.execute()

        for comment in response['items']:
            reply_count = comment['snippet'] \
                ['totalReplyCount']
            replies = comment.get('replies')
            if replies is not None and \
               reply_count != len(replies['comments']):
               replies['comments'] = get_comment_replies(
                   service, comment['id'])

            # 'comment' is a 'CommentThreads Resource' that has its
            # 'replies.comments' property as an array of 'Comments Resource'

            # Do fill in the 'comments' data structure 
            # to be provided by this function:
            comments.append(comment)

        request = service.commentThreads().list_next(
            request, response)

    return comments

def get_comment_replies(service, comment_id):
    request = service.comments().list(
        parentId = comment_id,
        part = 'id,snippet',
        maxResults = 100  # the API accepts at most 100 results per page
    )
    replies = []

    while request:
        response = request.execute()
        replies.extend(response['items'])
        request = service.comments().list_next(
            request, response)

    return replies


if __name__ == '__main__':
    # When running locally, disable OAuthlib's HTTPs verification. When
    # running in production *do not* leave this option enabled.
    os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
    service = get_authenticated_service()
    videoId = input('Enter Video id : ') # video id here (e.g. for https://www.youtube.com/watch?v=vedLpKXzZqE the video id is vedLpKXzZqE)
    comments = get_video_comments(service, videoId=videoId, part='id,snippet,replies', maxResults=100)  # the API accepts at most 100 results per page


with open('youtube_comments', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in comments:
        # write each commentThread resource (a dict) as a single CSV field
        writer.writerow([row])

This returns a file called youtube_comments with the following format:

"'kind': 'youtube#commentThread', 'etag': 'gvhv4hkH0H2OqQAHQKxzfA-K_tA', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'topLevelComment': 'kind': 'youtube#comment', 'etag': 'qpuKZcuD4FKf6BHgRlMunersEeU', 'id': 'UgzSgI1YEvwcuF4cPwN4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is a comment', 'textOriginal': 'This is a comment', 'authorDisplayName': 'Gabriell Magana', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLRGBvo2ZncDP1xGjlX6anfUufNYi9b3w9kYZFDl=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCKAa4FYftXsN7VKaPSlCivg', 'authorChannelId': 'value': 'UCKAa4FYftXsN7VKaPSlCivg', 'canRate': True, 'viewerRating': 'none', 'likeCount': 8, 'publishedAt': '2019-05-22T12:38:34Z', 'updatedAt': '2019-05-22T12:38:34Z', 'canReply': True, 'totalReplyCount': 0, 'isPublic': True"
"'kind': 'youtube#commentThread', 'etag': 'DsgDziMk7mB7xN4OoX7cmqlbDYE', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'topLevelComment': 'kind': 'youtube#comment', 'etag': 'NYjvYM9W_umBafAfQkdg1P9apgg', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'This is another comment', 'textOriginal': 'This is another comment', 'authorDisplayName': 'Mary Montes', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTg1b1yw8BX8Af0PoTR_t5OOwP9Cfl9_qL-o1iikw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC_GP_8HxDPsqJjJ3Fju_UeA', 'authorChannelId': 'value': 'UC_GP_8HxDPsqJjJ3Fju_UeA', 'canRate': True, 'viewerRating': 'none', 'likeCount': 9, 'publishedAt': '2019-05-15T05:10:49Z', 'updatedAt': '2019-05-15T05:10:49Z', 'canReply': True, 'totalReplyCount': 3, 'isPublic': True, 'replies': 'comments': ['kind': 'youtube#comment', 'etag': 'Tu41ENCZYNJ2KBpYeYz4qgre0H8', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF79DbfJ9zMKxM', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'this is first reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'JULIO EMPRESARIO', 'authorProfileImageUrl': 'https://yt3.ggpht.com/eYP4MBcZ4bON_pHtdbtVsyWnsKbpNKye2wTPhgkffkMYk3ZbN0FL6Aa1o22YlFjn2RVUAkSQYw=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCrpB9oZZZfmBv1aQsxrk66w', 'authorChannelId': 'value': 'UCrpB9oZZZfmBv1aQsxrk66w', 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-09-15T04:06:50Z', 'updatedAt': '2020-09-15T04:06:50Z', 'kind': 'youtube#comment', 'etag': 'OrpbnJddwzlzwGArCgtuuBsYr94', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF795E1w8RV1DJ', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'the second replay', 'textOriginal': 'the second replay', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Anatolio27 Diaz', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLR1hOySIxEkvRCySExHjo3T6zGBNkvuKpPkqA=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UC04N8BM5aUwDJf-PNFxKI-g', 'authorChannelId': 'value': 'UC04N8BM5aUwDJf-PNFxKI-g', 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2020-02-19T18:21:06Z', 'updatedAt': '2020-02-19T18:21:06Z', 'kind': 'youtube#comment', 'etag': 'sPmIwerh3DTZshLiDVwOXn_fJx0', 'id': 'UgytsI51LU6BWRmYtBB4AaABAg.8uwduw6ppF78wwH6Aabh4y', 'snippet': 'videoId': 'tGTaBt4Hfd0', 'textDisplay': 'A third reply', 'textOriginal': 'A third reply', 'parentId': 'UgytsI51LU6BWRmYtBB4AaABAg', 'authorDisplayName': 'Voy detrás de mi pasión', 'authorProfileImageUrl': 'https://yt3.ggpht.com/ytc/AKedOLTgzZ3ZFvkmmAlMzA77ApM-2uGFfvOBnzxegYEX=s48-c-k-c0x00ffffff-no-rj', 'authorChannelUrl': 'http://www.youtube.com/channel/UCvv6QMokO7KcJCDpK6qZg3Q', 'authorChannelId': 'value': 'UCvv6QMokO7KcJCDpK6qZg3Q', 'canRate': True, 'viewerRating': 'none', 'likeCount': 2, 'publishedAt': '2019-07-03T18:45:34Z', 'updatedAt': '2019-07-03T18:45:34Z']"

Now a second step is necessary to extract the desired information. For this, I used a set of shell tools such as cut, awk, and sed:

cut -d ":" -f 10- youtube_comments | sed -e "s/', '/\n/g" -e "s/'//g" | awk '/replies/{print "------------------------****---------:::   Replies: "$6"  :::---------******--------------------------------"}!/replies/{print}' | sed '/^textOriginal:/,/^authorDisplayName:/{/^authorDisplayName/!d}' | sed '/^authorProfileImageUrl:\|^authorChannelUrl:\|^authorChannelId:\|^etag:\|^updatedAt:\|^parentId:\|^id:/d' | sed 's/<[^>]*>//g' | sed 's/textDisplay/\ntextDisplay/' | sed '/^snippet:/d' | awk -F":" '(NF==1){print "========================================COMMENT==========================================="}(NF>1){a=0; print $0}' | sed 's/textDisplay: //g' | sed 's/authorDisplayName/User/g' | sed 's/T[0-9]\{2\}:[0-9]\{2\}:[0-9]\{2\}Z//g' | sed 's/likeCount: /Likes:/g' | sed 's/publishedAt: //g' > output_file

The end result is a file called output_file with the following format:

========================================COMMENT===========================================
This is a comment
User: Robert Everest
Likes:8, 2019-05-22
========================================COMMENT===========================================
This is another comment
User: Anna Davis
Likes:9, 2019-05-15
------------------------****---------:::   Replies: 3,  :::---------******--------------------------------
this is first reply
User: John Doe
Likes:2, 2020-09-15
the second replay
User: Caraqueno
Likes:2, 2020-02-19
A third reply
User: Rebeca
Likes:2, 2019-07-03
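
As an alternative to the shell pipeline above, the flattening could also be done directly in Python. The sketch below is my own and not part of the original answer; it turns the commentThread resources collected by get_video_comments into a pandas DataFrame, with column names of my own choosing rather than anything prescribed by the API:

# A hedged sketch: flatten each top-level comment and its replies into one row per comment.
import pandas as pd

def threads_to_dataframe(comment_threads):
    rows = []
    for thread in comment_threads:
        top = thread['snippet']['topLevelComment']['snippet']
        rows.append({
            'type': 'comment',
            'author': top['authorDisplayName'],
            'text': top['textDisplay'],
            'published_at': top['publishedAt'],
            'likes': top['likeCount'],
            'parent_id': None,
        })
        for reply in thread.get('replies', {}).get('comments', []):
            snippet = reply['snippet']
            rows.append({
                'type': 'reply',
                'author': snippet['authorDisplayName'],
                'text': snippet['textDisplay'],
                'published_at': snippet['publishedAt'],
                'likes': snippet['likeCount'],
                'parent_id': snippet['parentId'],
            })
    return pd.DataFrame(rows)

# Usage: df = threads_to_dataframe(comments); df.to_csv('youtube_comments.csv', index=False)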

The Python script needs the token.pickle file in order to work; it is generated the first time the script is run, and when it expires it has to be deleted and regenerated.
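
If you prefer not to delete the stale file by hand, a small wrapper along the following lines could do it. This is only a sketch of mine, assuming the failed refresh surfaces as google.auth.exceptions.RefreshError:

# A hedged sketch: retry authentication after removing an expired token.pickle.
import os
from google.auth.exceptions import RefreshError

def get_service_with_retry():
    try:
        return get_authenticated_service()
    except RefreshError:
        # The cached credentials could not be refreshed; drop the stale token
        # and force a fresh OAuth console flow on the second attempt.
        if os.path.exists('token.pickle'):
            os.remove('token.pickle')
        return get_authenticated_service()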

Comments:

- Since you ignored my answer below, and since you explicitly use GNU+Linux/Unix power tools, I suggest you check out my tool: Json-Type: JSON Push Parsing and Type Checking. Json-Type is able to parse and type-check JSON files/streams very efficiently, and it is also able to extract information from those sources very efficiently. See the README file, in particular subsection 4.g, Extracting (tabular) data from JSON input.
- @stvar, thanks for the suggestion
