Scraping all the answers under the Zhihu "凡尔赛语录" (Versailles quotes) question: I know whoever clicks in is good-looking, but still not as handsome as me

Posted by 世上本无鬼


Versailles literature has gone viral. This peculiar internet writing style, usually found in WeChat Moments or on Weibo, flaunts wealth or a perfect relationship in a deliberately casual, matter-of-fact tone.
Ordinary showing off is just posting sports-car photos on social media or "accidentally" revealing a designer bag's logo, but Versailles literature is less direct than that. A Weibo blogger even made a tutorial video on Versailles literature, explaining its three essential elements.

On Douban there is even a group called the Versailles Studies Research Group, whose members define "Versailles" as the spirit of performing a high-class life. Alright, on to the main topic: today we will quickly scrape the Zhihu answers related to Versailles quotes. Let's begin.

1. The target page

Search for 凡尔赛语录 on Zhihu; the second result is the best fit, so that's the one we'll use.

Clicking into it, we can see that this question has 393 answers in total.

URL: https://www.zhihu.com/question/429548386/answer/1575062220

Removing the answer segment and everything after it gives the question URL we want to scrape: https://www.zhihu.com/question/429548386. The trailing string of digits is the question id, which uniquely identifies the question on Zhihu.
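As a quick illustration (this snippet is separate from the crawler below), the question id can be pulled out of an answer URL with a regular expression:

import re

# Illustrative only: extract the question id from a full answer URL
answer_url = "https://www.zhihu.com/question/429548386/answer/1575062220"
question_id = re.search(r"/question/(\d+)", answer_url).group(1)
print(question_id)  # prints 429548386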

2. Scraping the question's answers

Looking at the URL above, there are two parts of data to scrape:

  1. The question details, including creation time, follower count, view count, the question description, and so on
  2. The answers, including each answerer's username and follower count, plus each answer's content, post time, comment count, and upvote count

The question details can be scraped directly from the URL above and extracted by parsing the page with bs4, while the answers have to be fetched through the API URL below, where each page is determined by a per-page limit and a starting offset, which is essentially paginated scraping.

def init_url(question_id, limit, offset):  
    base_url_start = "https://www.zhihu.com/api/v4/questions/"  
    base_url_end = "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics&limit={0}&offset={1}".format(limit, offset)  

    return base_url_start + question_id + base_url_end

Set the per-page answer count to limit=20; offset then takes the values 0, 20, 40, and so on, and question_id is the string of digits at the end of the URL mentioned above, here 429548386. Once the logic is clear, all that is left is writing the crawler to fetch the data. The complete crawler code is in the next section; to run it, you only need to change the question id.
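Before diving into the full script, here is a minimal sketch of the URLs this pagination produces, using the init_url function shown above (three pages is an arbitrary number, purely for illustration):

# Illustrative pagination check using the init_url() defined above
question_id = "429548386"
limit = 20
for page in range(3):        # first three pages only, for illustration
    offset = page * limit    # 0, 20, 40
    print(init_url(question_id, limit, offset))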

3. Complete code

# Import the required libraries
import json
import re
import time
from datetime import datetime
from time import sleep
import pandas as pd
import requests
from bs4 import BeautifulSoup
import random
import warnings
warnings.filterwarnings('ignore')


def get_ua():
    """
    Pick a random User-Agent from the UA pool
    :return: a random UA string from the list
    """
    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (Khtml, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
        "Opera/8.0 (Windows NT 5.1; U; en)",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.2.149.27 Safari/525.13",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Macintosh; U; IntelMac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1Safari/534.50",
        "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]

    return random.choice(ua_list)
    

def filter_emoji(text):
    """
    Strip emoji characters from a string (a utility helper, not called in the main flow)
    @param text: input string
    @return: the string with emoji removed
    """
    try:
        co = re.compile(u'[\U00010000-\U0010ffff]')
    except re.error:
        # Fallback for narrow Python builds that reject \U escapes in character classes
        co = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
    text = co.sub('', text)

    return text


def get_question_base_info(url):
    """
    Fetch the question's detail page and parse its basic information
    @param url: the question URL
    @return: title, description, follower count, view count, answer count, tag list
    """
    response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)

    """Parse the returned HTML"""
    soup = BeautifulSoup(response.text, 'lxml')
    # Question title
    title = soup.find("h1", {"class": "QuestionHeader-title"}).text
    # Question description
    question = ''
    try:
        question = soup.find("div", {"class": "QuestionRichText--collapsed"}).text.replace('\u200b', '')
    except Exception as e:
        print(e)
    # Follower count
    follower = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[0].text.strip().replace(",", ""))
    # View count
    watched = int(soup.find_all("strong", {"class": "NumberBoard-itemValue"})[1].text.strip().replace(",", ""))
    # Answer count text, e.g. "393 个回答"
    answer_str = soup.find_all("h4", {"class": "List-headerText"})[0].span.text.strip()
    # Extract the number from "xxx 个回答" (strip thousands separators first)
    answer_count = int(re.findall(r'\d+', answer_str.replace(',', ''))[0])

    # Question topic tags
    tag_list = []
    tags = soup.find_all("div", {"class": "QuestionTopic"})
    for tag in tags:
        tag_list.append(tag.text)

    return title, question, follower, watched, answer_count, tag_list


def init_url(question_id, limit, offset):
    """
    Build the answers API URL for one page
    @param question_id: the Zhihu question id
    @param limit: number of answers per page
    @param offset: starting index of the page
    @return: the full request URL
    """
    base_url_start = "https://www.zhihu.com/api/v4/questions/"
    base_url_end = "/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed" \
                   "%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by" \
                   "%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count" \
                   "%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info" \
                   "%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting" \
                   "%2Cis_thanked%2Cis_nothelp%2Cis_labeled%2Cis_recognized%2Cpaid_info%2Cpaid_info_content%3Bdata%5B" \
                   "%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics" \
                   "&limit={0}&offset={1}".format(limit, offset)

    return base_url_start + question_id + base_url_end


def get_time_str(timestamp):
    """
    Convert a Unix timestamp to a standard date string
    @param timestamp: Unix timestamp in seconds
    @return: a "YYYY-MM-DD HH:MM:SS" string, or '' on failure
    """
    datetime_str = ''
    try:
        # Convert the timestamp to a datetime object
        datetime_time = datetime.fromtimestamp(timestamp)
        # Format the datetime object as a date string
        datetime_str = datetime_time.strftime("%Y-%m-%d %H:%M:%S")
    except Exception as e:
        print(e)
        print("Timestamp conversion failed")

    return datetime_str


def get_answer_info(url, index):
    """
    Fetch one page of answers from the API and parse it into a DataFrame
    @param url: the answers API URL for this page
    @param index: zero-based page index (used for logging only)
    @return: a DataFrame with one row per answer
    """
    response = requests.get(url=url, headers={'User-Agent': get_ua()}, timeout=10)
    text = response.text.replace('\u200b', '')

    per_answer_list = []
    try:
        question_json = json.loads(text)
        """Extract the answers on the current page"""
        print("Fetching page {0} of answers, {1} answers on this page".format(index + 1, len(question_json["data"])))
        for data in question_json["data"]:
            """Question-level fields"""
            # Question type, id, question_type, creation time, update time
            question_type = data["question"]['type']
            question_id = data["question"]['id']
            question_question_type = data["question"]['question_type']
            question_created = get_time_str(data["question"]['created'])
            question_updated_time = get_time_str(data["question"]['updated_time'])

            """Author-level fields"""
            # The author's username, headline, gender, follower count
            author_name = data["author"]['name']
            author_headline = data["author"]['headline']
            author_gender = data["author"]['gender']
            author_follower_count = data["author"]['follower_count']

            """Answer-level fields"""
            # Answer id, creation time, update time, upvote count, comment count, content
            id = data['id']
            created_time = get_time_str(data["created_time"])
            updated_time = get_time_str(data["updated_time"])
            voteup_count = data["voteup_count"]
            comment_count = data["comment_count"]
            content = data["content"]

            per_answer_list.append([question_type, question_id, question_question_type, question_created,
                                    question_updated_time, author_name, author_headline, author_gender,
                                    author_follower_count, id, created_time, updated_time, voteup_count, comment_count,
                                    content
                                    ])

    except Exception as e:
        print("Failed to parse the JSON response:", e)
    finally:
        answer_column = ['问题类型', '问题id', '问题提问类型', '问题创建时间', '问题更新时间',
                         '答主用户名', '答主签名', '答主性别', '答主粉丝数',
                         '答案id', '答案创建时间', '答案更新时间', '答案赞同数', '答案评论数', '答案具体内容']
        per_answer_data = pd.DataFrame(per_answer_list, columns=answer_column)

    return per_answer_data


if __name__ == '__main__':
    # question_id = '424516487'
    question_id = '429548386'
    url = "https://www.zhihu.com/question/" + question_id
    """Fetch the question's basic information"""
    title, question, follower, watched, answer_count, tag_list = get_question_base_info(url)
    print("Question URL: " + url)
    print("Question title: " + title)
    print("Question description: " + question)
    print("Question tags: " + '、'.join(tag_list))
    print("Followers: {0}, viewed {1} times".format(follower, watched))
    print("As of {}, this question has {} answers".format(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()), answer_count))

    """Fetch the answer data page by page"""
    # Build the URL for each page and crawl it
    limit, offset = 20, 0
    page_cnt = int(answer_count / limit) + 1
    answer_data = pd.DataFrame()
    for page_index in range(page_cnt):
        answer_url = init_url(question_id, limit, offset + page_index * limit)
        # Fetch and parse one page of answers
        data_per_page = get_answer_info(answer_url, page_index)
        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        answer_data = pd.concat([answer_data, data_per_page], ignore_index=True)
        sleep(3)

    answer_data.to_csv('凡尔赛沙雕语录_{0}.csv'.format(question_id), encoding='utf-8', index=False)
    print("\nDone scraping. The data has been saved.")

4. Results

A total of 393 answers were scraped. Note that the saved file is UTF-8 encoded; if the text looks garbled when you open it, first check that you are reading it with the same encoding.
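For example, here is a minimal read-back sketch with pandas, assuming the filename written by the script above:

import pandas as pd

# Read the CSV back with the same UTF-8 encoding it was written with
answer_data = pd.read_csv('凡尔赛沙雕语录_429548386.csv', encoding='utf-8')
print(answer_data.shape)             # (rows, columns)
print(answer_data.columns.tolist())  # the column names written by the crawler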

A partial screenshot of the scraped results is shown below:


Thanks for reading this far. For more Python content, follow me and check out my profile; your likes, bookmarks, and comments are what keep me updating. Thank you.

