Python Web Scraping: Fresh from Learning to Scrape, My First Project Collected Review Data for The Battle at Lake Changjin (《长津湖》)

Posted 2022-11-11 训练营资料福利官


Approach:

Data collection
Cleaning and storage
Analysis and processing

1. Data collection

API endpoint

https://m.maoyan.com/mmdb/comments/movie/257706.json?_v_=yes&offset=15&startTime=

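Breaking down the URL:

257706 is the movie ID (The Battle at Lake Changjin)

offset=15 is the number of comments loaded per request (15)

startTime is the point in time from which comments are loaded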

# {movie_id} and {last_time} are placeholders filled in by str.format
API_URL = "https://m.maoyan.com/mmdb/comments/movie/{movie_id}.json?_v_=yes&offset=15&startTime={last_time}"

# Fetch the latest comments for The Battle at Lake Changjin

url = API_URL.format(movie_id=257706, last_time="")
print(url)

# Fetch earlier comments
url = API_URL.format(movie_id=257706, last_time="2021-10-05 13:01:10")
print(url)

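Triggering the anti-scraping check

Sorry, your requests are too frequent

Getting past the anti-scraping check

Disguise yourself as an ordinary user

Modify the request headers so that the Python scraper looks like an ordinary browser user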

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: zh-CN,zh;q=0.9
Cache-Control: max-age=0
Connection: keep-alive
Cookie: Hm_lvt_703e94591e87be68cc8da0da7cbd0be2=1634128086; _lxsdk_cuid=17c799fc25bc8-0867635e9e187f-4343363-144000-17c799fc25bc8; uuid_n_v=v1; iuuid=2109DC402C2111EC94FB7932D5F4446B6CE405D5DDDA4A72A33C4117D64B4044; webp=true; ci=70%2C%E9%95%BF%E6%B2%99; ci=70%2C%E9%95%BF%E6%B2%99; ci=70%2C%E9%95%BF%E6%B2%99; featrues=[object Object]; _lxsdk=212B1FE02C2111EC93EAADFB465559EE1590794E3DEC46C1BCF9EC31597525FC; Hm_lpvt_703e94591e87be68cc8da0da7cbd0be2=1634129570; _lxsdk_s=17c799fc25b-94c-10f-437%7C%7C165
Host: m.maoyan.com
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
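
The header lines above are in the form copied from the browser's developer tools, while requests expects a dict. A minimal sketch of the conversion, assuming the headers are pasted into a multi-line string: parse_raw_headers and RAW_HEADERS are hypothetical names, and only a few of the headers are repeated here for brevity.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'Name: value' lines copied from the browser into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only; values may contain colons themselves.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


# Hypothetical subset of the headers listed above (Cookie omitted for brevity)
RAW_HEADERS = """
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Host: m.maoyan.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
"""

new_headers = parse_raw_headers(RAW_HEADERS)
```

The resulting new_headers dict can then be passed as the headers argument of requests.get. Accept-Encoding is left out of this sketch on purpose: as far as I know, requests only decodes br responses when a Brotli package is installed, so letting the library choose the encoding is the safer default.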

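Automatically loading earlier data

import requests

last_time = ""  # empty: fetch the latest data; otherwise fetch data from the given time

for _ in range(3):  # only 3 pages for now: 3 * 15 = 45 comments
    url = API_URL.format(movie_id=257706, last_time=last_time)
    print(url)
    # new_headers is a dict built from the browser headers listed above
    resp = requests.get(url, headers=new_headers)  # browser-like headers disguise the scraper
    data = resp.json()  # decode the response body once and reuse it

    last_time = data["cmts"][-1]["startTime"]  # earliest comment time in the previous batch
    print(data)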
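
The loop above fails with an error as soon as a response has no cmts entries, for example when the request is blocked or the comments run out. A slightly more defensive sketch follows, assuming the response JSON carries a cmts list as in the loop above. collect_comments and fetch_page are hypothetical names, and the pause between requests is an assumption meant to lower the chance of the "too frequent" response, not a documented limit.

```python
import time


def collect_comments(fetch_page, pages=3, delay=1.0):
    """Collect comments page by page, walking backwards in time.

    fetch_page(last_time) must return the decoded JSON dict of one response.
    """
    comments = []
    last_time = ""  # empty string: start from the newest comments
    for _ in range(pages):
        data = fetch_page(last_time)
        page = data.get("cmts") or []
        if not page:
            break  # nothing left, or the request was blocked
        comments.extend(page)
        last_time = page[-1]["startTime"]  # earliest comment on this page
        time.sleep(delay)  # pause before the next request
    return comments
```

In practice fetch_page would be a small function that formats API_URL with last_time and returns requests.get(url, headers=new_headers).json(). Passing the fetcher in as an argument also lets the paging logic be tried out with canned data before touching the real site.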

Writing to a file

Writing JSON data to a file: Python dict ==== json.dumps ====> string
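
The post does not show the write step itself. One way it could look, assuming one JSON string per line so that each page can be appended as it arrives: append_comments, load_comments and the file path are hypothetical names, not taken from the original code.

```python
import json


def append_comments(path, comments):
    """Append each comment dict to the file as one JSON string per line."""
    with open(path, "a", encoding="utf-8") as f:
        for comment in comments:
            # ensure_ascii=False keeps Chinese text readable in the file
            f.write(json.dumps(comment, ensure_ascii=False) + "\n")


def load_comments(path):
    """Read the file back: one JSON string per line -> list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping the raw responses on disk means the cleaning step can be rerun later without requesting the site again.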

2. Cleaning and storage

Load the data from the raw file
Tidy up the data
Write it to a CSV file
Open it in Excel
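
The tidying step in the list above is not spelled out in the post. One possible sketch, under two assumptions of mine: consecutive pages may overlap, and a comment can be identified by its nick and startTime fields. clean_comments is a hypothetical helper name.

```python
def clean_comments(comments):
    """Drop duplicates and collapse whitespace in the comment text."""
    seen = set()
    cleaned = []
    for comment in comments:
        key = (comment.get("nick"), comment.get("startTime"))
        if key in seen:
            continue  # same comment fetched twice
        seen.add(key)
        row = dict(comment)  # copy so the raw data stays untouched
        # Collapse line breaks and repeated spaces into single spaces
        row["content"] = " ".join(str(row.get("content", "")).split())
        cleaned.append(row)
    return cleaned
```

Collapsing line breaks is a readability choice for viewing the data in Excel. The csv module quotes embedded newlines correctly, so this part is optional.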

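import csv

movie_id = 257706
# comments: the list of comment dicts loaded from the raw file

# utf-8-sig writes a BOM so that Excel detects the encoding correctly
with open(f"{movie_id}.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Time", "City", "Nickname", "Gender", "Score", "Approvals", "Comment"])

    for comment in comments:
        print(comment)
        writer.writerow(
            [
                comment["startTime"],
                comment["cityName"],
                comment["nick"],
                comment.get("gender", ""),  # gender may be missing
                comment["score"],
                comment["approve"],
                comment["content"],
            ]
        )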