Huya Live Data Collection as a Reserve for Data Analysis: Python Crawler 120 Examples, No. 24

Posted 2021-09-14 梦想橡皮擦

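Today's target is the live-stream listing page of the Huya channel. The focus of this post is, once again, the multithreaded crawler.

Target data analysis

The data to be collected is the list shown on the live page. When the list is paged, the data comes from a server interface, so this case is an interface-oriented multithreaded crawler.

The interface API is as follows: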

https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=getLiveListJsonpCallback&page=2
https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=getLiveListJsonpCallback&page=3

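Request method: GET
Response format: JSON
The parameters are as follows:

m: presumably the channel;
do: the interface name;
tagAll: the tag name;
callback: the callback function;
page: the page number.

Testing the interface shows that every parameter except callback must be supplied.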
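
As a small sketch of how such a URL could be assembled from the parameters listed above: build_url and BASE_URL are hypothetical names introduced here for illustration, and the parameter values are copied from the sample URLs. It only builds the string and sends no request.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.huya.com/cache.php"


def build_url(page, callback=None):
    """Assemble the live-list URL for one page (hypothetical helper)."""
    params = {
        "m": "LiveList",
        "do": "getLiveListByPage",
        "tagAll": 0,
    }
    # callback is the only optional parameter, so add it only when given
    if callback:
        params["callback"] = callback
    params["page"] = page
    return f"{BASE_URL}?{urlencode(params)}"


print(build_url(2, callback="getLiveListJsonpCallback"))
print(build_url(3))
```

Leaving callback out keeps the response as plain JSON, which avoids the unwrapping step later on.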

When page exceeds the last page number, the interface returns:

{
  "status": 200,
  "message": "",
  "data": {
    "page": 230,
    "pageSize": 120,
    "totalPage": 228,
    "totalCount": 0,
    "datas": [],
    "time": 1630141551
  }
}

Based on the response above, the first request to the interface yields totalPage, from which all the URLs to be crawled can then be generated.
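
The idea can be sketched offline against the sample response shown above. The helper names page_urls and is_past_last_page are assumptions made for this illustration, and the URL template omits callback because the interface accepts requests without it.

```python
import json

# The out-of-range response from above, used here as offline sample data
SAMPLE = """{"status": 200, "message": "", "data": {"page": 230,
"pageSize": 120, "totalPage": 228, "totalCount": 0, "datas": [],
"time": 1630141551}}"""

URL_FORMAT = (
    "https://www.huya.com/cache.php"
    "?m=LiveList&do=getLiveListByPage&tagAll=0&page={}"
)


def page_urls(payload):
    """Return one URL per page, from page 1 up to and including totalPage."""
    if payload.get("status") != 200:
        return []
    total_page = payload["data"]["totalPage"]
    return [URL_FORMAT.format(page) for page in range(1, total_page + 1)]


def is_past_last_page(payload):
    """True when the requested page is beyond totalPage and carries no rows."""
    data = payload["data"]
    return data["page"] > data["totalPage"] and not data["datas"]


payload = json.loads(SAMPLE)
urls = page_urls(payload)
print(len(urls), urls[0], urls[-1])
print(is_past_last_page(payload))
```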

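Coding time

The code for this case is of moderate difficulty. The core lies in handling the data returned by the server: because the data is loaded asynchronously, the response looks like Figure 1. When the callback parameter has a value, the returned data is wrapped in that value, getLiveListJsonpCallback; with the parameter value removed, the response is as shown in Figure 2.

Figure 1: response wrapped in getLiveListJsonpCallback(...)

Figure 2: plain JSON response

If you keep sending the callback parameter, the following code can be used to clean the returned data: it removes the extra prefix at the start of the response and drops the final closing parenthesis.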

res.encoding = 'utf-8'
text = res.text
text = text.replace('getLiveListJsonpCallback(', '')
text = text[:-1]
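
A slightly more defensive variant is sketched below. It strips the wrapper only when it is actually present, so a plain JSON response (callback empty or omitted) does not lose its final brace. strip_jsonp is a hypothetical helper name, and the sample strings are shortened stand-ins for a real response.

```python
import json


def strip_jsonp(text, callback="getLiveListJsonpCallback"):
    """Return the JSON body, removing a callback(...) wrapper if there is one."""
    text = text.strip()
    prefix = callback + "("
    if text.startswith(prefix) and text.endswith(")"):
        return text[len(prefix):-1]
    return text  # already plain JSON, leave it untouched


wrapped = 'getLiveListJsonpCallback({"status": 200, "data": {"totalPage": 228}})'
plain = '{"status": 200, "data": {"totalPage": 228}}'

print(json.loads(strip_jsonp(wrapped)))
print(json.loads(strip_jsonp(plain)))
```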

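Complete Huya live JSON data crawler

This case follows essentially the same logic as the previous one; it differs only slightly in how the data is requested and parsed, so it is worth comparing the two while studying.

The data obtained is stored directly in JSON format; you can change it to another format yourself.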
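
As one possible alternative format, the sketch below writes a list of records to CSV. The function name datas_to_csv and the field names in the sample rows (room_id, title, viewers) are made up for illustration; the real keys inside datas are not shown in this post and would need to be checked against an actual response.

```python
import csv
import io


def datas_to_csv(datas, out):
    """Write a list of flat dicts as CSV to a file-like object."""
    # Collect column names in first-seen order across all rows
    fieldnames = []
    for row in datas:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(datas)  # keys missing from a row are written as empty cells
    return fieldnames


# Hypothetical rows standing in for data["data"]["datas"]
sample = [
    {"room_id": 1, "title": "demo A"},
    {"room_id": 2, "title": "demo B", "viewers": 10},
]
buffer = io.StringIO()
datas_to_csv(sample, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a StringIO buffer, open it with newline="" as the csv module documentation recommends.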

import os
import random
import threading

import requests

CALLBACK_PREFIX = 'getLiveListJsonpCallback('


class Common:
    def get_headers(self):
        # Pool of User-Agent strings; extend this list with further entries as needed
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers


def run(index, url, semaphore, headers):
    # The semaphore caps concurrent requests; `with` releases it even if the request fails
    with semaphore:
        res = requests.get(url, headers=headers, timeout=5)
        res.encoding = 'utf-8'
        text = res.text.strip()
        # Strip the JSONP wrapper only when it is present. With an empty callback
        # the response is plain JSON, and cutting the last character would break it.
        if text.startswith(CALLBACK_PREFIX) and text.endswith(')'):
            text = text[len(CALLBACK_PREFIX):-1]
        save(index, text)


def save(index, text):
    os.makedirs("./虎牙", exist_ok=True)  # make sure the output directory exists
    with open(f"./虎牙/{index}.json", "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Page {index} written")


if __name__ == '__main__':
    # Request the first page to read the total page count
    first_url = 'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=&page=1'
    c = Common()
    res = requests.get(url=first_url, headers=c.get_headers(), timeout=5)
    data = res.json()
    if data['status'] != 200:
        raise SystemExit('Could not read totalPage from the first page')
    total_page = data['data']['totalPage']

    url_format = 'https://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&tagAll=0&callback=&page={}'
    # Build every page URL; + 1 so that the last page is included
    urls = [url_format.format(i) for i in range(1, total_page + 1)]
    # Allow at most 5 threads to send requests at the same time
    semaphore = threading.BoundedSemaphore(5)
    threads = []
    # start=1 so that file names match page numbers
    for i, url in enumerate(urls, start=1):
        t = threading.Thread(target=run, args=(i, url, semaphore, c.get_headers()))
        t.start()
        threads.append(t)
    # Wait for every thread to finish instead of busy-looping on active_count()
    for t in threads:
        t.join()
    print('All threads finished')
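
The script above starts one thread per page and relies on the semaphore to limit concurrency. A thread pool is a common alternative that caps the number of threads itself. The sketch below is not the approach used in this post; it swaps in concurrent.futures.ThreadPoolExecutor from the standard library, and crawl_all and fake_fetch are hypothetical names. fake_fetch stands in for the real run function so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_all(urls, fetch, max_workers=5):
    """Call fetch(index, url) for every URL using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(fetch, index, url)
            for index, url in enumerate(urls, start=1)
        ]
        # result() waits for each task and re-raises any exception from the worker
        return [future.result() for future in futures]


def fake_fetch(index, url):
    """Stand-in for a real download: returns the index and the URL length."""
    return index, len(url)


results = crawl_all(["https://example.com/a", "https://example.com/bb"], fake_fetch)
print(results)
```

With a pool, only five threads ever exist, whereas the semaphore version creates a thread for each of the 200+ pages and lets most of them wait.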

Bookmark time

Code repository: https://codechina.csdn.net/hihell/python120. Feel free to follow or star it.

