Jandan.net (煎蛋网) crawler: reverse-engineering the JS to resolve img paths
Posted Frank_Shen
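Images on the page are loaded through a JS onload event. The src attribute only points to a blank placeholder; the real address is stored, encoded, in the img-hash span: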
<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>
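As a rough illustration of what a crawler has to pick out of this markup, here is a minimal sketch that pulls the img-hash text from the snippet using only the standard library. The ImgHashParser class is a hypothetical helper written for this example; the crawler in this post uses an lxml XPath query instead.

```python
from html.parser import HTMLParser


class ImgHashParser(HTMLParser):
    """Hypothetical helper: collects the text of every <span class="img-hash">."""

    def __init__(self):
        super().__init__()
        self._in_hash = False
        self.hashes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "img-hash") in attrs:
            self._in_hash = True

    def handle_data(self, data):
        if self._in_hash:
            self.hashes.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_hash = False


snippet = (
    '<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />'
    '<span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>'
)

parser = ImgHashParser()
parser.feed(snippet)
print(parser.hashes)
```

The src attribute is ignored on purpose, since it only ever holds blank.gif.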
In the browser dev tools, open the Sources panel and find the JS function that the onload handler calls, jandan_load_img.
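Stepping through it with the JS debugger shows that the hash Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc= is passed to the function jdugRtgCtw78dflFjGXBvN6TBHAoKvZ7xu, which runs base64_decode on it to produce the img path.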
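A regular expression then replaces the size segment of the img path (mw600 in this example, matched by mw\d+) with large, which gives the address of the full-size image.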
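The two steps above can be reproduced in Python with the standard library alone. This is a minimal sketch rather than a port of the site's JS: the helper names decode_img_hash and to_large are made up for the example, and it assumes the hash is plain base64 with no extra obfuscation, as it is for the sample value in this post.

```python
import base64
import re


def decode_img_hash(img_hash: str) -> str:
    """Base64-decode an img-hash value into a protocol-relative image path."""
    return base64.b64decode(img_hash).decode("utf-8")


def to_large(img_path: str) -> str:
    """Replace the size segment (mw followed by digits) with 'large'."""
    return re.sub(r"mw\d+", "large", img_path)


img_hash = "Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc="

path = decode_img_hash(img_hash)
print(path)            # //wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg
print(to_large(path))  # //wx1.sinaimg.cn/large/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg
```

The decoded path starts with // and carries no scheme, so a crawler has to prepend http: or https: before requesting it.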
The crawler code is as follows:
import base64
import os
import re
from concurrent.futures import ThreadPoolExecutor
from random import choice

import requests
from lxml import etree

# user_agent_list is a local module that provides a list of User-Agent strings
from user_agent_list import USER_AGENTS

headers = {'user-agent': choice(USER_AGENTS)}


def fetch_url(url):
    '''
    :param url: page URL
    :return: html text, or None if the request fails
    '''
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        if r.status_code in [200, 201]:
            return r.text
    except Exception as e:
        print(e)


def downloadone(url):
    html = fetch_url(url)
    if not html:
        # fetch_url returns None on failure; skip this page
        return
    data = etree.HTML(html)
    img_hash_list = data.xpath('//*[@class="img-hash"]/text()')
    for img_hash in img_hash_list:
        # base64-decode the hash into a protocol-relative path and add the scheme
        img_path = 'http:' + bytes.decode(base64.b64decode(img_hash))
        # swap the size segment (e.g. mw600) for large
        img_path = re.sub(r'mw\d+', 'large', img_path)
        img_name = img_path.rsplit('/', 1)[1]
        with open('jiandan/' + img_name, 'wb') as f:
            r = requests.get(img_path)
            f.write(r.content)


def main():
    # make sure the output directory exists before any download starts
    os.makedirs('jiandan', exist_ok=True)
    url_list = []
    for page in range(1, 44):
        url = 'http://jandan.net/ooxx/page-{}'.format(page)
        url_list.append(url)
    with ThreadPoolExecutor(4) as executor:
        executor.map(downloadone, url_list)


if __name__ == '__main__':
    main()