Jandan.net (煎蛋网) Crawler: Reverse-Engineering the JS That Resolves img Paths

Posted by Frank_Shen

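Images are loaded by a JS onload handler: the src attribute only points at a blank.gif placeholder, and the real path is carried in the img-hash span.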

<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>
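As a minimal sketch of how the hash can be pulled out of markup like the snippet above, the following uses only the standard library's html.parser. The crawler later in this post uses lxml with XPath instead. The class name ImgHashParser and the SAMPLE constant are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Markup copied from the snippet above.
SAMPLE = (
    '<p><img src="//img.jandan.net/img/blank.gif" '
    'onload="jandan_load_img(this)" />'
    '<span class="img-hash">Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6'
    'NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc=</span></p>'
)


class ImgHashParser(HTMLParser):
    """Collect the text of every <span class="img-hash"> element."""

    def __init__(self):
        super().__init__()
        self.hashes = []
        self._in_hash = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "img-hash") in attrs:
            self._in_hash = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_hash = False

    def handle_data(self, data):
        if self._in_hash:
            self.hashes.append(data.strip())


parser = ImgHashParser()
parser.feed(SAMPLE)
print(parser.hashes)
```

Each collected string is still encoded at this point; it is not yet a usable image path.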

In the Sources panel of the browser's developer tools, locate the corresponding JS function, jandan_load_img.

Stepping through the JS in the debugger shows that Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc= is passed to the function jdugRtgCtw78dflFjGXBvN6TBHAoKvZ7xu, which applies base64_decode to obtain the img path.
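The decoding step can be reproduced outside the browser. The sketch below assumes, as the debugging session suggests, that the hash is plain Base64 with no extra key or salt. decode_img_hash is a hypothetical helper name.

```python
import base64

# Hash copied from the img-hash span shown earlier in the post.
IMG_HASH = (
    "Ly93eDEuc2luYWltZy5jbi9tdzYwMC8wMDd1ejNLN2x5MWZ6"
    "NmVub3ExdHhqMzB1MDB1MGFkMC5qcGc="
)


def decode_img_hash(img_hash: str) -> str:
    """Return the protocol-relative image path encoded in an img-hash value."""
    return base64.b64decode(img_hash).decode("utf-8")


decoded = decode_img_hash(IMG_HASH)
print(decoded)  # //wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg
```

The result is protocol-relative (it starts with //), so a scheme such as http: has to be prepended before it can be requested.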

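A regular expression then replaces the size segment of the img path (the (\w+) group, mw600 in this example) with large, which yields the full-size image.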
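The substitution can be sketched as follows. to_large uses the same mw\d+ pattern as the crawler in this post. to_large_any_size is a hypothetical, stricter variant that replaces whatever segment directly follows the host. The thumb180 size in the second call is an assumed example, not a value taken from the site.

```python
import re


def to_large(img_path: str) -> str:
    """Swap an mw<digits> size segment for 'large' (the crawler's pattern)."""
    return re.sub(r"mw\d+", "large", img_path)


def to_large_any_size(img_path: str) -> str:
    """Hypothetical variant: replace the first path segment after the host."""
    return re.sub(r"^(//[^/]+/)[^/]+/", r"\1large/", img_path)


# Path obtained by Base64-decoding the hash shown earlier.
sample = "//wx1.sinaimg.cn/mw600/007uz3K7ly1fz6enoq1txj30u00u0ad0.jpg"
print(to_large(sample))
print(to_large_any_size("//wx1.sinaimg.cn/thumb180/example.jpg"))
```

The anchored variant cannot accidentally match inside a file name that happens to contain mw followed by digits, at the cost of assuming the size is always the first path segment.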

The crawler code is as follows:

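import base64
import os
import re
import requests
from concurrent.futures import ThreadPoolExecutor
from random import choice
from lxml import etree
from user_agent_list import USER_AGENTS  # local module holding a list of User-Agent strings

headers = {'user-agent': choice(USER_AGENTS)}
SAVE_DIR = 'jiandan'


def fetch_url(url):
    '''
    :param url: page URL
    :return: HTML text, or None if the request fails
    '''
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException as e:
        print(e)
        return None


def downloadone(url):
    html = fetch_url(url)
    if not html:
        # Skip pages that could not be fetched.
        return
    data = etree.HTML(html)
    img_hash_list = data.xpath('//*[@class="img-hash"]/text()')
    for img_hash in img_hash_list:
        # The hash is the Base64-encoded, protocol-relative image path.
        img_path = 'http:' + base64.b64decode(img_hash).decode('utf-8')
        # Swap the size segment (e.g. mw600) for the full-size variant.
        img_path = re.sub(r'mw\d+', 'large', img_path)
        img_name = img_path.rsplit('/', 1)[1]
        try:
            r = requests.get(img_path, headers=headers, timeout=10)
            r.raise_for_status()
        except requests.RequestException as e:
            print(e)
            continue
        with open(os.path.join(SAVE_DIR, img_name), 'wb') as f:
            f.write(r.content)


def main():
    # Create the output directory before any worker writes to it.
    os.makedirs(SAVE_DIR, exist_ok=True)
    url_list = ['http://jandan.net/ooxx/page-{}'.format(page) for page in range(1, 44)]
    with ThreadPoolExecutor(4) as executor:
        # Consume the iterator so exceptions raised in workers surface.
        list(executor.map(downloadone, url_list))


if __name__ == '__main__':
    main()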

  
