Python crawler: fetching the next page

Posted brady-wang

Preface: this post covers how to fetch the next page with a Python crawler; hopefully it serves as a useful reference.

from time import sleep

import faker
import requests
from lxml import etree

fake = faker.Faker()

base_url = "http://angelimg.spbeen.com"

def get_next_link(url):
    # Return the absolute URL of the "next page" link, or False on the last page.
    content = downloadHtml(url)
    html = etree.HTML(content)
    next_url = html.xpath("//a[@class='ch next']/@href")
    if next_url:
        return base_url + next_url[0]
    else:
        return False

def downloadHtml(url):
    # Fetch a page with a random User-Agent and a Referer so the site serves it.
    user_agent = fake.user_agent()
    headers = {"User-Agent": user_agent, "Referer": "http://angelimg.spbeen.com/"}
    response = requests.get(url, headers=headers)
    return response.text

def getImgUrl(content):
    # Extract the image URL and the page title from the article markup.
    html = etree.HTML(content)
    img_url = html.xpath('//*[@id="content"]/a/img/@src')
    title = html.xpath("//div[@class='article']/h2/text()")

    return img_url[0], title[0]

def saveImg(title, img_url):
    # Download the image and save it as txt/<title>.jpg.
    if img_url is not None and title is not None:
        with open("txt/" + str(title) + ".jpg", "wb") as f:
            user_agent = fake.user_agent()
            headers = {"User-Agent": user_agent, "Referer": "http://angelimg.spbeen.com/"}
            content = requests.get(img_url, headers=headers)
            # request_view(content)
            f.write(content.content)

def request_view(response):
    # Debug helper: dump the response to tmp.html (injecting a <base> tag so
    # relative links resolve) and open it in the browser.
    import webbrowser
    request_url = response.url
    base_url = '<head><base href="%s">' % request_url
    base_url = base_url.encode()
    content = response.content.replace(b"<head>", base_url)
    with open("tmp.html", "wb") as tem_html:
        tem_html.write(content)
    webbrowser.open_new_tab("tmp.html")

def crawl_img(url):
    # Download one page, pull out its image URL and title, and save the image.
    content = downloadHtml(url)
    img_url, title = getImgUrl(content)
    saveImg(title, img_url)

if __name__ == "__main__":
    url = "http://angelimg.spbeen.com/ang/4968/1"

    while url:
        print(url)
        crawl_img(url)
        sleep(1)  # throttle requests; this is what time.sleep is imported for
        url = get_next_link(url)

Another approach is to read the total page count from the first page and then loop over the page numbers, as sketched below.
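A minimal sketch of that variant, reusing the helpers above. It assumes the first page's pagination links expose the total page count (the //div[@class='page']//a/text() selector is a guess, not taken from the site) and that page URLs follow the /ang/<id>/<page> pattern seen in __main__:

import re

def crawl_by_page_count(first_page_url):
    # Hypothetical selector: collect the pagination link texts and take the
    # largest number as the total page count. Adjust to the real markup.
    content = downloadHtml(first_page_url)
    html = etree.HTML(content)
    texts = html.xpath("//div[@class='page']//a/text()")
    total = max(int(t) for t in texts if t.strip().isdigit())
    # Page URLs follow the /ang/<id>/<page> pattern, so strip the trailing
    # page number and rebuild it for each page.
    prefix = re.sub(r"/\d+$", "", first_page_url)
    for page in range(1, total + 1):
        crawl_img("%s/%d" % (prefix, page))
        sleep(1)  # throttle requests

Compared with following the "next" link, this version lets you resume from a known page number, but it only works when the page count is actually displayed.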
