Batch-Downloading Website Images with Python

Posted 2020-08-15

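When you surf the web, there are always little "waves" that delight you. Yes, the waves are beautiful pictures. Saving them as you browse works well enough, but flowers do not bloom forever and good scenery does not last, and saving pictures one at a time with "Save as" is tedious. Can they be downloaded in bulk? As long as you can get the image addresses, it is not hard.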

Goal

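太平洋摄影网 (the Pacific photography site, dp.pconline.com.cn) is a fine photography site. If you like natural scenery, go and feast your eyes there. After the feast you may want to take some of it home. That is not hard, so let's follow the trail step by step (I was too lazy to take screenshots; open the site and look for yourself).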

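First, open http://dp.pconline.com.cn/list/all_t145.html and a wall of lovely thumbnails appears. Click any of them and you land on the page of the first picture in a series: http://dp.pconline.com.cn/photo/3687487.html . Click again to reach the page of the second picture: http://dp.pconline.com.cn/photo/3687487_2.html . Clicking "查看原图" (View original) below the picture jumps to http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 , which shows a gorgeous high-resolution image. Right-click and choose "Save as" to store it locally. Perhaps you are already itching to get started.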

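Where to begin? Anyone who has done web development knows that the browser's developer console shows a page's HTML, and that this HTML either contains the image or links to another HTML page that does. In the case above, http://dp.pconline.com.cn/list/all_t145.html is the entry page of a broad topic (for example, nature is t145 and architecture is t292); call it EntryHtml. The entry page contains many links to child HTML pages, each one a photo series shot by a photographer with a distinct style under that topic; call each of these SerialHtml. Each SerialHtml in turn leads to one HTML page per picture in the series, called picHtml. A picHtml contains a "View original" link pointing to the high-resolution page http://dp.pconline.com.cn/public/photo/source_photo.jsp?id=19706865&photoId=3687487 , called picOriginLink. Finally, the img element inside picOriginLink gives the real address of the high-resolution image, picOrigin. (⊙v⊙) Hmm, that is a little dizzying, so let's summarize: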

EntryHtml (topic entry page) -> SerialHtml (series entry page) -> picHtml (page for one picture in the series) -> picOriginLink (high-resolution image page) -> picOrigin (real address of the high-resolution image)

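Now we need to work out how these five levels are connected.

Inspecting the HTML elements shows that:

(1) SerialHtml is an a element with class="picLink" in the EntryHtml page;

(2) picHtml is SerialHtml with a sequence number appended. For example, if SerialHtml is http://dp.pconline.com.cn/photo/3687487.html and the series has 8 pictures, then picHtml = http://dp.pconline.com.cn/photo/3687487_[1-8].html . Note that http://dp.pconline.com.cn/photo/3687487.html and http://dp.pconline.com.cn/photo/3687487_1.html are equivalent, which simplifies the code;

(3) "View original" is a link to the page xxx.jsp that holds the high-resolution image: it is the a element with class="aView aViewHD" in the picHtml page;

(4) Finally, in the xxx.jsp page, find the img element whose src ends with an image file extension.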
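The four relationships above can be tried out offline. The following is a minimal sketch, not the script used in this post: it uses only the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages, and `LinkCollector`, `collect`, and the three HTML fragments are hypothetical stand-ins shaped like the pages described above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect one attribute from every `tag` that carries all wanted classes."""

    def __init__(self, tag, attr, wanted_classes=()):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.wanted = set(wanted_classes)
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        value = attr_map.get(self.attr)
        # An empty `wanted` set matches any element of the right tag.
        if value and self.wanted <= classes:
            self.found.append(value)


def collect(html, tag, attr, wanted_classes=()):
    parser = LinkCollector(tag, attr, wanted_classes)
    parser.feed(html)
    return parser.found


# Hypothetical fragments; real pages contain far more markup.
entry_html = '<a class="picLink" href="http://dp.pconline.com.cn/photo/3687487.html">x</a>'
pic_html = ('<a class="aView aViewHD" href="http://dp.pconline.com.cn/public/photo/'
            'source_photo.jsp?id=19706865&photoId=3687487">HD</a>')
origin_html = ('<img src="http://img.pconline.com.cn/images/upload/upc/tx/photoblog/'
               '1610/21/c9/28691979_1477032141707.jpg">')

serial_links = collect(entry_html, "a", "href", ["picLink"])         # rule (1)
hd_links = collect(pic_html, "a", "href", ["aView", "aViewHD"])      # rule (3)
images = [s for s in collect(origin_html, "img", "src")
          if s.lower().endswith(".jpg")]                             # rule (4)
print(serial_links, hd_links, images)
```

Matching on a set of classes rather than on the exact attribute string is a design choice of this sketch; it keeps working if the site reorders the class names.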

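The overall approach is therefore:

STEP1: Fetch the content of EntryHtml, entryContent;

STEP2: Parse entryContent and find the list of a elements with class="picLink", SerialHtmlList;

STEP3: For each page SerialHtml_i in SerialHtmlList:

(1) Fetch the page of its first picture and parse out the total number of pictures, total;

(2) Use total to generate the total picture-page links, picHtmlList, and then for each page in picHtmlList:

a. Find the a element with class="aView aViewHD", hdLink;

b. Fetch the page that hdLink points to and find the img element to obtain the real image address, picOrigin;

c. Download picOrigin.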
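The steps above can be sketched end to end as follows. This is a minimal offline illustration under stated assumptions, not the author's implementation: `crawl`, `pic_page_urls`, and the example.com pages in `FAKE_SITE` are hypothetical, `fetch` is an injected callable, and simple regular expressions stand in for the BeautifulSoup lookups only to keep the sketch free of dependencies.

```python
import re


def pic_page_urls(serial_url, total):
    """STEP3 (2): derive the per-picture page URLs from the series URL."""
    base = serial_url.rsplit(".", 1)[0]
    return ["%s_%d.html" % (base, i) for i in range(1, total + 1)]


def crawl(entry_url, fetch):
    """Walk EntryHtml -> SerialHtml -> picHtml -> picOriginLink -> picOrigin.

    `fetch` maps a URL to HTML text, or None on failure. The function
    returns the picOrigin URLs instead of downloading them.
    """
    origins = []
    entry = fetch(entry_url) or ""                                    # STEP1
    serials = re.findall(r'<a class="picLink" href="([^"]+)"', entry)  # STEP2
    for serial_url in serials:                                        # STEP3
        first = fetch(serial_url) or ""
        m = re.search(r"\(\d+/(\d+)\)", first)                        # (1) total
        if not m:
            continue
        for pic_url in pic_page_urls(serial_url, int(m.group(1))):    # (2)
            page = fetch(pic_url) or ""
            hd = re.search(r'<a class="aView aViewHD" href="([^"]+)"', page)  # a.
            if not hd:
                continue
            origin_page = fetch(hd.group(1)) or ""                    # b.
            src = re.search(r'<img src="([^"]+\.jpg)"', origin_page)
            if src:
                origins.append(src.group(1))                          # c. would download
    return origins


# Hypothetical miniature site: one series containing two pictures.
FAKE_SITE = {
    "http://example.com/list/all_t1_p1.html":
        '<a class="picLink" href="http://example.com/photo/7.html">s</a>',
    "http://example.com/photo/7.html": '<span class="totPics">(1/2)</span>',
    "http://example.com/photo/7_1.html":
        '<a class="aView aViewHD" href="http://example.com/hd?p=1">HD</a>',
    "http://example.com/photo/7_2.html":
        '<a class="aView aViewHD" href="http://example.com/hd?p=2">HD</a>',
    "http://example.com/hd?p=1": '<img src="http://example.com/img/a.jpg">',
    "http://example.com/hd?p=2": '<img src="http://example.com/img/b.jpg">',
}

result = crawl("http://example.com/list/all_t1_p1.html", FAKE_SITE.get)
print(result)
```

Passing `fetch` in as a parameter keeps the traversal logic testable without touching the network; a real run would pass a function that wraps an HTTP client.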

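Note that a topic spans several pages. For example, the first page is EntryHtml, http://dp.pconline.com.cn/list/all_t145.html , and the second page is http://dp.pconline.com.cn/list/all_t145_p2.html . The first page is equivalent to http://dp.pconline.com.cn/list/all_t145_p1.html , which simplifies the code. To download the series on several pages of a topic, just add one more loop at the outermost level. That is the flow of the serial version.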
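As a small illustration of that outer loop, the page URLs can be generated from the topic number and a page range. `entry_page_urls` is a hypothetical helper name; the URL pattern is the one described above.

```python
def entry_page_urls(topic_id, first_page, last_page):
    """Build the entry-page URL for every page of a topic, inclusive."""
    template = "http://dp.pconline.com.cn/list/all_t%d_p%d.html"
    return [template % (topic_id, page)
            for page in range(first_page, last_page + 1)]


# Pages 1 to 3 of the nature topic (t145). Page 1 uses the "_p1" form, which
# is equivalent to the plain all_t145.html address.
urls = entry_page_urls(145, 1, 3)
for url in urls:
    print(url)
```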

Serial implementation:

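#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re
import sys
import requests
from bs4 import BeautifulSoup

saveDir = os.environ['HOME'] + '/joy/pic/pconline/nature'

# The download step writes into saveDir, so make sure it exists.
if not os.path.isdir(saveDir):
    os.makedirs(saveDir)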

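def catchExc(func):
    # Decorator: log any exception raised by func and return None instead.
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # Format args/kwargs with %r; str(*args) breaks with several arguments.
            print("error: caught exception in %s(%r, %r): %s" % (func.__name__, args, kwargs, e))
            return None
    return _deco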


@catchExc
def getSoup(url):
    '''
       get the html content of url and transform into soup object
           in order to parse what i want later
    '''
    result = requests.get(url)
    status = result.status_code
    if status != 200:
        return None
    resp = result.text
    soup = BeautifulSoup(resp, "lxml")
    return soup

@catchExc
def parseTotal(soup):
    '''
      parse total number of pics in html tag <span class="totPics"> (1/total)</span>
    '''
    totalNode = soup.find('span', class_='totPics')
    total = int(totalNode.text.split('/')[1].replace(')', ''))
    return total

@catchExc
def buildSubUrl(href, ind):
    '''
    if href is http://dp.pconline.com.cn/photo/3687736.html, total is 10
    then suburl is
        http://dp.pconline.com.cn/photo/3687736_[1-10].html
    which contains the link to the original picture
    '''
    return href.rsplit('.', 1)[0] + "_" + str(ind) + '.html'

@catchExc
def download(piclink):
    '''
       download pic from pic href such as
            http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''

    picsrc = piclink.attrs['src']
    picname = picsrc.rsplit('/', 1)[1]
    saveFile = saveDir + '/' + picname

    # Stream the image to disk in 1 KB chunks; the with-block closes the file.
    picr = requests.get(picsrc, stream=True)
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

@catchExc
def downloadForASerial(serialHref):
    '''
       download a serial of pics
    '''

    href = serialHref
    subsoup = getSoup(href)
    total = parseTotal(subsoup)
    print('href: %s *** total: %s' % (href, total))

    for ind in range(1, total+1):
        suburl = buildSubUrl(href, ind)
        print("suburl: %s" % suburl)
        subsoup = getSoup(suburl)

        # "View original" link -> page that holds the high-resolution image
        hdlink = subsoup.find('a', class_='aView aViewHD')
        picurl = hdlink.attrs['href']

        picsoup = getSoup(picurl)
        piclink = picsoup.find('img', src=re.compile(r"\.jpg"))
        download(piclink)

@catchExc
def downloadAllForAPage(entryurl):
    '''
       download serial pics in a page
    '''

    soup = getSoup(entryurl)
    if soup is None:
        return
    #print soup.prettify()
    picLinks = soup.find_all('a', class_='picLink')
    if len(picLinks) == 0:
        return
    # A list (not map) so that len() also works on Python 3.
    hrefs = [link.attrs['href'] for link in picLinks]
    print('serials in a page: %d' % len(hrefs))

    for serialHref in hrefs:
        downloadForASerial(serialHref)

def downloadEntryUrl(serial_num, index):
    entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html' % (serial_num, index)
    print("entryUrl: %s" % entryUrl)
    downloadAllForAPage(entryUrl)
    return 0

def downloadAll(serial_num):
    # Pages start..end (inclusive) of the topic are downloaded one by one.
    start = 1
    end = 2
    return [downloadEntryUrl(serial_num, index) for index in range(start, end+1)]

serial_num = 145

if __name__ == '__main__':
    downloadAll(serial_num)

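Clearly, the serial version is slow: the CPU spends most of its time waiting on network connections and I/O. The usual ways to improve performance are:

(1) Use multiple threads to isolate the I/O-bound operations so that the CPU is not left waiting;

(2) Turn one-at-a-time loops into batch operations to make better use of concurrency;

(3) Use multiple processes for CPU-bound operations to take fuller advantage of multiple cores.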
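Measure (1) can be illustrated without touching the network. The sketch below is an assumption-laden toy, not a benchmark: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a request, and the example.com URLs are placeholders. It uses the thread-backed pool from `multiprocessing.dummy`.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_fetch(url):
    """Stand-in for a network request: wait a little, then return a fake body."""
    time.sleep(0.1)
    return "body of " + url


urls = ["http://example.com/page%d" % i for i in range(8)]

# Serial: the waits add up, one after another.
start = time.time()
serial = [fake_fetch(u) for u in urls]
serial_secs = time.time() - start

# Threaded: up to 4 waits overlap, so the batch finishes sooner.
pool = ThreadPool(4)
start = time.time()
parallel = pool.map(fake_fetch, urls)  # results keep the order of `urls`
pool.close()
pool.join()
parallel_secs = time.time() - start

print("serial: %.2fs, threaded: %.2fs" % (serial_secs, parallel_secs))
```

Threads help here because the work is waiting rather than computing; for CPU-bound work, measure (3) with separate processes is usually the better fit.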

Batch concurrent version (it seems to create rather a lot of thread pools and is somewhat unstable; to be optimized later):

#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re
import sys
from multiprocessing.dummy import Pool as ThreadPool

import requests
from bs4 import BeautifulSoup

saveDir = os.environ['HOME'] + '/joy/pic/pconline'
dwpicPool = ThreadPool(20)

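def catchExc(func):
    # Decorator: log any exception raised by func and return None instead.
    def _deco(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            # Format args/kwargs with %r; str(*args) breaks with several arguments.
            print("error: caught exception in %s(%r, %r): %s" % (func.__name__, args, kwargs, e))
            return None
    return _deco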


@catchExc
def batchGetSoups(urls):
    '''
       get the html content of each url concurrently and transform it into
           soup objects in order to parse what i want later
    '''

    urlnum = len(urls)
    if urlnum == 0:
        return []

    # One worker thread per url, so every request is in flight at once.
    getUrlPool = ThreadPool(urlnum)
    results = []
    for i in range(urlnum):
        results.append(getUrlPool.apply_async(requests.get, (urls[i], )))
    getUrlPool.close()
    getUrlPool.join()

    soups = []
    for res in results:
        r = res.get(timeout=1)
        status = r.status_code

        # Skip pages that did not come back with HTTP 200.
        if status != 200:
            continue
        resp = r.text
        soup = BeautifulSoup(resp, "lxml")
        soups.append(soup)
    return soups

@catchExc
def parseTotal(soup):
    '''
      parse total number of pics in html tag <span class="totPics"> (1/total)</span>
    '''
    totalNode = soup.find('span', class_='totPics')
    total = int(totalNode.text.split('/')[1].replace(')', ''))
    return total

@catchExc
def buildSubUrl(href, ind):
    '''
    if href is http://dp.pconline.com.cn/photo/3687736.html, total is 10
    then suburl is
        http://dp.pconline.com.cn/photo/3687736_[1-10].html
    which contains the link to the original picture
    '''
    return href.rsplit('.', 1)[0] + "_" + str(ind) + '.html'

@catchExc
def downloadPic(piclink):
    '''
       download pic from pic href such as
            http://img.pconline.com.cn/images/upload/upc/tx/photoblog/1610/21/c9/28691979_1477032141707.jpg
    '''

    picsrc = piclink.attrs['src']
    picname = picsrc.rsplit('/', 1)[1]
    saveFile = saveDir + '/' + picname

    # Stream the image to disk in 1 KB chunks; the with-block closes the file.
    picr = requests.get(picsrc, stream=True)
    with open(saveFile, 'wb') as f:
        for chunk in picr.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

@catchExc
def getOriginPicLink(subsoup):
    hdlink = subsoup.find('a', class_='aView aViewHD')
    return hdlink.attrs['href']

@catchExc
def downloadForASerial(serialHref):
    '''
       download a serial of pics
    '''

    href = serialHref
    subsoups = batchGetSoups([href])
    total = parseTotal(subsoups[0])
    print('href: %s *** total: %s' % (href, total))

    suburls = [buildSubUrl(href, ind) for ind in range(1, total+1)]
    subsoups = batchGetSoups(suburls) or []
    # Lists (not map) so len() works on Python 3; drop pages with no HD link.
    picUrls = [url for url in (getOriginPicLink(s) for s in subsoups) if url]
    picSoups = batchGetSoups(picUrls) or []
    piclinks = [picsoup.find('img', src=re.compile(r"\.jpg")) for picsoup in picSoups]
    dwpicPool.map_async(downloadPic, piclinks)

def downloadAllForAPage(entryurl):
    '''
       download serial pics in a page
    '''

    soups = batchGetSoups([entryurl])
    # batchGetSoups returns None on error and [] when nothing was fetched.
    if not soups:
        return

    soup = soups[0]
    #print soup.prettify()
    picLinks = soup.find_all('a', class_='picLink')
    if len(picLinks) == 0:
        return
    hrefs = [link.attrs['href'] for link in picLinks]

    for serialHref in hrefs:
        downloadForASerial(serialHref)

def downloadAll(serial_num, start, end):
    entryUrl = 'http://dp.pconline.com.cn/list/all_t%d_p%d.html'
    entryUrls = [ (entryUrl % (serial_num, ind)) for ind in range(start, end+1)]
    taskpool = ThreadPool(20)
    taskpool.map_async(downloadAllForAPage, entryUrls)
    taskpool.close()
    taskpool.join()
    # Wait for the queued image downloads too; otherwise the script may exit
    # while dwpicPool's daemon threads are still writing files.
    dwpicPool.close()
    dwpicPool.join()

if __name__ == '__main__':
    serial_num = 145
    downloadAll(serial_num, 1, 2)
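One possible direction for the "optimize later" note is to replace the per-batch pools with a single bounded pool and retry failed requests. The sketch below is only an assumption about how that could look, not the author's follow-up: `fetch_all`, `flaky`, and their parameters are hypothetical, and `fetch` is any callable that returns a result or raises.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8, retries=2):
    """Fetch every URL with at most `max_workers` threads.

    Each URL is tried up to `retries + 1` times; None marks a URL that
    never succeeded. Results keep the order of `urls`.
    """

    def attempt(url):
        for _ in range(retries + 1):
            try:
                return fetch(url)
            except Exception:
                continue
        return None

    if not urls:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(attempt, urls))


# Hypothetical flaky fetcher: the first call per URL fails, "bad" always fails.
calls = {}
calls_lock = threading.Lock()


def flaky(url):
    with calls_lock:
        calls[url] = calls.get(url, 0) + 1
        seen = calls[url]
    if url.endswith("bad") or seen == 1:
        raise IOError("simulated network error for " + url)
    return url.upper()


results = fetch_all(["a", "b", "bad"], flaky, max_workers=2, retries=2)
print(results)
```

Capping `max_workers` bounds the number of simultaneous connections regardless of how many URLs a series contains, which is the main difference from creating one thread per URL.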
