How can I use multiprocessing to speed up bs4 scraping and image downloading

Posted: 2022-01-04 23:27:42

【Question】:

So I have this code:

from bs4 import *
import requests
import os
import pandas
df = pandas.read_csv(r'C:\Users\fani\Desktop\History.csv')

folder_name = "downloadedpics"
os.mkdir(folder_name)

z=1

for j in df['url']:

    # DOWNLOAD ALL IMAGES FROM THAT URL
    def download_images(images, folder_name):
        # initial count is zero
        count = 0

        # print total images found in URL
        print(f"Total len(images) Image Found!")

        # checking if images is not zero
        if len(images) != 0:
            for i, image in enumerate(images):
                # From image tag ,Fetch image Source URL

                # 1.data-srcset
                # 2.data-src
                # 3.data-fallback-src
                # 4.src

                # Here we will use exception handling

                # first we will search for "data-srcset" in img tag
                try:
                    # In image tag ,searching for "data-srcset"
                    image_link = image["data-srcset"]

                # then we will search for "data-src" in img
                # tag and so on..
                except:
                    try:
                        # In image tag ,searching for "data-src"
                        image_link = image["data-src"]
                    except:
                        try:
                            # In image tag ,searching for "data-fallback-src"
                            image_link = image["data-fallback-src"]
                        except:
                            try:
                                # In image tag ,searching for "src"
                                image_link = image["src"]

                            # if no Source URL found
                            except:
                                pass

                # After getting Image Source URL
                # We will try to get the content of image
                try:
                    r = requests.get(image_link).content
                    with open(f"folder_name/zimagesi + 1.jpg", "wb+") as f:
                        f.write(r)

                    # counting number of image downloaded
                    count += 1
                except:
                    pass

            # There might be possible, that all
            # images not download
            # if all images download
            if count == len(images):
                print("All Images Downloaded!")

            # if all images not download
            else:
                print(f"Total count Images Downloaded Out of len(images)")


    # MAIN FUNCTION START
    def main(url):
        # content of URL
        r = requests.get(url)

        # Parse html Code
        soup = BeautifulSoup(r.text, 'html.parser')

        # find all images in URL
        images = soup.findAll('img', class_='pannable-image')

        # Call folder create function
        download_images(images, folder_name)


    # take url
    url = j

    # CALL MAIN FUNCTION
    main(url)
    print(z)
    z = z + 1

It scrapes a bunch of URLs (listed in history.csv) and downloads some images from them. The only problem is that it is really slow for such a simple task. What is the correct way to implement multiprocessing to speed it up? I'm new to this and I don't know how multiprocessing works.

Edit: Here is the csv file: mega link

The code is supposed to download about 12,000 images (roughly 1 GB of data) from 1648 web pages (the gallery sections of pages on that e-commerce site).

【Comments】:

One point for optimizing requests is to use async for the I/O-bound work rather than multiprocessing.
Can you show me how that is done? How much time would that approach save?

【Answer 1】:

Since you are already using the requests package, the obvious way to go is multithreading rather than asyncio, which would require you to drop requests and learn aiohttp.

I have restructured the code quite a bit, and since I don't have access to your CSV file I have not been able to test it, so I strongly suggest that you look over what I have done and try to understand it as well as you can by reading the Python documentation for the classes and methods that are new to you. I don't understand why you try to decode an image file after you have retrieved it. I suppose you expect that to raise an error, but it seems like a waste of time.

I have arbitrarily set the thread pool size to 100 (multithreading can easily handle a pool several times that size, although asyncio can handle thousands of concurrent tasks). Set N_THREADS to the number of URLs multiplied by the average number of images to be downloaded per URL, but no larger than 500.
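
To make that sizing rule concrete, here is a rough sketch (AVG_IMAGES_PER_URL and its value of 8 are assumptions for illustration, not something measured from the CSV; df is the DataFrame loaded from History.csv):

# hypothetical sizing rule: roughly one thread per expected download, capped at 500
AVG_IMAGES_PER_URL = 8  # assumption -- estimate this from your own pages
N_THREADS = min(500, len(df['url']) * AVG_IMAGES_PER_URL)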

from bs4 import *
import requests
import os
import pandas
from multiprocessing.pool import ThreadPool
from functools import partial
from threading import Lock

    
class FileIndex:
    """
    Thread-safe counter: increment and return the next index
    to use when naming a downloaded file.
    """
    
    def __init__(self):
        self._lock = Lock()
        self._file_index = 0

    @property
    def next_file_index(self):
        with self._lock:
            self._file_index += 1
            return self._file_index


# DOWNLOAD AN IMAGE FROM THAT URL
def download_image(image, session, file_index, folder_number, folder_name):
    # From image tag ,Fetch image Source URL

    # 1.data-srcset
    # 2.data-src
    # 3.data-fallback-src
    # 4.src

    # Here we will use exception handling

    # first we will search for "data-srcset" in img tag
    try:
        # In image tag ,searching for "data-srcset"
        image_link = image["data-srcset"]

    # then we will search for "data-src" in img
    # tag and so on..
    except:
        try:
            # In image tag ,searching for "data-src"
            image_link = image["data-src"]
        except:
            try:
                # In image tag ,searching for "data-fallback-src"
                image_link = image["data-fallback-src"]
            except:
                try:
                    # In image tag ,searching for "src"
                    image_link = image["src"]

                # if no Source URL found
                except:
                    return 0 # no image loaded

    # After getting Image Source URL
    # We will try to get the content of image
    try:
        r = session.get(image_link).content
        # Why are you trying to decode an image?
        try:
            # possibility of decode
            r = str(r, 'utf-8')
            return 0 # no error return 0 ?????

        except UnicodeDecodeError:

            # After checking above condition, Image Download start
            with open(f"folder_name/folder_numberimagesfile_index.next_file_index.jpg", "wb+") as f:
                f.write(r)

            # counting number of image downloaded
            return 1 # 1 downloaded
    except:
        return 0 # 0 downloaded

# download_url FUNCTION START
def download_url(folder_number, url, session, folder_name, thread_pool):
    # content of URL
    r = session.get(url)

    # Parse HTML Code
    soup = BeautifulSoup(r.text, 'html.parser')

    # find all images in URL
    images = soup.findAll('img', class_='pannable-image')

    # Call folder create function
    worker = partial(download_image,
                     session=session,
                     file_index=FileIndex(),
                     folder_number=folder_number,
                     folder_name=folder_name)
    counts = thread_pool.map(worker, images)
    total_counts = sum(counts)
    if total_counts == len(images):
        print(f"All Images Downloaded for URL url!")
    else:
        print(f"Total total_counts Images Downloaded Out of len(images) for URL url")

# The real main function:
def main():
    df = pandas.read_csv(r'C:\Users\fani\Desktop\History.csv')
    folder_name = "downloadedpics"
    os.mkdir(folder_name)
    
    N_THREADS_URLS = 50 # or some suitable size for retrieving URLS
    N_THREADS_IMAGES = 500 # or some suitable size for retrieving images

    # use a session for efficiency:
    with requests.Session() as session, \
    ThreadPool(N_THREADS_URLS) as thread_pool_urls, \
    ThreadPool(N_THREADS_IMAGES) as thread_pool_images:
        worker = partial(download_url,
                         session=session,
                         folder_name=folder_name,
                         thread_pool=thread_pool_images)
        # iterate over the URL column (iterating df directly would yield column names)
        results = thread_pool_urls.starmap(worker, enumerate(df['url']))


if __name__ == '__main__':
    main()
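
For comparison, here is a minimal sketch of the asyncio route mentioned in the comments (not what this answer recommends). It is an illustration only: the function names are made up, it keeps the same img selector and file-naming scheme, and it drops the data-src fallbacks and error handling for brevity.

import asyncio

import aiohttp
from bs4 import BeautifulSoup


async def fetch_page_images(session, url, folder_name, page_number):
    # fetch the page HTML asynchronously
    async with session.get(url) as resp:
        html = await resp.text()

    soup = BeautifulSoup(html, 'html.parser')
    for i, img in enumerate(soup.find_all('img', class_='pannable-image')):
        link = img.get('data-srcset') or img.get('data-src') or img.get('src')
        if not link:
            continue
        # download the image bytes
        async with session.get(link) as resp:
            data = await resp.read()
        # note: the file write itself is blocking; acceptable for a sketch
        with open(f"{folder_name}/{page_number}images{i + 1}.jpg", "wb") as f:
            f.write(data)


async def main_async(urls, folder_name):
    # one shared session; aiohttp caps concurrent connections (100 by default)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch_page_images(session, url, folder_name, n)
                               for n, url in enumerate(urls)))

# asyncio.run(main_async(df['url'], folder_name))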

【Comments】:

Thanks, it works and it is really fast. But for some reason it doesn't save all the pictures, and the number of saved pictures varies between runs on the same data and pages. It also apparently stops processing after a certain number of URLs if n_threads is smaller than the number of URLs (I think). I have also added a link to my csv in the OP. I guess this isn't working out for me and I need to study multithreading and/or aiohttp more, because I don't fully understand them. Thanks again for your help.
I followed the link and it says "The file you requested has been deleted."
Yes, N_THREADS has to be larger than the number of URLs you have. I have updated the answer to instead use two separate thread pools, one for retrieving the URLs and one for retrieving the images. Each could in theory be as small as 1, but set them to sizes that approximate the number of URLs and images you have respectively, while still keeping an upper limit of around 500.
Have you tried the updated code? And what about the missing CSV file?
It does work, but like I said I have about 1700 URLs and about 12,000 images, so 500 won't do it for me. When I set N_THREADS_URLS = 1700 and N_THREADS_IMAGES = 12000, it kills my weak system that runs mostly on RAM (but it does seem to be working; I need to run more tests to be sure). I have also updated the csv file link.
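
An editorial sketch, not from the thread above: with the answer's two-pool design, the pool sizes only cap how many downloads run at once, so they can stay far below 1700 URLs / 12,000 images and every task still gets processed; the sizes 50 and 200 below are arbitrary assumptions. Using imap_unordered instead of starmap streams results back as they finish rather than collecting them all, which is gentler on a RAM-limited machine.

with requests.Session() as session, \
        ThreadPool(50) as thread_pool_urls, \
        ThreadPool(200) as thread_pool_images:
    worker = partial(download_url,
                     session=session,
                     folder_name=folder_name,
                     thread_pool=thread_pool_images)
    # each item is a (folder_number, url) pair; unpack it for download_url
    for _ in thread_pool_urls.imap_unordered(
            lambda args: worker(*args), enumerate(df['url'])):
        pass  # download_url already prints its own per-URL summary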
