Parallel fetching of files

Posted 2023-04-14

【Title】: Parallel fetching of files 【Posted】: 2012-02-18 22:00:33 【Question】:

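To download a file, I create a urlopen object (from the urllib2 module) and read it in chunks.

I would like to connect to the server several times and download the file in six different sessions. Done that way, the download should be faster. Many download managers have this feature.

I thought about specifying, in each session, the part of the file I want to download, and somehow handling all the sessions at the same time. I am not sure how to achieve this.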

【Question comments】:

【Answer 1】:

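For running parallel requests, you may want to use urllib3 or requests.

I took some time to put together a list of similar questions:

Searching for [python] +download +concurrent gives these interesting ones: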
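As a rough illustration of running requests in parallel, here is a minimal sketch that uses only the Python 3 standard library (a thread pool plus urllib.request) rather than urllib3 or requests. The names fetch and fetch_all and the default of six workers are assumptions made for this example; a callable built on requests, such as one returning requests.get(url).content, could be passed as fetcher instead.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, max_workers=6):
    """Fetch every URL concurrently; bodies come back in the same order as urls."""
    # fetcher is a hypothetical hook: any callable taking a URL works here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetcher, urls))
```

This fetches several different URLs at once; it does not split a single file into parts.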

Concurrent downloads - Python
What is the fastest way to send 100,000 HTTP requests in Python?
Library or tool to download multiple files in parallell
Download multiple pages concurrently?
Python: simple async download of url content?
Python, gevent, urllib2.urlopen.read(), download accelerator
Python/Urllib2/Threading: Single download thread faster than multiple download threads. Why?
Scraping landing pages of a list of domains
A clean, lightweight alternative to Python's twisted?

Searching for [python] +http +concurrent gives these:

Python: How to make multiple HTTP POST queries in one moment?
Multi threaded web scraper using urlretrieve on a cookie-enabled site

Searching for [python] +urllib2 +slow:

Python urllib2.open is slow, need a better way to read several urls
Python 2.6: parallel parsing with urllib2
How can I speed up fetching pages with urllib2 in python?
Threading HTTP requests (with proxies)

Searching for [python] +download +many:

Python,multi-threads,fetch webpages,download webpages
Downloading files in twisted using queue
Python: Something like map that works on threads
Rotating Proxies for web scraping
Anyone know of a good Python based web crawler that I could use?

【Comments】:

【Answer 2】:

It sounds like you want to use one of the available flavors of HTTP Range.

Edit: updated the link to point to the RFC hosted on w3.org.
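To make the HTTP Range idea concrete, here is a minimal sketch in Python 3 using urllib.request; it is an illustration under stated assumptions, not the approach of any particular download manager. It assumes the server supports byte ranges and that the total size is already known (for instance from a Content-Length header). The names split_ranges, fetch_range, and download_in_parts are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def split_ranges(total_size, parts):
    """Split [0, total_size) into up to `parts` inclusive (start, end) byte ranges."""
    if total_size <= 0 or parts <= 0:
        return []
    parts = min(parts, total_size)
    base, extra = divmod(total_size, parts)
    ranges, start = [], 0
    for i in range(parts):
        # The first `extra` parts get one additional byte each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def fetch_range(url, start, end, timeout=30):
    """Request bytes start..end (inclusive) of url via the Range header."""
    request = Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the range was honoured; a 200 would
        # mean the server ignored the header and sent the whole file.
        if response.status != 206:
            raise RuntimeError("server did not honour the Range request")
        return response.read()


def download_in_parts(url, total_size, parts=6, fetcher=fetch_range):
    """Download each range in its own thread and join the chunks in order."""
    ranges = split_ranges(total_size, parts)
    if not ranges:
        return b""
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        chunks = pool.map(lambda r: fetcher(url, r[0], r[1]), ranges)
        return b"".join(chunks)
```

The sketch holds the whole file in memory for simplicity; for large files, each chunk could instead be written to its own offset in a file on disk.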

【Comments】:

Thanks for mentioning this; I updated the link to point to the w3.org RFC, which should be less ephemeral.

【Answer 3】:

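As we have already discussed, I built one of these using PycURL.

The one and only thing I had to do was call pycurl_instance.setopt(pycurl_instance.NOSIGNAL, 1) to prevent crashes.

I did use APScheduler to fire the requests in separate threads. Thanks to your advice to change the busy wait while True: pass in the main thread to while True: time.sleep(3), the code behaves quite nicely, and with the Runner module from the python-daemon package the application is almost ready to be used as a typical UN*X daemon.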
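The difference between a busy wait and a sleeping wait in the main thread can be sketched as follows. This uses plain threading from the standard library as a stand-in for APScheduler and PycURL, so it is not the code from this answer; the names wait_for_workers and run_jobs are hypothetical, and unlike the endless while True loop described above, this loop exits once the workers finish.

```python
import threading
import time


def wait_for_workers(threads, poll_interval=3.0):
    """Block the main thread until every worker has finished."""
    while any(t.is_alive() for t in threads):
        # time.sleep yields the CPU between checks; a bare `pass` here
        # would keep one core spinning for the whole download.
        time.sleep(poll_interval)


def run_jobs(jobs, poll_interval=3.0):
    """Start each zero-argument callable in its own thread, then wait."""
    threads = [threading.Thread(target=job, daemon=True) for job in jobs]
    for t in threads:
        t.start()
    wait_for_workers(threads, poll_interval)
```

Each job would typically be a download function; the three-second default mirrors the time.sleep(3) mentioned above.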

【Comments】:
