How to scrape more efficiently with Urllib2?


【Title】How to scrape more efficiently with Urllib2? 【Posted】2013-07-15 03:58:43 【Question】:

Newbie here. I wrote a simple script using urllib2 that walks through Billboard.com and scrapes the No. 1 song and artist for every week from 1958 to 2013. The problem is that it is very slow - it takes several hours to finish.

I would like to know where the bottleneck is, and whether there is a way to scrape more efficiently with urllib2, or whether I need a more sophisticated tool.

import re
import urllib2

array = []
url = 'http://www.billboard.com/charts/1958-08-09/hot-100'
date = ""
while date != '2013-07-13':
    # fetch the chart page for the current week
    response = urllib2.urlopen(url)
    htmlText = response.read()
    # the chart date is embedded in the URL itself
    date = re.findall('\d\d\d\d-\d\d-\d\d', url)[0]
    # the No. 1 song title is the first <h1> on the page
    song = re.findall('<h1>.*</h1>', htmlText)[0]
    song = song[4:-5]
    # the artist is the second /artist... link
    artist = re.findall('/artist.*</a>', htmlText)[1]
    artist = re.findall('>.*<', artist)[0]
    artist = artist[1:-1]
    # follow the "Next" link to the following week's chart
    nextWeek = re.findall('href.*>Next', htmlText)[0]
    nextWeek = nextWeek[5:-5]
    array.append([date, song, artist])
    url = 'http://www.billboard.com' + nextWeek
print array

【Comments】:

Scrapy would perform much better here, and it is definitely the right tool for this job. Let me know if you agree - I will write an example spider for you.
Improvements include not using urllib2, not parsing the HTML with regular expressions, and using multiple threads for the I/O.
I sincerely doubt urllib2 has anything to do with any efficiency problem. All it does is send the request and pull down the response; 99.99% of the time is network time, which nothing else could improve. The real issues are that (a) your parsing code may be very slow, (b) you may be doing a lot of duplicate or otherwise unnecessary downloads, (c) you need to download in parallel (which you can do with urllib2 as easily as with anything else), (d) you need a faster network connection, or (e) billboard.com is throttling you.
@alecxe Thanks, much appreciated.
@Ben321, done, please check the answer. It ran for 15 minutes on my laptop.

【Answer 1】:

Your bottleneck is almost certainly fetching the data from the website. Every network request has latency, and nothing else can happen while you wait for it. You should consider splitting the requests across multiple threads so that several requests can be in flight at once. Basically, your performance is I/O-bound, not CPU-bound.

Here is a simple solution built from scratch so you can see how a crawler generally works. In the long run, using something like Scrapy is probably best, but I find it always helps to start with something simple and explicit.

import threading
import Queue
import time
import datetime
import urllib2
import re

class Crawler(threading.Thread):
    def __init__(self, thread_id, queue):
        threading.Thread.__init__(self)
        self.thread_id = thread_id
        self.queue = queue

        # let's use threading events to tell the thread when to exit
        self.stop_request = threading.Event()

    # this is the function which will run when the thread is started
    def run(self):
        print 'Hello from thread %d! Starting crawling...' % self.thread_id

        while not self.stop_request.isSet():
            # main crawl loop

            try:
                # attempt to get a url target from the queue
                url = self.queue.get_nowait()
            except Queue.Empty:
                # if there's nothing on the queue, sleep and continue
                time.sleep(0.01)
                continue

            # we got a url, so let's scrape it!
            response = urllib2.urlopen(url) # might want to consider adding a timeout here
            htmlText = response.read()

            # scraping with regex blows.
            # consider using xpath after parsing the html using lxml.html module
            song = re.findall('<h1>.*</h1>', htmlText)[0]
            song = song[4:-5]
            artist = re.findall('/artist.*</a>', htmlText)[1]
            artist = re.findall('>.*<', artist)[0]
            artist = artist[1:-1]

            print 'thread %d found artist: %s' % (self.thread_id, artist)

    # we're overriding the default join function for the thread so
    # that we can make sure it stops
    def join(self, timeout=None):
        self.stop_request.set()
        super(Crawler, self).join(timeout)

if __name__ == '__main__':
    # how many threads do you want?  more is faster, but too many
    # might get your IP blocked or even bring down the site (DoS attack)
    n_threads = 10

    # use a standard queue object (thread-safe) for communication
    queue = Queue.Queue()

    # create our threads
    threads = []
    for i in range(n_threads):
        threads.append(Crawler(i, queue))

    # generate the urls and fill the queue
    url_template = 'http://www.billboard.com/charts/%s/hot-100'
    start_date = datetime.datetime(year=1958, month=8, day=9)
    end_date = datetime.datetime(year=1959, month=9, day=5)
    delta = datetime.timedelta(weeks=1)

    week = 0
    date = start_date + delta*week
    while date <= end_date:
        url = url_template % date.strftime('%Y-%m-%d')
        queue.put(url)
        week += 1
        date = start_date + delta*week

    # start crawling!
    for t in threads:
        t.start()

    # wait until the queue is empty
    while not queue.empty():
        time.sleep(0.01)

    # kill the threads
    for t in threads:
        t.join()
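
As the comment inside run() notes, scraping with regular expressions is fragile. Below is a minimal sketch of the same extraction done with lxml.html instead (my own addition, not part of the answer above; the XPath expressions simply mirror what the regexes target on the 2013-era Billboard markup):

import urllib2
import lxml.html  # third-party: pip install lxml

def scrape_number_one(url):
    # fetch the page, with a timeout so a stalled request cannot hang a thread
    htmlText = urllib2.urlopen(url, timeout=30).read()
    doc = lxml.html.fromstring(htmlText)
    # take the song title from the first <h1> and the artist from the second
    # /artist... link, mirroring the indices used by the regex version above
    song = doc.xpath('//h1/text()')[0].strip()
    artist = doc.xpath('//a[starts-with(@href, "/artist")]/text()')[1].strip()
    return song, artist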

【Comments】:

Maybe it would also be good to explain more thoroughly the difference between using concurrency to improve CPU performance (parallelism) and using concurrency to improve data throughput or responsiveness (the kind of concurrency you are doing here), so the OP gets a deeper understanding of why this works.
Brendan, very helpful and well explained, thanks!
Excellent answer @BrendanWood. Queue-based concurrency is definitely the way to do this. With 50 concurrent threads (about the limit, as tested on my home machine/network), it took roughly 10 minutes. Awesome!

【Answer 2】:

Here is a solution using Scrapy. Take a look at the overview and you will see that it is the tool designed for exactly this kind of task:

it is fast (built on Twisted)
it is easy to use and understand
it has a built-in extraction mechanism based on XPaths (you can also use bs or lxml)
it has built-in support for piping extracted items to a database, XML, JSON, etc. (see the pipeline sketch right after this list)
and much more
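
To give a flavor of the pipeline support mentioned above, here is a minimal sketch of an item pipeline (hypothetical class and project names; it is not required for the spider below, since -o output.json already handles the export):

# pipelines.py -- enable it via ITEM_PIPELINES in settings.py (hypothetical project path)
class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts
        self.outfile = open('hot100.csv', 'w')

    def process_item(self, item, spider):
        # called for every BillBoardItem the spider yields
        self.outfile.write('%s,%s,%s\n' % (item['date'], item['artist'], item['song']))
        return item

    def close_spider(self, spider):
        self.outfile.close()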

Here is a working spider that extracts everything you asked about (it ran for 15 minutes on my fairly old laptop):

import datetime
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BillBoardItem(Item):
    date = Field()
    song = Field()
    artist = Field()


BASE_URL = "http://www.billboard.com/charts/%s/hot-100"


class BillBoardSpider(BaseSpider):
    name = "billboard_spider"
    allowed_domains = ["billboard.com"]

    def __init__(self):
        date = datetime.date(year=1958, month=8, day=9)

        self.start_urls = []
        while True:
            if date.year >= 2013:
                break

            self.start_urls.append(BASE_URL % date.strftime('%Y-%m-%d'))
            date += datetime.timedelta(days=7)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        date = hxs.select('//span[@class="chart_date"]/text()').extract()[0]

        songs = hxs.select('//div[@class="listing chart_listing"]/article')
        for song in songs:
            item = BillBoardItem()
            item['date'] = date
            try:
                item['song'] = song.select('.//header/h1/text()').extract()[0]
                item['artist'] = song.select('.//header/p[@class="chart_info"]/a/text()').extract()[0]
            except:
                continue

            yield item

Save it as billboard.py and run it via scrapy runspider billboard.py -o output.json. Then, in output.json, you will see:

...
"date": "September 20, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"
"date": "September 20, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"
"date": "September 20, 1958", "artist": "The Elegants", "song": "Little Star"
"date": "September 20, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"
"date": "September 20, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"
"date": "September 20, 1958", "artist": "Poni-Tails", "song": "Born Too Late"
"date": "September 20, 1958", "artist": "The Olympics", "song": "Western Movies"
"date": "September 20, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"
"date": "September 20, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"
"date": "September 27, 1958", "artist": "Domenico Modugno", "song": "Nel Blu Dipinto Di Blu (Volar\u00c3\u00a9)"
"date": "September 27, 1958", "artist": "The Everly Brothers", "song": "Bird Dog"
"date": "September 27, 1958", "artist": "Tommy Edwards", "song": "It's All In The Game"
"date": "September 27, 1958", "artist": "The Elegants", "song": "Little Star"
"date": "September 27, 1958", "artist": "Jimmy Clanton And His Rockets", "song": "Just A Dream"
"date": "September 27, 1958", "artist": "Little Anthony And The Imperials", "song": "Tears On My Pillow"
"date": "September 27, 1958", "artist": "Robin Luke", "song": "Susie Darlin'"
...

Also, take a look at grequests as an alternative tool.
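
grequests pairs requests with gevent so that a batch of pages can be fetched concurrently with very little code. A minimal sketch, assuming the same chart-URL pattern used above (this example is mine, not part of the original answer):

import grequests  # third-party: pip install grequests

urls = ['http://www.billboard.com/charts/%s/hot-100' % d
        for d in ('1958-08-09', '1958-08-16', '1958-08-23')]

# build the (unsent) requests, then fire them concurrently, at most 10 at a time
reqs = (grequests.get(u) for u in urls)
for response in grequests.map(reqs, size=10):
    if response is not None:  # failed requests come back as None
        print response.url, response.status_code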

Hope that helps.

【Comments】:

@Ben321 consider accepting the answer, thanks.
