Python Notes
Posted by 哈哈丶Stupid
For the past few days I've been looking into how a crawler can hit an audio site efficiently and download from it. To be honest, I didn't even know coroutines existed at first, and I used to think it made no difference how you crawled a site as long as it worked. Since discovering coroutines, though, I can't put them down.
Enough rambling, let's get to the point.
First, a word about what multithreading is. Online tutorials tend to give this example (it involves a class and a queue):
#Example.py
'''
Standard Producer/Consumer Threading Pattern
'''

import time
import threading
import Queue

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            # queue.get() blocks the current thread until
            # an item is retrieved.
            msg = self._queue.get()
            # Checks if the current message is
            # the "Poison Pill"
            if isinstance(msg, str) and msg == 'quit':
                # if so, exit the loop
                break
            # "Processes" (or in our case, prints) the queue item
            print "I'm a thread, and I received %s!!" % msg
        # Always be friendly!
        print 'Bye byes!'


def Producer():
    # Queue is used to share items between
    # the threads.
    queue = Queue.Queue()

    # Create an instance of the worker
    worker = Consumer(queue)
    # start calls the internal run() method to
    # kick off the thread
    worker.start()

    # variable to keep track of when we started
    start_time = time.time()
    # While under 5 seconds..
    while time.time() - start_time < 5:
        # "Produce" a piece of work and stick it in
        # the queue for the Consumer to process
        queue.put('something at %s' % time.time())
        # Sleep a bit just to avoid an absurd number of messages
        time.sleep(1)

    # This is the "poison pill" method of killing a thread.
    queue.put('quit')
    # wait for the thread to close down
    worker.join()


if __name__ == '__main__':
    Producer()
This example reminds me of my junior-year course project, except that one implemented the producer-consumer pattern in Java. Cursed Java.
The first thing you need is a worker class; next, a queue to pass objects between the threads; and yes, if you want two-way communication you have to add a second queue.
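If you do want results coming back, that second queue really is all it takes. Here is a minimal sketch of the idea (mine, not from the tutorials; Python 2 to match the example above): tasks go in on one queue, results come back on another.

# Sketch: two queues for two-way communication (not from the original post)
import threading
import Queue

class Worker(threading.Thread):
    def __init__(self, task_queue, result_queue):
        threading.Thread.__init__(self)
        self._tasks = task_queue
        self._results = result_queue

    def run(self):
        while True:
            item = self._tasks.get()
            if item == 'quit':
                break
            # "process" the item and push the result back to the main thread
            self._results.put(item * 2)

tasks = Queue.Queue()
results = Queue.Queue()
worker = Worker(tasks, results)
worker.start()

for n in range(5):
    tasks.put(n)
tasks.put('quit')
worker.join()

while not results.empty():
    print results.get()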
Next, we need a pool of worker threads, to speed up page fetching with multithreading:
#Example2.py
'''
A more realistic thread pool example
'''

import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'


def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com'
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add one poison pill per worker
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()
Scary. I can't keep writing this; it's just too complicated.
Let's switch to a different approach: map.
urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)
Yes, you read that right, that's all it takes: map walks through urls, applies urllib2.urlopen to each one, and stores the return values in results as a list (this is Python 2, where map returns a list).
On its own, the built-in map still runs those calls one after another; but swap in Pool.map from multiprocessing.dummy and the very same one-liner spreads the requests across a thread pool.
Be careful, though: with too many threads, the time lost to switching between them can even exceed the time spent on real work, so for an actual workload the best approach is to experiment until you find a pool size that fits (a quick way to do that is sketched a little further down).
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# close the pool and wait for the work to finish
pool.close()
pool.join()
Only one line really matters: pool.map effortlessly replaces the 40-odd convoluted lines of the earlier example.
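As for the pool size question mentioned earlier, the cheapest way to answer it is simply to time a few candidates against your own URL list. A rough sketch of that experiment (my own, not from the original; it reuses a shortened version of the urls list, and the right number depends entirely on your machine and network):

import time
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
]  # in practice, use your full URL list

# Time the same job with a few different pool sizes and keep the fastest
for size in [1, 2, 4, 8]:
    pool = ThreadPool(size)
    start = time.time()
    pool.map(urllib2.urlopen, urls)
    pool.close()
    pool.join()
    print 'pool size %d: %.2fs' % (size, time.time() - start)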
Having said all that, I still haven't gotten to what I actually wanted to talk about. Speechless.
OK, here goes.
I spent half a day studying Ximalaya's listening site and managed to write a small single-process crawler that downloads every audio resource of a chosen category. Sounds cool, but the crawling was genuinely slow, so I added coroutines and the speed jumped noticeably; essentially they cut down the time spent waiting on requests. I mainly added two coroutines.
# Imports used by both snippets (inferred from usage; the excerpt doesn't show them)
import os
import gevent
from gevent import monkey; monkey.patch_all()
import requests
from bs4 import BeautifulSoup
from lxml import etree

# First coroutine: fetch all the listing pages concurrently
def get_url():
    try:
        start_urls = ['http://www.ximalaya.com/dq/book-%E6%82%AC%E7%96%91/{}/'.format(pn) for pn in range(1, 85)]
        print(start_urls)
        urls_list = []
        for start_url in start_urls:
            urls_list.append(start_url)
        # Coroutine block 1: instead of fetching one listing page at a time,
        # spawn one greenlet per page and wait for them all
        jobs = [gevent.spawn(fetch_url_text, url) for url in urls_list]
        gevent.joinall(jobs)
        for response in [job.value for job in jobs]:
            soup = BeautifulSoup(response, 'lxml')
            for item in soup.find_all('div', class_='albumfaceOutter'):
                print(item)
                href = item.a['href']
                title = item.img['alt']
                img_url = item.img['src']
                content = {
                    'href': href,
                    'title': title,
                    'img_url': img_url
                }
                get_mp3(href, title)
    except Exception as e:
        print(e)
    return ''
The second coroutine:
# Second coroutine: fetch every track's JSON concurrently, then download the audio
def get_mp3(url, title):
    response = fetch_url_text(url)
    num_list = etree.HTML(response).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    print(num_list)

    mkdir(title)
    os.chdir(r'F:\xmly\\' + title)
    ii = 1
    list_ids = []
    for id in num_list:
        list_ids.append(id)
    # Coroutine block 2: instead of requesting one track's JSON at a time,
    # spawn one greenlet per sound id and wait for them all
    jobs = [gevent.spawn(fetch_json, id) for id in list_ids]
    gevent.joinall(jobs)
    for html in [job.value for job in jobs]:
        mp3_url = html.get('play_path')
        content = requests.get(mp3_url, headers=headers).content
        name = title + '_%i.m4a' % ii
        with open(name, 'wb') as file:
            file.write(content)
        print("{} download is ok".format(mp3_url))
        ii += 1
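Both snippets call a few helpers that aren't shown here (fetch_url_text, fetch_json, mkdir, plus a headers dict). Here is my guess at what they might look like, reconstructed purely from how they are called; the real ones presumably do the same job:

import os
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed: some browser-like UA

def fetch_url_text(url):
    # assumed: fetch a page and return its HTML as text
    return requests.get(url, headers=headers).text

def fetch_json(sound_id):
    # assumed: per-track JSON endpoint exposing fields like play_path
    json_url = 'http://www.ximalaya.com/tracks/{}.json'.format(sound_id)
    return requests.get(json_url, headers=headers).json()

def mkdir(title):
    # assumed: create a per-album folder under F:\xmly if it doesn't exist yet
    path = os.path.join(r'F:\xmly', title)
    if not os.path.exists(path):
        os.makedirs(path)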
The code is pretty redundant; just skim it. After all, I only just learned this, so I'm applying it as I go.
Honestly, though, the speed still doesn't feel like enough.
I could add the multiprocessing module on top and the speed would probably climb again (a rough sketch of that idea is below). And what if it were distributed as well? The whole site could be crawled in minutes, ha.
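For what it's worth, "add multiprocessing on top" could look something like this: split the 84 listing pages across a few processes and let each one crawl its own share. This is only a sketch I haven't benchmarked, and crawl_listing_page is a hypothetical stand-in for the gevent-based logic above:

from multiprocessing import Pool

def crawl_pages(page_numbers):
    # each worker process handles its own share of the listing pages
    for pn in page_numbers:
        url = 'http://www.ximalaya.com/dq/book-%E6%82%AC%E7%96%91/{}/'.format(pn)
        crawl_listing_page(url)  # hypothetical: parse the page and download its albums, as above

if __name__ == '__main__':
    pages = list(range(1, 85))
    chunks = [pages[i::4] for i in range(4)]  # 4 processes, interleaved page numbers
    pool = Pool(4)
    pool.map(crawl_pages, chunks)
    pool.close()
    pool.join()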
Alright, enough daydreaming. I only have one laptop, the lab has just two servers, and both are busy training models, so I won't add to their load.
I still haven't said what a coroutine actually is.
A coroutine can be thought of as a green thread, or micro-thread. The idea is that while function_a() is running it can be suspended at any point to go run function_b(), then function_b() is suspended and function_a() resumes, switching back and forth freely. The whole thing looks like multithreading, yet only a single thread is ever executing; in gevent it's the monkey-patching API that makes the blocking calls behave asynchronously. Pretty magical, right?
# -*- coding: utf-8 -*-
# version: 2.7
import gevent
from gevent import monkey; monkey.patch_all()
import urllib2

def get_body(i):
    print "start", i
    urllib2.urlopen("http://cn.bing.com")
    print "end", i

tasks = [gevent.spawn(get_body, i) for i in range(3)]
gevent.joinall(tasks)
This simple example should be easy enough to follow.
My takeaway: any program with a for loop where every iteration spends a long time waiting on a request is a candidate for coroutines.
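Concretely, the pattern is to take a `for url in urls: fetch(url)` loop and turn it into spawn-then-joinall. A minimal sketch of my own (fetch stands for any blocking request function):

import gevent
from gevent import monkey; monkey.patch_all()
import requests

def fetch(url):
    # any blocking request; monkey.patch_all() lets greenlets yield while it waits
    return requests.get(url).text

urls = ['http://cn.bing.com', 'http://www.python.org', 'http://www.yahoo.com']

# Sequential version: total time is roughly the sum of all the requests
# pages = [fetch(url) for url in urls]

# Coroutine version: the requests overlap, total time is roughly the slowest one
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
pages = [job.value for job in jobs]
print(len(pages))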
My understanding may be off; corrections are welcome.
To wrap up, a quick word on concurrency versus parallelism.
Concurrency is like two queues of people sharing a single coffee machine, taking turns on it.
Parallelism is like two queues with two coffee machines, one machine per queue, both running at the same time.
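In code terms (a toy sketch of my own, not from the post): for CPU-bound work, a thread pool behaves like the single shared coffee machine because the GIL lets only one thread compute at a time, while a process pool really is one machine per queue:

import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

def busy(n):
    # CPU-bound "work": just burn cycles
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    args = [5000000] * 4

    thread_pool = ThreadPool(4)
    start = time.time()
    thread_pool.map(busy, args)   # concurrency: threads take turns on one interpreter
    thread_pool.close(); thread_pool.join()
    print('threads:   %.2fs' % (time.time() - start))

    process_pool = Pool(4)
    start = time.time()
    process_pool.map(busy, args)  # parallelism: four processes really run at once
    process_pool.close(); process_pool.join()
    print('processes: %.2fs' % (time.time() - start))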
Pretty intuitive, right?
Bye!