Python Notes

Posted by 哈哈丶Stupid


For the last few days I've been looking into how a crawler can access and download from an audio site efficiently. Honestly, at first I didn't even know coroutines were a thing~~~ I also used to think it didn't matter how you crawled a site, as long as it worked. Ever since I discovered coroutines, I can't put them down~~~~~

 

OK, enough rambling~ on to the main topic:

First, let's talk about what multithreading is. A lot of online tutorials give this example (it involves a class and a queue):

 

#Example.py
'''
Standard Producer/Consumer Threading Pattern
'''

import time
import threading
import Queue

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            # queue.get() blocks the current thread until
            # an item is retrieved.
            msg = self._queue.get()
            # Checks if the current message is
            # the "Poison Pill"
            if isinstance(msg, str) and msg == 'quit':
                # if so, exits the loop
                break
            # "Processes" (or in our case, prints) the queue item
            print "I'm a thread, and I received %s!!" % msg
        # Always be friendly!
        print 'Bye byes!'


def Producer():
    # Queue is used to share items between
    # the threads.
    queue = Queue.Queue()

    # Create an instance of the worker
    worker = Consumer(queue)
    # start calls the internal run() method to
    # kick off the thread
    worker.start()

    # variable to keep track of when we started
    start_time = time.time()
    # While under 5 seconds..
    while time.time() - start_time < 5:
        # "Produce" a piece of work and stick it in
        # the queue for the Consumer to process
        queue.put('something at %s' % time.time())
        # Sleep a bit just to avoid an absurd number of messages
        time.sleep(1)

    # This is the "poison pill" method of killing a thread.
    queue.put('quit')
    # wait for the thread to close down
    worker.join()


if __name__ == '__main__':
    Producer()

 

This example reminds me of my third-year course project, except that one implemented the producer-consumer pattern in Java~~~ wretched Java

Point one: you need a model class. Then you need a queue to pass objects between the threads, and yes, if you want two-way communication you have to add yet another queue. A quick sketch of that two-queue idea is below.
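A minimal sketch of two-way communication with two queues. Everything here (the worker function, the upper-casing "work") is made up purely for illustration, Python 2.7 like the rest of this post:

import threading
import Queue

def worker(task_q, result_q):
    while True:
        task = task_q.get()
        if task == 'quit':           # poison pill
            break
        result_q.put(task.upper())   # send the result back on the second queue

task_q, result_q = Queue.Queue(), Queue.Queue()
t = threading.Thread(target=worker, args=(task_q, result_q))
t.start()

task_q.put('hello')
print result_q.get()                 # -> HELLO
task_q.put('quit')
t.join()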

 

Next, we need a pool of worker threads, using multithreading to speed up web retrieval~~~

#Example2.py
'''
A more realistic thread pool example
'''

import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'


def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com'
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill, one per worker
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()

Scary. I can't go on with this, it's just too complicated.

Let's switch to another approach~~~ map

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)

Yes, you read that right, that's all it takes: map walks over urls and the returned values end up in results, which is a list.

The built-in map by itself still runs the calls one after another; what makes it interesting is that the very same map interface, backed by a thread pool (shown below), gives us concurrent requests.
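For reference only (this post sticks to Python 2.7): on Python 3 the same map-over-a-thread-pool pattern is available in the standard library via concurrent.futures, roughly like this sketch:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen   # Python 3 counterpart of urllib2.urlopen

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map has the same shape as the built-in map,
    # but runs the calls in a pool of threads
    results = list(executor.map(urlopen, urls))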

 

 

With too many threads, the time wasted switching between them can even exceed the actual working time. For real-world needs, the best approach is simply to experiment until you find a suitable thread pool size.

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
    ]

# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# close the pool and wait for the work to finish
pool.close()
pool.join()

The crucial part is really just one line: the map call easily replaces the 40-plus lines of the complicated example earlier.
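As for finding the right pool size by trial, a minimal sketch of that experiment follows. The sizes tried are arbitrary, fetch is just a small wrapper I made up, and urls is the list defined above:

import time
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

def fetch(url):
    return urllib2.urlopen(url).read()

# time the same batch of urls with a few different pool sizes
# and keep whichever is fastest for your workload and machine
for size in [1, 2, 4, 8, 16]:
    pool = ThreadPool(size)
    start = time.time()
    pool.map(fetch, urls)
    pool.close()
    pool.join()
    print 'pool size %2d: %.2fs' % (size, time.time() - start)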

After all that, I still haven't gotten to what I actually wanted to talk about~~~ speechless

Alright, here I go~~~

I spent half a day studying the Ximalaya audiobook site and managed to write a little single-process crawler that downloads every audio resource of a chosen category~~~~ Sounds cool, but honestly the crawling was really slow. So I added coroutines and the speed jumped a fair bit; essentially they cut down the time spent waiting on requests. I mainly added two coroutine modules.

# The first coroutine
def get_url():
    try:
        # listing pages of the "suspense" audiobook category, pages 1-84
        start_urls = ['http://www.ximalaya.com/dq/book-%E6%82%AC%E7%96%91/{}/'.format(pn) for pn in range(1,85)]
        print(start_urls)
        urls_list = []
        for start_url in start_urls:
            urls_list.append(start_url)
            # (the sequential per-page version that used to live here
            #  was replaced by the coroutine version below)
        # coroutine module 1: spawn one greenlet per listing page;
        # fetch_url_text() is a helper defined elsewhere in the script
        jobs = [gevent.spawn(fetch_url_text, url) for url in urls_list]
        gevent.joinall(jobs)
        for response in [job.value for job in jobs]:
            soup = BeautifulSoup(response, 'lxml')
            for item in soup.find_all('div', class_='albumfaceOutter'):
                print(item)
                href = item.a['href']
                title = item.img['alt']
                img_url = item.img['src']
                content = {
                    'href': href,
                    'title': title,
                    'img_url': img_url
                }
                get_mp3(href, title)
    except Exception as e:
        print(e)
    return ''

The second coroutine:

def get_mp3(url, title):
    response = fetch_url_text(url)
    # the album page keeps its track ids in the sound_ids attribute
    num_list = etree.HTML(response).xpath('//div[@class="personal_body"]/@sound_ids')[0].split(',')
    print(num_list)

    mkdir(title)   # helper defined elsewhere in the script
    os.chdir(r'F:\xmly' + '\\' + title)
    ii = 1
    list_ids = []
    for id in num_list:
        list_ids.append(id)
        # (the sequential per-track download that used to live here
        #  was replaced by the coroutine version below)
    # coroutine module 2: fetch every track's JSON metadata concurrently;
    # fetch_json() is a helper defined elsewhere in the script
    jobs = [gevent.spawn(fetch_json, id) for id in list_ids]
    gevent.joinall(jobs)
    for html in [job.value for job in jobs]:
        mp3_url = html.get('play_path')
        content = requests.get(mp3_url, headers=headers).content
        name = title + '_%i.m4a' % ii
        with open(name, 'wb') as file:
            file.write(content)
        print("{} download is ok".format(mp3_url))
        ii += 1

The code is pretty redundant~~~ just treat it as something to glance at; after all, I only just learned this and I'm applying it as I learn~~~~

To be honest, I still feel it isn't fast enough.

I could add a multiprocessing module on top and the speed would probably climb again. And what if it were distributed? The whole site could be crawled in minutes, haha
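Just as a thought experiment, layering multiprocessing on top might look roughly like the sketch below. It assumes get_url() were refactored into a hypothetical crawl_pages(pages) function that only crawls the listing pages it is given; none of this is in my actual script:

from multiprocessing import Pool

def crawl_pages(pages):
    # hypothetical refactor of get_url(): crawl only the listing pages
    # in `pages`, each process running its own gevent greenlets inside
    pass

if __name__ == '__main__':
    # split the 84 listing pages across 4 processes
    chunks = [range(1, 22), range(22, 43), range(43, 64), range(64, 85)]
    pool = Pool(4)
    pool.map(crawl_pages, chunks)
    pool.close()
    pool.join()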

OK, enough daydreaming. I only have one laptop, and the lab has just two servers, both busy training models, so I won't add to their load~~~~~

Oh, and I still haven't said what a coroutine actually is~~~~~

A coroutine can be understood as a green thread, or micro-thread. While function_a() is executing, it can be interrupted at any point to go run function_b(), then interrupted again to resume function_a(), switching back and forth freely. The whole thing looks like multithreading, yet only a single thread is ever executing; in gevent it is the monkey-patching API that makes blocking calls behave asynchronously~~~~ magical, isn't it?

#! -*- coding:utf-8 -*-
# version: 2.7
import gevent
from gevent import monkey; monkey.patch_all()
import urllib2

def get_body(i):
    print "start", i
    urllib2.urlopen("http://cn.bing.com")
    print "end", i

tasks = [gevent.spawn(get_body, i) for i in range(3)]
gevent.joinall(tasks)

This simple example should be easy to follow~~~~~

My takeaway: any program with a for loop in which every iteration spends a lot of time on a request is a good candidate for coroutines~~~~~ (see the sketch just below)
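To make that concrete, the general recipe as a minimal sketch. The fetch function and the url list here are just stand-ins for whatever slow request and iterable your loop has:

import gevent
from gevent import monkey; monkey.patch_all()
import urllib2

urls = ['http://www.python.org', 'http://www.yahoo.com']

def fetch(url):
    return urllib2.urlopen(url).read()

# Before: each slow request blocks the loop
#     results = [fetch(url) for url in urls]

# After: spawn one greenlet per request, wait for them all, collect results
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
results = [job.value for job in jobs]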

My understanding may well be off, corrections welcome~~~~

To wrap up, a word on concurrency versus parallelism~~~~~

Concurrency is like having two queues and one coffee machine: the queues take turns using the machine.

Parallelism is like having two queues and two coffee machines: each queue has its own, and both run at the same time.~~~~~

Pretty simple, right?~~~~~
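In code terms the nice thing is that the two look almost identical in Python; only the import changes. A tiny sketch, with a made-up work function:

from multiprocessing.dummy import Pool as ThreadPool   # concurrency: threads in one process
from multiprocessing import Pool as ProcessPool        # parallelism: separate processes

def work(x):
    return x * x

if __name__ == '__main__':
    print ThreadPool(2).map(work, range(5))    # one interpreter, threads take turns
    print ProcessPool(2).map(work, range(5))   # two worker processes running at the same time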

Bye bye~~~~~~~~~~
