Crawler Modules: Solving the IO Problem
Posted by 方杰0410
1 The asyncio module
The asyncio module mainly detects IO for us (network IO only).
@asyncio.coroutine: decorator that marks a generator function as a coroutine
tasks: the list of tasks
get_event_loop: get the event loop that will drive the tasks
run_until_complete: submit the tasks and run them, monitoring their execution until they finish
asyncio.gather(task list): run the tasks as one group
close: close the event loop
open_connection: establish the connection
yield from: switch to another task whenever the current one blocks
sleep: simulate blocking network IO
write: prepare the data packet
send.drain: send the data packet
read: receive data
# import asyncio
#
# @asyncio.coroutine
# def task(task_id, seconds):
#     print('%s is running' % task_id)
#     yield from asyncio.sleep(seconds)
#     print('%s is done' % task_id)
#
# tasks = [
#     task(1, 3),
#     task(2, 2),
#     task(3, 1)
# ]
#
# loop = asyncio.get_event_loop()
# loop.run_until_complete(asyncio.gather(*tasks))
# loop.close()

# 1. Per TCP: establish the connection (blocking IO)
# 2. Per the HTTP protocol: url, request method, request headers, request body
# 3. Send the Request (IO)
# 4. Receive the Response (IO)
import asyncio

@asyncio.coroutine
def get_page(host, port=80, url='/'):  # e.g. https:// www.baidu.com:80 /
    print('GET:%s' % host)
    # open_connection yields a (reader, writer) pair once the TCP handshake completes
    recv, send = yield from asyncio.open_connection(host=host, port=port)

    # hand-build a minimal HTTP request packet
    http_pk = """GET %s HTTP/1.1\r\nHost:%s\r\n\r\n""" % (url, host)
    send.write(http_pk.encode('utf-8'))
    yield from send.drain()

    text = yield from recv.read()
    print('host:%s size:%s' % (host, len(text)))
    # parsing logic would go here

# http://www.cnblogs.com/linhaifeng/articles/7806303.html
# https://wiki.python.org/moin/BeginnersGuide
# https://www.baidu.com/
tasks = [
    get_page('www.cnblogs.com', url='/linhaifeng/articles/7806303.html'),
    get_page('wiki.python.org', url='/moin/BeginnersGuide'),
    get_page('www.baidu.com'),
]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
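The @asyncio.coroutine / yield from style above is the legacy coroutine syntax. A minimal sketch of the same fetch with async/await, assuming Python 3.7+ for asyncio.run; the Connection: close header is an addition here, so reader.read() returns when the server finishes instead of waiting on a keep-alive socket:

import asyncio

async def get_page(host, port=80, url='/'):
    print('GET:%s' % host)
    reader, writer = await asyncio.open_connection(host=host, port=port)

    # Connection: close asks the server to close the socket when done,
    # so reader.read() sees EOF rather than hanging on keep-alive
    request = 'GET %s HTTP/1.1\r\nHost:%s\r\nConnection: close\r\n\r\n' % (url, host)
    writer.write(request.encode('utf-8'))
    await writer.drain()

    text = await reader.read()
    print('host:%s size:%s' % (host, len(text)))
    writer.close()

async def main():
    await asyncio.gather(
        get_page('www.cnblogs.com', url='/linhaifeng/articles/7806303.html'),
        get_page('wiki.python.org', url='/moin/BeginnersGuide'),
        get_page('www.baidu.com'),
    )

asyncio.run(main())  # Python 3.7+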
2 The aiohttp module
aiohttp.request: send a request
import asyncio
import aiohttp  # pip3 install aiohttp

# note: calling aiohttp.request as a bare coroutine is the old (pre-1.0) aiohttp API;
# current versions go through ClientSession, see the sketch after this block
@asyncio.coroutine
def get_page(url):  # e.g. https:// www.baidu.com:80 /
    print('GET:%s' % url)
    response = yield from aiohttp.request('GET', url=url)

    data = yield from response.read()
    print('url:%s size:%s' % (url, len(data)))

# http://www.cnblogs.com/linhaifeng/articles/7806303.html
# https://wiki.python.org/moin/BeginnersGuide
# https://www.baidu.com/
tasks = [
    get_page('http://www.cnblogs.com/linhaifeng/articles/7806303.html'),
    get_page('https://wiki.python.org/moin/BeginnersGuide'),
    get_page('https://www.baidu.com/'),
]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
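As the comment above notes, aiohttp.request as a bare coroutine is the old API. A minimal sketch of the same downloads with the ClientSession-based API, assuming aiohttp 3.x and Python 3.7+:

import asyncio
import aiohttp

async def get_page(session, url):
    print('GET:%s' % url)
    async with session.get(url) as response:
        data = await response.read()
    print('url:%s size:%s' % (url, len(data)))

async def main():
    # one session is shared by all requests so connections can be pooled
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            get_page(session, 'http://www.cnblogs.com/linhaifeng/articles/7806303.html'),
            get_page(session, 'https://wiki.python.org/moin/BeginnersGuide'),
            get_page(session, 'https://www.baidu.com/'),
        )

asyncio.run(main())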
3 The twisted module
twisted: an asynchronous IO framework
getPage: send a request
internet.reactor: the reactor, i.e. the event loop that drives all IO
addCallback: bind a callback function
defer.DeferredList: aggregate a list of Deferreds so they can be waited on as a group
reactor.run: start the loop that executes the tasks
addBoth: runs once all tasks have finished; it receives the results returned by the callbacks
reactor.stop: terminate the program
'''
#Problem 1: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
pip3 install C:\Users\Administrator\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip3 install twisted

#Problem 2: ModuleNotFoundError: No module named 'win32api'
https://sourceforge.net/projects/pywin32/files/pywin32/

#Problem 3: openssl
pip3 install pyopenssl
'''

# basic twisted usage
from twisted.web.client import getPage, defer
from twisted.internet import reactor

def all_done(arg):
    # fires once every Deferred in the list has completed
    # print(arg)
    reactor.stop()

def callback(res):
    print(res)
    return 1

defer_list = []
urls = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'https://www.python.org',
]
for url in urls:
    obj = getPage(url.encode('utf-8'))
    obj.addCallback(callback)
    defer_list.append(obj)

defer.DeferredList(defer_list).addBoth(all_done)
reactor.run()

# detailed usage of twisted's getPage
from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse

def one_done(arg):
    print(arg)
    reactor.stop()

post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)
reactor.run()
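getPage was later deprecated and removed from Twisted in favor of twisted.web.client.Agent. A minimal sketch of the same fan-out rewritten around Agent and readBody, assuming a reasonably recent Twisted; the helper names on_response/on_body are illustrative, not part of Twisted:

from twisted.internet import reactor
from twisted.internet.defer import DeferredList
from twisted.web.client import Agent, readBody

def on_body(body, url):
    print('url:%s size:%s' % (url, len(body)))

def on_response(response, url):
    # readBody returns a Deferred that fires with the full body bytes
    d = readBody(response)
    d.addCallback(on_body, url)
    return d

def all_done(results):
    reactor.stop()

agent = Agent(reactor)
deferred_list = []
for url in [b'http://www.baidu.com', b'http://www.bing.com']:
    d = agent.request(b'GET', url)
    d.addCallback(on_response, url)
    deferred_list.append(d)

DeferredList(deferred_list).addBoth(all_done)
reactor.run()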
4 The tornado module
from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop

count = 0  # counter used to stop the IO loop once every response has arrived

def handle_response(response):
    """
    Handle the response body (a counter has to be maintained in order to
    stop the IO loop by calling ioloop.IOLoop.current().stop())
    :param response:
    :return:
    """
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)
    global count
    count -= 1
    if count == 0:
        ioloop.IOLoop.current().stop()

def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    global count
    count = len(url_list)
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        http_client.fetch(HTTPRequest(url), handle_response)

ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()
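Rather than maintaining the counter by hand, the loop can also be driven by run_sync, which starts the IOLoop, waits for one coroutine to finish, and stops the loop itself. A minimal sketch, assuming Tornado 4.1+ (where fetch returns a Future when no callback is passed and accepts raise_error):

from tornado import gen, ioloop
from tornado.httpclient import AsyncHTTPClient

@gen.coroutine
def fetch_all():
    client = AsyncHTTPClient()
    urls = ['http://www.baidu.com', 'http://www.bing.com']
    # yielding a list of Futures runs the fetches concurrently
    responses = yield [client.fetch(url, raise_error=False) for url in urls]
    for url, response in zip(urls, responses):
        print('url:%s size:%s' % (url, len(response.body or b'')))

# run_sync starts the IO loop, runs the coroutine, then stops the loop
ioloop.IOLoop.current().run_sync(fetch_all)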