Crawler Modules for Handling IO

Posted by 方杰0410

1. The asyncio module

 The asyncio module mainly helps us detect IO (network IO only).

 @asyncio.coroutine: decorator that marks a generator function as a coroutine

 tasks: the list of tasks

 get_event_loop: get the event loop that drives the tasks

 run_until_complete: submit the tasks to the loop and monitor their execution until they finish

 asyncio.gather(task list): run the tasks together

 close: close the event loop

 open_connection: establish a connection

 yield from: switch to another task whenever the current one blocks

 sleep: simulates blocking network IO

 write: prepare the data packet

 send.drain: send the data packet

 read: receive data

import asyncio

@asyncio.coroutine
def task(task_id, seconds):
    print('%s is running' % task_id)
    yield from asyncio.sleep(seconds)  # simulated blocking IO; control switches to another task
    print('%s is done' % task_id)


tasks = [
    task(1, 3),
    task(2, 2),
    task(3, 1)
]

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
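For comparison, on Python 3.5+ the same flow can be written with `async`/`await` instead of the decorator form above (`asyncio.run` requires 3.7+). This is an equivalent sketch, with the sleep durations shortened:

```python
import asyncio

async def task(task_id, seconds):
    # 'await' plays the role of 'yield from': control switches away while sleeping
    print('%s is running' % task_id)
    await asyncio.sleep(seconds)
    print('%s is done' % task_id)
    return task_id

async def main():
    # gather runs the coroutines concurrently and returns results in argument order
    return await asyncio.gather(task(1, 0.3), task(2, 0.2), task(3, 0.1))

results = asyncio.run(main())
print(results)
```

Note that task 3 finishes first (shortest sleep), yet `gather` still returns the results in the order the coroutines were passed in.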


#1. Per TCP: establish a connection (blocking IO)
#2. Per the HTTP protocol: prepare the url, request method, request headers, request body
#3. Send the Request (IO)
#4. Receive the Response (IO)
import asyncio

@asyncio.coroutine
def get_page(host, port=80, url='/'): #https://  www.baidu.com:80  /
    print('GET:%s' % host)
    recv, send = yield from asyncio.open_connection(host=host, port=port)

    http_pk = """GET %s HTTP/1.1\r\nHost:%s\r\n\r\n""" % (url, host)
    send.write(http_pk.encode('utf-8'))

    yield from send.drain()

    text = yield from recv.read()

    print('host:%s size:%s' % (host, len(text)))

    # parsing would go here



#http://www.cnblogs.com/linhaifeng/articles/7806303.html
#https://wiki.python.org/moin/BeginnersGuide
#https://www.baidu.com/

tasks = [
    get_page('www.cnblogs.com', url='/linhaifeng/articles/7806303.html'),
    get_page('wiki.python.org', url='/moin/BeginnersGuide'),
    get_page('www.baidu.com'),
]

loop=asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
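The parsing step is left open in `get_page` above. As a minimal sketch of what it could look like on the raw bytes that `recv.read()` returns (`parse_http_response` is a hypothetical helper; it assumes a well-formed response with no chunked transfer encoding or compression):

```python
def parse_http_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    # headers and body are separated by a blank line
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    # status line looks like "HTTP/1.1 200 OK"
    status_code = int(lines[0].split()[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body

raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"
code, hdrs, body = parse_http_response(raw)
print(code, hdrs['content-type'], body)
```

A real crawler would go on to handle redirects, chunked bodies, and gzip, which is exactly the work aiohttp (next section) does for you.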

2. The aiohttp module

 aiohttp.request: send a request

import asyncio
import aiohttp  # pip3 install aiohttp

@asyncio.coroutine
def get_page(url): #https://  www.baidu.com:80  /
    print('GET:%s' % url)
    response = yield from aiohttp.request('GET', url=url)

    data = yield from response.read()

    print('url:%s size:%s' % (url, len(data)))


#http://www.cnblogs.com/linhaifeng/articles/7806303.html
#https://wiki.python.org/moin/BeginnersGuide
#https://www.baidu.com/

tasks = [
    get_page('http://www.cnblogs.com/linhaifeng/articles/7806303.html'),
    get_page('https://wiki.python.org/moin/BeginnersGuide'),
    get_page('https://www.baidu.com/'),
]

loop=asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()

3. The twisted module

 twisted: an asynchronous IO framework

 getPage: send a request

 internet.reactor: the reactor, i.e. the event loop

 addCallback: bind a callback function

 defer.DeferredList: aggregate a list of Deferred objects into one

 reactor.run: start the loop that executes the tasks

 addBoth: runs after all the tasks have finished; its argument receives the results returned by the callbacks

 reactor.stop: stop the reactor

'''
# Issue 1: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
pip3 install C:\Users\Administrator\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
pip3 install twisted

# Issue 2: ModuleNotFoundError: No module named 'win32api'
https://sourceforge.net/projects/pywin32/files/pywin32/

# Issue 3: openssl
pip3 install pyopenssl
'''

# basic twisted usage
from twisted.web.client import getPage, defer
from twisted.internet import reactor

def all_done(arg):
    # print(arg)
    reactor.stop()

def callback(res):
    print(res)
    return 1

defer_list = []
urls = [
    'http://www.baidu.com',
    'http://www.bing.com',
    'https://www.python.org',
]
for url in urls:
    obj = getPage(url.encode('utf-8'))
    obj.addCallback(callback)
    defer_list.append(obj)

defer.DeferredList(defer_list).addBoth(all_done)

reactor.run()
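To see what the callback chain is doing without installing twisted, here is a toy stand-in (`ToyDeferred` is illustrative only, not twisted's actual `Deferred`): each callback added with `addCallback` receives the return value of the previous one, which is why `callback(res)` above returning `1` matters for whatever is chained after it.

```python
class ToyDeferred:
    """A toy stand-in for twisted's Deferred, showing only the
    callback-chaining idea: each callback is fed the previous
    callback's return value."""
    def __init__(self):
        self.callbacks = []

    def addCallback(self, fn):
        self.callbacks.append(fn)
        return self

    def fire(self, result):
        # in real twisted, the reactor fires this when the page arrives
        for fn in self.callbacks:
            result = fn(result)
        return result

d = ToyDeferred()
d.addCallback(lambda page: len(page))      # like callback(res) above
d.addCallback(lambda n: 'size:%s' % n)     # receives the previous return value
final = d.fire(b'<html>...</html>')
print(final)
```

Real Deferreds also carry an error chain (`addErrback`) and are fired by the reactor, not by hand, but the result-threading shown here is the same.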




# detailed usage of twisted's getPage
from twisted.internet import reactor
from twisted.web.client import getPage
import urllib.parse


def one_done(arg):
    print(arg)
    reactor.stop()

post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()

4. The tornado module

from tornado.httpclient import AsyncHTTPClient
from tornado.httpclient import HTTPRequest
from tornado import ioloop


def handle_response(response):
    """
    Handle the response body. A counter must be maintained here so that
    ioloop.IOLoop.current().stop() can be called to stop the IO loop
    once every request has been answered.
    :param response:
    :return:
    """
    if response.error:
        print("Error:", response.error)
    else:
        print(response.body)


def func():
    url_list = [
        'http://www.baidu.com',
        'http://www.bing.com',
    ]
    for url in url_list:
        print(url)
        http_client = AsyncHTTPClient()
        http_client.fetch(HTTPRequest(url), handle_response)


ioloop.IOLoop.current().add_callback(func)
ioloop.IOLoop.current().start()
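The docstring above notes that `handle_response` must maintain a counter so the IO loop can be stopped after the last response. A minimal sketch of that counter pattern, independent of tornado (`make_stop_counter` is a hypothetical helper; in real code `stop` would be `ioloop.IOLoop.current().stop`):

```python
def make_stop_counter(total, stop):
    """Wrap response handlers so that after `total` responses have
    been handled, stop() is called exactly once."""
    state = {'remaining': total}

    def wrap(handler):
        def wrapped(response):
            handler(response)
            state['remaining'] -= 1
            if state['remaining'] == 0:
                stop()
        return wrapped

    return wrap

# usage sketch, with plain functions standing in for tornado callbacks
stopped = []
wrap = make_stop_counter(2, lambda: stopped.append(True))
handle = wrap(lambda resp: print('handled', resp))
handle('resp-1')
handle('resp-2')
print(stopped)
```

With tornado, you would wrap `handle_response` this way and pass `len(url_list)` as `total`, so the loop started by `ioloop.IOLoop.current().start()` terminates on its own.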
