Why is BeautifulSoup related to 'Task exception was never retrieved'?

Posted: 2017-08-06 10:50:03

Question:

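I want to use coroutines to crawl and parse web pages. I wrote a sample and tested it. The program runs fine under Python 3.5 on Ubuntu 16.04 and exits once all the work is done. The source code is below.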

import aiohttp
import asyncio
from bs4 import BeautifulSoup

async def coro():
    coro_loop = asyncio.get_event_loop()
    url = u'https://www.python.org/'
    for _ in range(4):
        async with aiohttp.ClientSession(loop=coro_loop) as coro_session:
            with aiohttp.Timeout(30, loop=coro_session.loop):
                async with coro_session.get(url) as resp:
                    print('get response from url: %s' % url)
                    source_code = await resp.read()
                    soup = BeautifulSoup(source_code, 'lxml')

def main():
    loop = asyncio.get_event_loop()
    worker = loop.create_task(coro())
    try:
        loop.run_until_complete(worker)
    except KeyboardInterrupt:
        print ('keyboard interrupt')
        worker.cancel()
    finally:
        loop.stop()
        loop.run_forever()
        loop.close()

if __name__ == '__main__':
    main()

While testing, I found that when I press Ctrl+C to stop the program, the error 'Task exception was never retrieved' appears.

^Ckeyboard interrupt
Task exception was never retrieved
future: <Task finished coro=<coro() done, defined at ./test.py:8> exception=KeyboardInterrupt()>
Traceback (most recent call last):
  File "./test.py", line 23, in main
    loop.run_until_complete(worker)
  File "/usr/lib/python3.5/asyncio/base_events.py", line 375, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 345, in run_forever
    self._run_once()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 1312, in _run_once
    handle._run()
  File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
    self._callback(*self._args)
  File "/usr/lib/python3.5/asyncio/tasks.py", line 307, in _wakeup
    self._step()
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "./test.py", line 17, in coro
    soup = BeautifulSoup(source_code, 'lxml')
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 215, in __init__
    self._feed()
  File "/usr/lib/python3/dist-packages/bs4/__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 240, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1194, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:119773)
  File "src/lxml/parser.pxi", line 1316, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:119644)
  File "src/lxml/parsertarget.pxi", line 141, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:137264)
  File "src/lxml/parsertarget.pxi", line 135, in lxml.etree._TargetParserContext._handleParseResult (src/lxml/lxml.etree.c:137128)
  File "src/lxml/lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:11090)
  File "src/lxml/saxparser.pxi", line 499, in lxml.etree._handleSaxData (src/lxml/lxml.etree.c:131013)
  File "src/lxml/parsertarget.pxi", line 88, in lxml.etree._PythonSaxParserTarget._handleSaxData (src/lxml/lxml.etree.c:136397)
  File "/usr/lib/python3/dist-packages/bs4/builder/_lxml.py", line 206, in data
    def data(self, content):
KeyboardInterrupt

I checked the official Python docs but found no clues. I then tried catching KeyboardInterrupt inside coro().

try:
    soup = BeautifulSoup(source_code, 'lxml')
except KeyboardInterrupt:
    print ('capture exception')
    raise

The error occurs every time the try/except around BeautifulSoup() catches KeyboardInterrupt. It seems that BeautifulSoup causes the error. How can I fix it?

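Comments:

This has nothing to do with BeautifulSoup. The warning appears whenever an exception raised inside a task is never retrieved. You need to add a call to worker.exception() somewhere.

Answer 1: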

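When you call task.cancel(), the function does not actually cancel the task; it only marks the task for cancellation. The actual cancellation starts when the task next resumes execution: at that point asyncio.CancelledError is raised inside the task, forcing it to be cancelled. The task then finishes with this exception.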
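
The deferred behaviour described above can be observed with a small, self-contained sketch. It is only an illustration, not part of the original answer: the names worker, demo, run_demo and log are made up for this example, and asyncio.sleep stands in for real network I/O.

```python
import asyncio


async def worker(log):
    try:
        await asyncio.sleep(3600)  # the task is suspended here when cancel() is called
    except asyncio.CancelledError:
        log.append('CancelledError raised inside task')
        raise  # re-raise so the task really ends up cancelled


async def demo():
    log = []
    task = asyncio.ensure_future(worker(log))
    await asyncio.sleep(0)       # let the task start and reach its first await
    requested = task.cancel()    # only requests cancellation
    log.append('cancel() returned %s, task.cancelled() is %s'
               % (requested, task.cancelled()))
    try:
        await task               # the task resumes here and receives CancelledError
    except asyncio.CancelledError:
        pass
    log.append('after await, task.cancelled() is %s' % task.cancelled())
    return log


def run_demo():
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(demo())
    finally:
        loop.close()


if __name__ == '__main__':
    for line in run_demo():
        print(line)
```

Right after cancel() the task is not yet cancelled; it only becomes cancelled once the event loop gets a chance to resume it, which is why the caller has to keep the loop running until the task has unwound.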

On the other hand, asyncio warns you if one of your tasks finishes with an exception silently, that is, if you never check the result of the task.
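
As a rough illustration of when that warning fires, the sketch below runs a failing task and then drops it, once without and once with a call to task.exception(). The helper names failing and run_and_drop are hypothetical. The sketch assumes CPython, where the check runs when the task object is destroyed, and it installs a custom exception handler only to capture the message instead of letting it go to the log.

```python
import asyncio
import gc


async def failing():
    raise ValueError('boom')


def run_and_drop(retrieve):
    """Run a failing task, optionally retrieve its exception, then drop it.

    Returns the messages asyncio reported to the loop's exception handler.
    """
    reported = []
    loop = asyncio.new_event_loop()
    loop.set_exception_handler(
        lambda _loop, context: reported.append(context['message']))
    try:
        task = loop.create_task(failing())
        # Wait for the task to finish without touching its result.
        loop.run_until_complete(asyncio.wait([task]))
        if retrieve:
            task.exception()   # marks the exception as retrieved
        del task
        gc.collect()           # the check happens when the task is destroyed
    finally:
        loop.close()
    return reported


if __name__ == '__main__':
    print(run_and_drop(retrieve=False))
    print(run_and_drop(retrieve=True))
```

Calling task.result() or awaiting the task counts as retrieval as well; task.exception() is simply the variant that returns the exception instead of raising it.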

To avoid the problem, wait for the cancelled task to finish. It will end with asyncio.CancelledError, which you can suppress since you do not need it:

import asyncio
from contextlib import suppress


async def coro():
    # ... same body as in the question ...
    pass

def main():
    loop = asyncio.get_event_loop()
    worker = asyncio.ensure_future(coro())
    try:
        loop.run_until_complete(worker)
    except KeyboardInterrupt:
        print('keyboard interrupt')

        worker.cancel()
        with suppress(asyncio.CancelledError):
            loop.run_until_complete(worker)  # await task cancellation.
    finally:
        loop.close()

if __name__ == '__main__':
    main()
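
The traceback in the question shows the interrupt arriving while the task was running synchronous parsing code. In that situation the task has already finished with KeyboardInterrupt by the time main() handles it, so cancel() has nothing left to cancel. The following is a hedged sketch of one way to cover both cases, combining the answer above with the worker.exception() suggestion from the comments. The coroutine body is a stand-in that raises KeyboardInterrupt itself to simulate Ctrl+C, and the events list exists only so the outcome can be inspected.

```python
import asyncio
from contextlib import suppress


async def coro():
    # Stand-in for the parsing step: pretend Ctrl+C arrives while
    # synchronous code inside the task is running.
    raise KeyboardInterrupt


def main():
    events = []
    loop = asyncio.new_event_loop()
    worker = loop.create_task(coro())
    try:
        loop.run_until_complete(worker)
    except KeyboardInterrupt:
        events.append('keyboard interrupt')
        if worker.done():
            # The interrupt was raised inside the task, so the task has
            # already finished with it; just mark the exception as retrieved.
            if not worker.cancelled():
                events.append('retrieved %r' % worker.exception())
        else:
            # The task is still pending: cancel it and wait for it to unwind.
            worker.cancel()
            with suppress(asyncio.CancelledError):
                loop.run_until_complete(worker)
    finally:
        loop.close()
    return events


if __name__ == '__main__':
    print(main())
```

Checking worker.done() first keeps the two situations apart: a finished task only needs its exception retrieved, while a pending task still needs the loop to run so that the cancellation can take effect.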
