lxml / BeautifulSoup parser warning

Posted 2023-02-23


Original question posted: 2018-10-07 07:37:40

Question:

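Using Python 3, I am trying to parse ugly HTML (which I do not control) by using lxml together with BeautifulSoup, as described here: http://lxml.de/elementsoup.html

Specifically, I want to work with lxml, but I need BeautifulSoup to do the parsing because, as I said, the HTML is ugly and lxml rejects it on its own.

The link above says: "All you need to do is pass it to the fromstring() function:"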

from lxml.html.soupparser import fromstring
root = fromstring(tag_soup)

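So this is what I am doing:

import requests
from lxml.html.soupparser import fromstring

URL = 'http://some-place-on-the-internet.com'
html_goo = requests.get(URL).text
root = fromstring(html_goo)

It works, in that I can manipulate the HTML just fine afterwards. My problem is that every time I run the script, I get this annoying warning: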

/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

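My problem is probably obvious: I am not instantiating BeautifulSoup myself. I tried adding the suggested argument to the fromstring function, but that only gave me the error: TypeError: 'str' object is not callable. Searching online has proven fruitless so far.

I would like to get rid of that warning message. Thanks in advance for any help.

Comments:

Have you tried root = lxml.html.soupparser.fromstring(html_goo)?

That would not work, because lxml is a module. But I tried it anyway, and it gave me: NameError: name 'lxml' is not defined (which makes sense, I think).

Can't it be called as soupparser.fromstring(html_goo)? I'm surprised it doesn't work when this comes straight from their documentation :p

No, running soupparser.fromstring(...) instead of fromstring(...) produces the same result. I find it odd too.

No problem! ^^ Your solution uses almost exactly the same reasoning as my last comment, that you can pass constructor arguments through fromstring(). Glad it helped you find the solution! Good luck!

Answer 1: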

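I had to read the source code of both lxml and BeautifulSoup to figure this out.

I am posting my own answer here in case someone else needs it in the future.

The fromstring function in question is defined like this: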

def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):

The **bsargs arguments are eventually passed to the BeautifulSoup constructor, which is called like this (in another function, _parse):

tree = beautifulsoup(source, **bsargs)

The BeautifulSoup constructor is defined like this:

def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):
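The forwarding described above can be sketched with the standard library alone. In this minimal sketch, fake_soup and fromstring_sketch are hypothetical stand-ins that only copy the parameter order of the two signatures quoted here; they are not the real lxml or BeautifulSoup code. The sketch also suggests one plausible reason for the TypeError mentioned in the question, assuming the extra argument was passed positionally: it would bind to the beautifulsoup parameter and then be called as if it were a class.

```python
def fake_soup(markup="", features=None, **kwargs):
    """Hypothetical stand-in for the BeautifulSoup constructor."""
    return {"markup": markup, "features": features}


def fromstring_sketch(data, beautifulsoup=None, makeelement=None, **bsargs):
    """Hypothetical stand-in that mirrors the parameter order of fromstring."""
    if beautifulsoup is None:
        beautifulsoup = fake_soup
    # Keyword arguments collected in bsargs are forwarded unchanged.
    return beautifulsoup(data, **bsargs)


# A keyword argument lands in **bsargs and reaches the constructor.
tree = fromstring_sketch("<p>hi</p>", features="html.parser")
print(tree["features"])  # html.parser

# A positional argument binds to `beautifulsoup`, so the string gets called.
try:
    fromstring_sketch("<p>hi</p>", "html.parser")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # 'str' object is not callable
```

Passing the parser by keyword is what keeps it out of the beautifulsoup slot.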

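Now, back to the warning in the question, which suggests adding the argument "html.parser" to the BeautifulSoup constructor. Going by the signature above, the second positional parameter is the one named features.

Since fromstring forwards keyword arguments to the BeautifulSoup constructor, we can specify the parser by passing features as a keyword argument to fromstring, like this: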

root = fromstring(clean, features='html.parser')

Poof. The warning is gone.
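One way to check this is to record warnings around the call. The following is a minimal sketch, assuming lxml and beautifulsoup4 are both installed; the sample markup and the variable names caught and parser_warnings are made up for illustration.

```python
import warnings

from lxml.html.soupparser import fromstring

# Made-up sample markup with an unclosed <p> element.
html_goo = "<html><body><p>Hello<br>world</body></html>"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    root = fromstring(html_goo, features="html.parser")

# Keep only the "no parser specified" warning quoted in the question.
parser_warnings = [
    w for w in caught
    if "No parser was explicitly specified" in str(w.message)
]

print(root.tag)
print(len(parser_warnings))
```

If the list of matching warnings is empty, the features keyword reached the BeautifulSoup constructor as described.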

Answer 2:

When using BeautifulSoup, we usually do the following:

[variable] = BeautifulSoup([contents you want to analyze])

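Here is the catch:

When no parser is named, BeautifulSoup picks one itself. If you have installed "lxml", it notices that automatically and uses it as the parser. The message is not an error, just a notice.

So how do you get rid of it?

Simply name the parser, like this:

[variable] = BeautifulSoup([contents you want to analyze], features="lxml")

(Based on BeautifulSoup 4.6.3, the latest version at the time of writing.)

Note that different versions of BeautifulSoup may expect a different way, or syntax, of specifying the parser; read the warning message carefully and follow what it says.

Good luck!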

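Comments:

If your content is XML, use soup = BeautifulSoup(html_content_in_xml, 'lxml'); if your content is HTML, use soup = BeautifulSoup(html_content, 'html.parser'). Note that in BeautifulSoup the name 'lxml' selects lxml's HTML parser; lxml's XML parser is selected with 'xml' or 'lxml-xml'.

Answer 3: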

For other initializations, such as:

soup = BeautifulSoup(html_doc)

use

soup = BeautifulSoup(html_doc, 'html.parser')

instead.
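As a minimal sketch of that change, assuming beautifulsoup4 is installed, the parser can be named either positionally or through the features keyword. The sample markup and the variable names caught and parser_warnings are made up for illustration.

```python
import warnings

from bs4 import BeautifulSoup

# Made-up sample document.
html_doc = "<html><body><p>Hello, <b>world</b></p></body></html>"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, even repeated ones
    soup = BeautifulSoup(html_doc, "html.parser")              # positional form
    soup_kw = BeautifulSoup(html_doc, features="html.parser")  # keyword form

# Keep only the "no parser specified" warning quoted in the question.
parser_warnings = [
    w for w in caught
    if "No parser was explicitly specified" in str(w.message)
]

print(len(parser_warnings))
print(soup.p.get_text())
```

Both forms set the same features parameter, so either one should keep the parser choice stable across systems.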
