Why is BeautifulSoup find_all missing this element when searching whether the text contains a string?

Posted 2023-03-05


Asked: 2020-09-17 13:36:25

Question:

I have this test HTML page:



    <html lang="en">
    	<head>
    		<title></title>
    		<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    		<meta name="created" content="2020-05-29T11:12:00.0000000" />
    	</head>
    	<body data-absolute-enabled="true" style="font-family:Calibri;font-size:11pt">
    		<div id="div:659babcd-9de3-0e7a-27ba-7fa0325a40f7216" style="position:absolute;left:72px;top:43px;width:696px">
    			<p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7218" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt"><span style="font-weight:bold">Test1###yrdy</span></p>
    			<p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7220" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test2###qweqwe</p>
    			<p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b22" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt"><span style="color:red">Test3</span> ###qweqeqwe</p>
    			<p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b17" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test4 ### sfsfsdfds</p>
    			<p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b19" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test5### 121212</p>
    		</div>
    	</body>
    </html>

After turning the above into a soup, I do this:

    for element in soup.find_all(["p"], text=re.compile("###")):
        print(element)

This prints:



    <p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7218" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt"><span style="font-weight:bold">Test1###yrdy</span></p>
    <p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7220" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test2###qweqwe</p>
    <p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b17" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test4 ### sfsfsdfds</p>
    <p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b19" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test5### 121212</p>

Why is the p that corresponds to Test3 skipped?

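Update: Andrej, your suggestion works, but it confuses me, because searching with soup.find_all(["p"], text=re.compile("###")) should have the same effect. This loop

    for p in soup.html.body.find_all("p"):
        print(p.text)

prints the text of every paragraph, including Test3:

    Test1###yrdy
    Test2###qweqwe
    Test3 ###qweqeqwe
    Test4 ### sfsfsdfds
    Test5### 121212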

Comments:

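The ### is missing. Why are you using a regex?

Scroll to the right: Test3 does contain a ###, in ###qweqeqwe. Look at the Python code (re.compile(' ###')).

@MiniMe This is a limitation of BeautifulSoup; it is inconsistent in this case (I think). For more robust results you can use a lambda, for example soup.find_all(lambda tag: tag.name=='p' and '###' in tag.text)

@MiniMe Yes, it should have the effect you describe. But it doesn't; maybe it is a bug, or a consequence of how the internals of bs4 are implemented? Perhaps someone more experienced can say why. Personally, I rarely pass a regex as a parameter (but you can use one inside the lambda!)

OK, I will leave this question unanswered for now; maybe someone who knows more about BS can answer it. Thanks for your help.

Answer 1: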

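Test3 is wrapped in a span, so that p tag holds two child nodes (the span and the trailing text) rather than a single string containing ###, and the text= filter does not match it. To work around this, use soup.select('p').

Or simply loop over all p tags with for element in soup.find_all(['p']):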

Example:

import re

from bs4 import BeautifulSoup

html = """
    <html lang="en">
        <head>
            <title></title>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
            <meta name="created" content="2020-05-29T11:12:00.0000000" />
        </head>
        <body data-absolute-enabled="true" style="font-family:Calibri;font-size:11pt">
            <div id="div:659babcd-9de3-0e7a-27ba-7fa0325a40f7216" style="position:absolute;left:72px;top:43px;width:696px">
                <p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7218" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt"><span style="font-weight:bold">Test1###yrdy</span></p>
                <p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7220" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test2###qweqwe</p>
                <p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b22" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt"><span style="color:red">Test3 </span> ###qweqeqwe</p>
                <p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b17" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test4 ### sfsfsdfds</p>
                <p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b19" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test5### 121212</p>
            </div>
        </body>
    </html>
"""

soup = BeautifulSoup(html, features='html.parser')
for element in soup.select('p'):
    print(element)

This prints all of them:

<p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7218" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt"><span style="font-weight:bold">Test1###yrdy</span></p>
<p id="p:659babcd-9de3-0e7a-27ba-7fa0325a40f7220" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test2###qweqwe</p>
<p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b22" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt"><span style="color:red">Test3 </span> ###qweqeqwe</p>
<p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b17" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test4 ### sfsfsdfds</p>
<p id="p:d59b11dc-654f-0d5c-0ee2-f66181a6fa4b19" lang="en-US" style="font-size:10.5pt;margin-top:0pt;margin-bottom:0pt">Test5### 121212</p>

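For context on why the text filter skips the Test3 paragraph, here is a minimal sketch. It assumes a reduced version of the question's markup (the ids t1, t2 and t3 are hypothetical labels added for illustration) and uses string=, the newer name for the text= argument. The filter is compared against each tag's .string, which is None as soon as a tag has more than one child node.

```python
import re
from bs4 import BeautifulSoup

# Reduced versions of the Test1, Test2 and Test3 paragraphs.
# The ids are hypothetical and exist only to make the output readable.
html = (
    '<p id="t1"><span>Test1###yrdy</span></p>'
    '<p id="t2">Test2###qweqwe</p>'
    '<p id="t3"><span>Test3</span> ###qweqeqwe</p>'
)
soup = BeautifulSoup(html, "html.parser")

# .string is None when a tag has more than one child node,
# while get_text() joins the strings of all descendants.
rows = [(p["id"], p.string, p.get_text()) for p in soup.find_all("p")]
for tag_id, string, text in rows:
    print(tag_id, repr(string), repr(text))

# The string filter is checked against .string, so t3 is not matched.
hit_ids = [p["id"] for p in soup.find_all("p", string=re.compile("###"))]
print(hit_ids)
```

The t1 paragraph still matches because a tag with exactly one child takes its .string from that child, so the span's text is visible through the p.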

Comments:

soup.select() has no text= parameter, so no text filter is applied and it simply prints all <p> tags.
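If the goal is still to keep only the paragraphs that contain ###, one option is the function-based filter suggested in the comments on the question. The sketch below assumes a reduced version of the question's markup plus one extra paragraph without the marker; the helper name p_with_marker is hypothetical. It tests the combined text of each p tag, so nested tags do not hide the match.

```python
import re
from bs4 import BeautifulSoup

# Reduced markup; the last paragraph is an added example without the marker.
html = (
    '<p><span>Test1###yrdy</span></p>'
    '<p>Test2###qweqwe</p>'
    '<p><span>Test3</span> ###qweqeqwe</p>'
    '<p>No marker here</p>'
)
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile("###")


def p_with_marker(tag):
    """Return True for <p> tags whose combined descendant text matches."""
    return tag.name == "p" and pattern.search(tag.get_text()) is not None


# find_all accepts a function that is called once per tag.
matches = soup.find_all(p_with_marker)
for p in matches:
    print(p.get_text())
```

Because the regex runs inside the function, the same pattern object can be reused, and the paragraph without ### is filtered out instead of being printed.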
