Using BeautifulSoup to search HTML for string

Posted 2023-02-16

【Title】: Using BeautifulSoup to search HTML for string 【Posted】: 2012-02-14 16:49:21 【Question】:

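I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see whether the string 'Python' appears on the page http://python.org

When I use find_string = soup.body.findAll(text='Python'), find_string is [].

But when I use find_string = soup.body.findAll(text=re.compile('Python'), limit=1), find_string is [u'Python Jobs'], as expected.

What is the difference between these two statements that makes the second one work when the page contains more than one instance of the word being searched for?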

【Comments on the question】:

【Answer 1】:

I haven't used BeautifulSoup, but maybe the following can be of some help.

import re
import urllib2
stuff = urllib2.urlopen(your_url_goes_here).read()  # stuff will contain the *entire* page

# Replace the string Python with your desired regex
results = re.findall('(Python)',stuff)

for i in results:
    print i

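I'm not suggesting this as a replacement, but maybe you can get some value out of the idea until a direct answer comes along.

【Comments】:

Readers arriving from Google: see ***.com/questions/34475051/… for a modern update.

【Answer 2】: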

text='Python' searches for elements whose text is exactly the string you provided:

import re
from BeautifulSoup import BeautifulSoup

html = """<p>exact text</p>
   <p>almost exact text</p>"""
soup = BeautifulSoup(html)
print soup(text='exact text')
print soup(text=re.compile('exact text'))

Output

[u'exact text']
[u'exact text', u'almost exact text']
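For readers on Python 3, the same comparison can be sketched with the bs4 package. This is a minimal illustration rather than part of the original answer: it assumes Beautiful Soup 4.4.0 or later, where the string keyword is the newer name for text, and it passes "html.parser" explicitly.

```python
import re
from bs4 import BeautifulSoup

html = "<p>exact text</p><p>almost exact text</p>"
soup = BeautifulSoup(html, "html.parser")

# A plain string has to equal the whole text node.
exact = soup.find_all(string="exact text")

# A compiled pattern is applied with search(), so a substring is enough.
partial = soup.find_all(string=re.compile("exact text"))

print(exact)    # ['exact text']
print(partial)  # ['exact text', 'almost exact text']
```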

To "see whether the string 'Python' appears on the page http://python.org":

import urllib2
html = urllib2.urlopen('http://python.org').read()
print 'Python' in html # -> True

If you need the position of a substring within the string, you can use html.find('Python').
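Note that html.find('Python') reports only the first position. If every position is needed, one possible approach is re.finditer, sketched here for Python 3. The helper name find_all_positions and the sample string are made up for illustration; the sample stands in for a downloaded page.

```python
import re

def find_all_positions(haystack, needle):
    """Return the start index of every non-overlapping occurrence of needle."""
    # re.escape makes the needle match literally, even if it contains
    # characters such as '.' or '+' that are special in a regex.
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

# Hypothetical stand-in for the HTML of a downloaded page.
sample = "<p>Python Jobs</p><p>About Python</p>"
positions = find_all_positions(sample, "Python")
print(positions)  # [3, 27]
```

An empty list means the string does not occur at all, so the same helper also answers the yes/no question.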

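【Comments】:

Is it possible to find all occurrences of the string Python, not just one?

@Timo ***.com/questions/4664850/…

[m.start() for m in re.finditer('test',soup')] ? I'm lost..

@Timo Copy the code from the accepted answer to the *** question I've linked. Make sure the snippet works in your environment. Then start adapting it to your task, one simple change at a time. Once it breaks (that is, once it does something you did not expect), use it as the minimal reproducible code example for a new *** question.

【Answer 3】: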

The following line looks for the exact NavigableString 'Python':

>>> soup.body.findAll(text='Python')
[]

Note that the following NavigableString is found:

>>> soup.body.findAll(text='Python Jobs') 
[u'Python Jobs']

Also note this behaviour:

>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]

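So your regular expression looks for any occurrence of 'Python' inside a text node, whereas the plain string requires the whole NavigableString to equal 'Python'.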
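The three cases can be put side by side in a small Python 3 sketch. The HTML fragment is a made-up stand-in for the python.org markup and, unlike the real page in the session above, it deliberately includes a node whose entire text is 'Python'. The example assumes the bs4 package, version 4.4.0 or later, for the string keyword.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; not the real python.org markup.
html = "<body><a href='/jobs'>Python Jobs</a><b>Python</b></body>"
soup = BeautifulSoup(html, "html.parser")

# Plain string: the whole text node must equal 'Python'.
whole = soup.body.find_all(string="Python")

# Anchored pattern: behaves like the exact match for this input.
anchored = soup.body.find_all(string=re.compile(r"^Python$"))

# Unanchored pattern: any node containing 'Python' qualifies.
anywhere = soup.body.find_all(string=re.compile("Python"))

print(whole)     # ['Python']
print(anchored)  # ['Python']
print(anywhere)  # ['Python Jobs', 'Python']
```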

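【Comments】:

Is it possible to get the parent tag of a particular text?

@Samay soup.find(text='Python Jobs').parent (from the docs: "Going up")

【Answer 4】:

In addition to the accepted answer, you can use a lambda instead of a regex: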

from bs4 import BeautifulSoup

html = """<p>test python</p>"""

soup = BeautifulSoup(html, "html.parser")

print(soup(text="python"))
print(soup(text=lambda t: "python" in t))

Output:

[]
['test python']
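The callable form also makes a case-insensitive search straightforward, and each matched text node can be traced back to its enclosing tag through .parent. The following is a minimal sketch under the same assumptions as above (Python 3, bs4 4.4.0 or later for the string keyword); the HTML fragment is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for illustration.
html = "<div><p>test python</p><span>Python Jobs</span></div>"
soup = BeautifulSoup(html, "html.parser")

# The callable receives each text node; lower() ignores letter case.
hits = soup.find_all(string=lambda t: "python" in t.lower())

# .parent steps up from a text node to the tag that contains it.
parent_names = [node.parent.name for node in hits]

print(hits)          # ['test python', 'Python Jobs']
print(parent_names)  # ['p', 'span']
```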

【Comments】:
