Extract text with line break in BeautifulSoup
【Title】: Extract text with line break in BeautifulSoup 【Posted】: 2019-04-08 06:54:18 【Question】: I want to use BeautifulSoup to extract text from markup that uses "br" tags, keeping each "br" as a line break in the result.
html = '<td class="s4 softmerge" dir="ltr"><div class="softmerge-inner" style="width: 5524px; left: -1px;">But when he saw many of the Pharisees and Sadducees come to his baptism, he said unto them, <br/>O generation of vipers, who hath warned you to flee from the wrath to come?<br/>Bring forth therefore fruits meet for repentance:<br/>And think not to say within yourselves, We have Abraham to our father: for I say unto you, that God is able of these stones to raise up children unto Abraham.<br/>And now also the axe is laid unto the root of the trees: therefore every tree which bringeth not forth good fruit is hewn down, and cast into the fire.<br/>I indeed baptize you with water unto repentance. but he that cometh after me is mightier than I, whose shoes I am not worthy to bear: he shall baptize you with the Holy Ghost, and with fire:<br/>Whose fan is in his hand, and he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.</div></td>'
I want to get a result like this, as a string:
But when he saw many of the Pharisees and Sadducees come to his baptism, he said unto them,
O generation of vipers, who hath warned you to flee from the wrath to come?
Bring forth therefore fruits meet for repentance:
And think not to say within yourselves, We have Abraham to our father: for I say unto you, that God is able of these stones to raise up children unto Abraham.
And now also the axe is laid unto the root of the trees: therefore every tree which bringeth not forth good fruit is hewn down, and cast into the fire.
I indeed baptize you with water unto repentance. but he that cometh after me is mightier than I, whose shoes I am not worthy to bear: he shall baptize you with the Holy Ghost, and with fire:
Whose fan is in his hand, and he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.
How can I get this result?
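【Answer 1】: Sorry if this isn't what you're after, but you could try str.replace or a regex.
For example, you can compile a regular expression that finds every <br> tag and substitute a newline (\n) for each match.
If you are working with a BeautifulSoup object, convert it to a string first, for example html = str(soupelement). Its .string attribute returns None when the element has more than one child, which is the case for the div in the question.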
import re
regex = re.compile(r"<br\s*/?>", re.IGNORECASE)  # the filter: matches <br>, <br/> and <br />
html = 'blah blah b<br>lah <br/> bl<br/>'
newtext = re.sub(regex, '\n', html)  # replaces each match with a newline
print(newtext)
# newtext is now 'blah blah b\nlah \n bl\n'
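Applied to markup like the one in the question, the same substitution could be followed by a second pass that strips the remaining tags. The following is only a minimal sketch using the standard library: the helper name br_to_newlines and the shortened sample are hypothetical, and the naive <[^>]+> pattern assumes well-formed markup with no ">" inside attribute values.

```python
import html
import re

BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)  # <br>, <br/>, <br />
ANY_TAG = re.compile(r"<[^>]+>")                   # naive: any remaining tag


def br_to_newlines(markup: str) -> str:
    """Hypothetical helper: turn <br> into newlines, then drop other tags."""
    text = BR_TAG.sub("\n", markup)
    text = ANY_TAG.sub("", text)
    # Decode entities such as &amp; and trim surrounding whitespace per line.
    return "\n".join(line.strip() for line in html.unescape(text).split("\n"))


# Shortened stand-in for the markup in the question.
sample = (
    '<td class="s4 softmerge" dir="ltr"><div class="softmerge-inner">'
    "he said unto them, <br/>O generation of vipers<br/>"
    "Bring forth therefore fruits</div></td>"
)
print(br_to_newlines(sample))
```

A regex pass like this is fragile on arbitrary HTML, so it is probably best kept for small, predictable fragments.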
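【Comments】:
Thanks for your help.
【Answer 2】: There are two ways to get the result: iterate over the children of the tag and keep the ones that are NavigableString instances, or capture the text between the tags with a regular expression.
Code: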
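import re
from bs4 import BeautifulSoup, NavigableString

# Method 1: print only the text nodes that are direct children of the div
soup = BeautifulSoup(html, "lxml")
for ele in soup.find("div", class_="softmerge-inner"):
    if isinstance(ele, NavigableString):
        print(ele)
print()
# Method 2: capture the text that follows the opening <div ...> or each <br/>
result = [ele[1] for ele in re.findall(r"""(<div.*?>|<br.>)(.*?)(?=<\w{1,4}/>|</\w{1,4}>)""", html)]
for e in result:
    print(e)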
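An alternative to filtering NavigableString children is to replace each <br> element with a newline inside the parse tree and then read the text in one call. This is a sketch rather than part of the original answer; it assumes the built-in "html.parser" backend (so lxml is not required) and uses a shortened stand-in for the question's markup.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question.
html = (
    '<td class="s4 softmerge" dir="ltr"><div class="softmerge-inner">'
    "he said unto them, <br/>O generation of vipers<br/>"
    "Bring forth therefore fruits</div></td>"
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="softmerge-inner")

# Swap every <br> element for a literal newline in the tree.
for br in div.find_all("br"):
    br.replace_with("\n")

text = div.get_text()
print(text)
```

Because the result is a single string, it can be stored or post-processed directly instead of being printed line by line.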
【Comments】:
Thanks for your help.
【Answer 3】: You can try this:
html = '''<p>Hi</p>
<p>how are you </p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
print(soup.getText())
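Note that in the <p> sample above the line breaks in the output come from the newline characters already present in the source string, not from the tags. For text separated by <br> tags, get_text() accepts separator and strip arguments; the sketch below assumes the built-in "html.parser" backend and a made-up sample string.

```python
from bs4 import BeautifulSoup

# Made-up sample: the lines are separated only by <br/> tags.
html = "<div>first line <br/>second line<br/>third line</div>"
soup = BeautifulSoup(html, "html.parser")

# Without a separator, adjacent text nodes are simply concatenated.
plain = soup.get_text()

# With a separator, each text node is stripped and joined with a newline.
separated = soup.get_text(separator="\n", strip=True)
print(separated)
```

The separator is inserted between every pair of text nodes, so inline tags such as <b> or <a> inside a line would also produce line breaks with this approach.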