What is the best way to extract the content of an HTML file into a string? (in Python) [duplicate]

Posted 2023-02-24


【Posted】: 2021-08-20 01:38:03
【Question】:

How can I extract the content you see on a web page into a string? For example, turn this:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>myWebpage</title>
</head>
<body>
    <p>this</p>
    <p>is</p>
    <p>an</p>
    <p>example</p>
</body>
</html>

into a string like this:

this is an example

【Comments】:

What have you tried so far?

Use the BeautifulSoup package.

【Answer 1】:

You can use Selenium; its documentation is here: https://pypi.org/project/selenium/

【Answer 2】:

This program does what you want: https://github.com/Alir3z4/html2text

You can also try:

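from urllib.request import urlopen  # Python 3 location of urlopen

# nltk.clean_html() was removed in NLTK 3; BeautifulSoup's get_text() is used here instead
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# Strip the markup and keep only the text of the page
raw = BeautifulSoup(html, "html.parser").get_text()
print(raw)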

For example, this extracts the text from that web page.
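
For a small local document like the one in the question, a third-party package may not be needed at all. The following is only a minimal sketch using the standard library's html.parser module; the names VisibleTextParser and html_to_text are made up for illustration, and skipping just head, script and style is an assumption about what counts as "visible" text, not a full rendering model.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring tags whose content is not displayed."""

    # Assumption: content inside these tags is never shown on the page.
    SKIPPED_TAGS = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text only when we are outside every skipped tag.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text of an HTML string, with pieces joined by single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>myWebpage</title>
</head>
<body>
    <p>this</p>
    <p>is</p>
    <p>an</p>
    <p>example</p>
</body>
</html>"""

print(html_to_text(sample))  # prints: this is an example
```

To use it on a file, read the file into a string first (for instance with open("page.html", encoding="utf-8").read(), where page.html is a placeholder name) and pass that string to html_to_text. For messy real-world pages, BeautifulSoup or html2text will usually cope better than this sketch.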
