WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. With Requests and BeautifulSoup

Posted 2023-02-23

【Title】: WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER. With Requests and BeautifulSoup
【Posted】: 2015-07-18 13:41:22
【Question】:

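This web-scraping code was working a few minutes ago, but now I get this warning about encoding. Because the request no longer returns HTML, BeautifulSoup returns None when I search for a tag's contents. What is going wrong here? I tried googling this encoding problem but could not find a clear answer.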

import requests
from bs4 import BeautifulSoup


url = 'http://finance.yahoo.com/q?s=aapl&fr=uh3_finance_web&uhb=uhb2'

data = requests.get(url)
soup = BeautifulSoup(data.content).text
print(data)

The result is:

0.0 seconds
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]>
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
<Response [200]> 


Process finished with exit code 0

【Comments】:

First, I cannot reproduce this. Second, soup in your case is a string, not a BeautifulSoup object.

【Answer 1】:

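from urllib.request import urlopen
from bs4 import BeautifulSoup
# notiurl holds the URL of the page to fetch
response = urlopen(notiurl)
html = response.read().decode(encoding="iso-8859-1")
soup = BeautifulSoup(html, 'html.parser')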

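Check the encoding ---> print(soup.original_encoding) (this is None when BeautifulSoup is given an already-decoded str rather than bytes)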

Documentation ----> https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
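
To make the explicit-decoding step concrete, here is a minimal sketch that uses only the standard library. The sample bytes are invented for illustration and stand in for what response.read() might return; nothing here contacts the site from the question. It assumes the page really is ISO-8859-1, which this answer implies but the question does not confirm.

```python
# Hypothetical payload: bytes as a server might send them in ISO-8859-1.
raw = "Cours de clôture: 125,50".encode("iso-8859-1")

# Decoding with the wrong codec and errors="replace" swaps every
# undecodable byte for U+FFFD, the REPLACEMENT CHARACTER from the warning.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the codec the bytes were written in keeps the text intact.
right = raw.decode("iso-8859-1")

REPLACEMENT = "\ufffd"
print("replacement character present:", REPLACEMENT in wrong)
print("decoded correctly:", right)
```

If the real encoding is unknown, guessing iso-8859-1 will never raise an error because every byte maps to some character, so a clean decode is not proof that the guess was right.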

【Comments】:

【Answer 2】:

The following BeautifulSoup constructor call worked for me:

with open(html_path, 'rb') as f:  # binary mode, so that from_encoding is actually applied
    soup = BeautifulSoup(f, "html.parser", from_encoding="iso-8859-1")
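
As a rough sketch of the same idea without a file on disk, the snippet below passes raw bytes together with from_encoding and keeps the BeautifulSoup object instead of taking .text right away, which is the point raised in the comment on the question. The markup, the id "quote" and the variable names are made up for illustration; with Requests, the bytes would come from data.content.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, encoded the way an ISO-8859-1 page would arrive.
html_bytes = (
    "<html><body><p id='quote'>Clôture: 125,50</p></body></html>"
).encode("iso-8859-1")

# from_encoding only has an effect on bytes input; on a str it is ignored.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="iso-8859-1")

# Keep the soup object for searching. soup.text is a plain str and
# cannot be searched for tags.
tag = soup.find("p", id="quote")

# find() returns None when nothing matches, so guard before using the tag.
quote_text = tag.get_text() if tag is not None else None
missing = soup.find("p", id="does-not-exist")

print(quote_text)
print(soup.original_encoding)
```

Checking for None before calling get_text() avoids the AttributeError that appears when the server sends something other than the expected page.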

【Comments】:
