Writing pandas DataFrame to JSON in unicode
【Posted】: 2017-01-29 10:47:12
【Question】: I am trying to write a pandas DataFrame containing unicode to JSON, but the built-in .to_json function escapes the non-ASCII characters. How can I fix this?
Example:
import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
df.to_json('df.json')
This gives:
{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}
which differs from the desired result:
{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}
I tried adding the force_ascii=False argument:
import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
df.to_json('df.json', force_ascii=False)
But this produces the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u03c4' in position 11: character maps to <undefined>
I am using WinPython 3.4.4.2 64bit with pandas 0.18.0.
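【Answer 1】: There is also another way to do this. Because JSON consists of keys (strings in double quotes) and values (strings, numbers, nested JSON objects or arrays), and because it closely resembles Python's dictionaries, you can get JSON from a pandas DataFrame with a simple conversion and some string manipulation: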
import pandas as pd
df = pd.DataFrame([['τ', 'a', 1], ['π', 'b', 2]])
# convert index values to string (when they're something else - JSON requires strings for keys)
df.index = df.index.map(str)
# convert column names to string (when they're something else - JSON requires strings for keys)
df.columns = df.columns.map(str)
# convert DataFrame to dict, dict to string and simply jsonify quotes from single to double quotes
js = str(df.to_dict()).replace("'", '"')
print(js) # print or write to file or return as REST...anything you want
Output:
{"0": {"0": "τ", "1": "π"}, "1": {"0": "a", "1": "b"}, "2": {"0": 1, "1": 2}}
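Update:
Following a note by @Swier (thanks), strings containing double quotes in the original DataFrame can be a problem. df.to_json() escapes them (i.e. '"a"' becomes "\"a\"" in the JSON output), whereas the plain quote replacement above does not. A small change to the string approach handles this too. Full example: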
import pandas as pd

def run_jsonifier(df):
    # convert index values to string (when they're something else)
    df.index = df.index.map(str)
    # convert column names to string (when they're something else)
    df.columns = df.columns.map(str)
    # convert DataFrame to dict and dict to string
    js = str(df.to_dict())
    # store indices of double quote marks in string for later update
    idx = [i for i, ch in enumerate(js) if ch == '"']
    # jsonify quotes from single to double quotes
    js = js.replace("'", '"')
    # add \ to original double quotes to make it json-like escape sequence
    for add, i in enumerate(idx):
        js = js[:i+add] + '\\' + js[i+add:]
    return js

# define double-quotes-rich dataframe
df = pd.DataFrame([['τ', '"a"', 1], ['π', 'this" breaks >>"<""< ', 2]])
# run our function to convert dataframe to json
print(run_jsonifier(df))
# run original `to_json()` to see difference
print(df.to_json())
Output:
{"0": {"0": "τ", "1": "π"}, "1": {"0": "\"a\"", "1": "this\" breaks >>\"<\"\"< "}, "2": {"0": 1, "1": 2}}
{"0":{"0":"\u03c4","1":"\u03c0"},"1":{"0":"\"a\"","1":"this\" breaks >>\"<\"\"< "},"2":{"0":1,"1":2}}
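【Comments】:
Converting the result to a string and replacing the quote marks will produce invalid JSON if any of the text values contain quote marks. pd.DataFrame([['τ', 'a', 1], ['π', 'this breaks >>"<< ', 2]]) will produce {"0": {"0": "τ", "1": "π"}, "1": {"0": "a", "1": "this breaks >>"<< "}, "2": {"0": 1, "1": 2}}
Thanks @Swier - I have updated the answer to handle this kind of problem.
【Answer 2】: Opening a file with the encoding set to utf-8 and then passing that file to the .to_json function fixes the problem: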
with open('df.json', 'w', encoding='utf-8') as file:
    df.to_json(file, force_ascii=False)
This gives the correct output:
{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}
Note: it does still require the force_ascii=False argument.
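A likely explanation for the original error is that the file was opened with the platform's default encoding, which cannot represent τ. The sketch below illustrates this with the standard library only. It assumes cp1252 as the Windows default (an assumption; the actual default depends on the system locale) and writes to a temporary directory rather than the working directory.

```python
import json
import os
import tempfile

# JSON text shaped like the unescaped output above
text = '{"0":{"0":"τ","1":"π"},"1":{"0":"a","1":"b"},"2":{"0":1,"1":2}}'

# Assumption: cp1252 stands in for the Windows default encoding
try:
    text.encode('cp1252')
    cp1252_ok = True
except UnicodeEncodeError as exc:
    cp1252_ok = False
    print('cp1252 failed:', exc)

# With an explicit utf-8 encoding on both write and read, the text round-trips
path = os.path.join(tempfile.mkdtemp(), 'df.json')
with open(path, 'w', encoding='utf-8') as file:
    file.write(text)
with open(path, encoding='utf-8') as file:
    loaded = json.load(file)

print(cp1252_ok)
print(loaded == json.loads(text))
```

Reading the file back needs the same explicit encoding, otherwise the platform default may be used again.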