Parsing Column in Pandas DataFrame with one column that contains a nested JSON string
【Posted】: 2018-09-26 08:52:23
【Question】: I have a DataFrame in Python that looks like the one below. One column (called "json" below) contains a large nested JSON string. How can I parse it so that I end up with a nice, clean DataFrame with many columns? In particular, I need the cost and the monthly amount for each ID in separate columns. Ideally, my table would look like this:
id, name, cost, monthly
10001, frank, 15.85, 15.85
10002, mary, 30.86, 23.03
import pandas as pd

d = {
    'id': ['10001', '10002'],
    'json': [
        '{"costs":[{"cost":15.85}],"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":15.85,"rating":"A++","waiverOfPremium":1.74,"carrier":"companyabc","face":250000,"term":20,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}],"agentSuggestion":{"costs":[{"cost":15.85}],"options":{"product":"XYZt","gender":"male","healthClass":"0","smoker":"false","age":32,"term":"20","faceAmount":250000,"waiverOfPremiumAmount":1.74,"includeWaiverOfPremium":false,"state":"CT"},"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":15.85,"rating":"A++","waiverOfPremium":1.74,"carrier":"companyabc","face":250000,"term":20,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}]}}',
        '{"costs":[{"cost":30.86}],"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":23.03,"rating":"A++","waiverOfPremium":7.83,"carrier":"companyabc","face":1000000,"term":10,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}],"agentSuggestion":{"costs":[{"cost":30.86}],"options":{"product":"XYZt","gender":"female","healthClass":"0","smoker":"false","age":35,"term":10,"faceAmount":1000000,"waiverOfPremiumAmount":7.83,"includeWaiverOfPremium":true,"state":"GA"},"policies":[{"logo":"HLIF-transparent-inhouse.png","monthly":23.03,"rating":"A++","waiverOfPremium":7.83,"carrier":"companyabc","face":1000000,"term":10,"newFace":null,"newMonthly":null,"isCompanyD":true,"carrierCode":"xyz","product":"XYZt"}]}}',
    ],
    'name': ['frank', 'mary'],
}
test = pd.DataFrame(data=d)
【Comments】:
【Answer 1】: Here you go. Your JSON contains two different costs (the top-level costs and the agentSuggestion costs), so both are added here:
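import json
import pandas as pd

# d is the dictionary defined in the question
test = pd.DataFrame(d, columns=['id', 'json', 'name'])
# Top-level cost: first entry of the "costs" list
test['cost'] = test['json'].transform(lambda x: json.loads(x)['costs'][0]['cost'])
# Cost nested under "agentSuggestion"
test['agent_suggestion_cost'] = test['json']\
    .transform(lambda x: json.loads(x)['agentSuggestion']['costs'][0]['cost'])
print(test)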
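You can follow similar logic to parse the other fields, such as monthly. For further reference, look for a JSON beautifier (for example JSTool with Notepad++) to view your JSON; that will help in understanding how it is structured.
If you find this useful, feel free to accept it.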
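As a minimal sketch of that "similar logic" for the monthly field: the example below assumes the monthly amount wanted is the one in the first entry of the top-level policies list, and it uses a shortened stand-in for the question's data that keeps only the fields it reads. Parsing each string once with json.loads and reusing the result is a choice made for this sketch, not part of the answer above.

```python
import json

import pandas as pd

# Shortened stand-in for the question's data (assumption: only the
# fields read below are kept; the real strings hold many more keys).
d = {
    "id": ["10001", "10002"],
    "name": ["frank", "mary"],
    "json": [
        '{"costs":[{"cost":15.85}],"policies":[{"monthly":15.85}]}',
        '{"costs":[{"cost":30.86}],"policies":[{"monthly":23.03}]}',
    ],
}
test = pd.DataFrame(d)

# Decode every JSON string once, so each field lookup reuses the dict.
parsed = test["json"].apply(json.loads)

# Pick the nested values: first entry of "costs" and of "policies".
test["cost"] = parsed.apply(lambda doc: doc["costs"][0]["cost"])
test["monthly"] = parsed.apply(lambda doc: doc["policies"][0]["monthly"])

# Keep only the columns asked for in the question.
result = test[["id", "name", "cost", "monthly"]]
print(result)
```

If a row could hold an empty costs or policies list, the `[0]` lookups would raise an IndexError, so real data may need a guard around them.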
【Comments】:
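Thank you! This is exactly what I was looking for, and it works great. Thanks for your help!
【Answer 2】: Pandas provides some utilities for working with JSON. The ones that make sense in your case are pd.read_json and pd.json_normalize (pd.io.json.json_normalize in older pandas versions). However, they expect the input JSON in a different format from yours. From the pd.read_json documentation: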
orient : string,
    Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:
        'split' : dict like index -> [index], columns -> [columns], data -> [values]
        'records' : list like [column -> value, ... , column -> value]
        'index' : dict like index -> column -> value
        'columns' : dict like column -> index -> value
        'values' : just the values array
    The allowed and default values depend on the value of the typ parameter.
    when typ == 'series',
        allowed orients are 'split','records','index'
        default is 'index'
        The Series index must be unique for orient 'index'.
    when typ == 'frame',
        allowed orients are 'split','records','index', 'columns','values'
        default is 'columns'
        The DataFrame index must be unique for orients 'index' and 'columns'.
        The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.
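Since each string in the json column is a single nested object rather than a whole frame in one of these orients, one possible workaround is to decode the strings first and hand the resulting dicts to json_normalize. The sketch below assumes pandas 1.0 or later (where pd.json_normalize is available at the top level) and uses a shortened stand-in for the question's data; copying the id into each record so it can be passed as meta is a choice made for this example.

```python
import json

import pandas as pd

# Shortened stand-in for the question's data (assumption: only a few
# of the original keys are kept to keep the example readable).
test = pd.DataFrame({
    "id": ["10001", "10002"],
    "json": [
        '{"policies":[{"monthly":15.85}],'
        '"agentSuggestion":{"options":{"state":"CT","age":32}}}',
        '{"policies":[{"monthly":23.03}],'
        '"agentSuggestion":{"options":{"state":"GA","age":35}}}',
    ],
})

# json_normalize works on dicts, so decode the strings first and
# copy the row id into each record to keep track of its origin.
records = [json.loads(s) for s in test["json"]]
for rec, row_id in zip(records, test["id"]):
    rec["id"] = row_id

# Nested dicts become dotted column names; lists stay as list values.
flat = pd.json_normalize(records)

# record_path expands a list of dicts into one row per list entry,
# and meta carries the top-level id along with each of those rows.
policies = pd.json_normalize(records, record_path="policies", meta=["id"])

print(flat.columns.tolist())
print(policies)
```

Here flat holds one row per original record, while policies holds one row per policy, which matters once a record contains more than one policy.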
【Comments】: