Pandas json_normalize does not flatten all nested fields

Posted: 2019-06-12 01:17:39

Question:

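I am analyzing a JSON file and want to convert the nested JSON input into a flat dataframe in Python. Is there a built-in way to do this, or should I write a custom function? Could you give an example that solves this problem?

I have already tried the json_normalize function, and I also tried another solution: nested for statements that retrieve the data element by element at each nesting level.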

import pandas as pd
from pandas import json_normalize  # pandas < 1.0: from pandas.io.json import json_normalize

d = pd.read_json('test 1.json', lines=True)
d2 = json_normalize(d['track'])

The second option I tried:

for index, row in d.iterrows():
    for element in row['track']:
        if element == "features":
            print(row['track']['features'])

Contents of the JSON file:

{ "_id" : { "$oid" : "5b9058462f38434ab0d85cd3" }, "user_day_code" : "ead1db07fa526e19fe237115d5516fbdc5acb99057b885e8f662a147990b3c4b", "idplug_base" : 5, "track" : { "type" : "FeatureCollection", "features" : [ { "geometry" : { "type" : "Point", "coordinates" : [ -3.7073786, 40.4237274997222 ] }, "type" : "Feature", "properties" : { "var" : "28015,ES,Madrid,Madrid,CALLE SAN BERNARDO 38,Madrid", "speed" : 1.75, "secondsfromstart" : 205 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.709896, 40.4191897997222 ] }, "type" : "Feature", "properties" : { "var" : "28013,ES,Madrid,Madrid,CUSTA SANTO DOMINGO 6,Madrid", "speed" : 4.63, "secondsfromstart" : 85 } } ] }, "user_type" : 1, "idunplug_base" : 17, "travel_time" : 263, "idunplug_station" : 40, "ageRange" : 0, "idplug_station" : 16, "unplug_hourTime" : { "$date" : "2018-09-01T01:00:00.000+0200" }, "zip_code" : "" }
{ "_id" : { "$oid" : "5b9058462f38434ab0d85ce9" }, "user_day_code" : "420d9e220bd8816681162e15e9afcb1c69c5a756090728701083c5c0b23502f2", "idplug_base" : 12, "track" : { "type" : "FeatureCollection", "features" : [ { "geometry" : { "type" : "Point", "coordinates" : [ -3.7022001, 40.4052982997222 ] }, "type" : "Feature", "properties" : { "var" : "28012,ES,Madrid,Madrid,GTA EMBAJADORES,Madrid", "speed" : 0.33, "secondsfromstart" : 351 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.698618, 40.4061700997222 ] }, "type" : "Feature", "properties" : { "var" : "28012,ES,Madrid,Madrid,RONDA ATOCHA 30,Madrid", "speed" : 6.36, "secondsfromstart" : 291 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.6949231, 40.4072785997222 ] }, "type" : "Feature", "properties" : { "var" : "28012,ES,Madrid,Madrid,RONDA ATOCHA,Madrid", "speed" : 4.77, "secondsfromstart" : 231 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.6920543, 40.4081501 ] }, "type" : "Feature", "properties" : { "var" : "28012,ES,Madrid,Madrid,PLAZA EMPERADOR CARLOS V 1,Madrid", "speed" : 4.38, "secondsfromstart" : 170 } } ] }, "user_type" : 1, "idunplug_base" : 26, "travel_time" : 382, "idunplug_station" : 85, "ageRange" : 2, "idplug_station" : 52, "unplug_hourTime" : { "$date" : "2018-09-01T01:00:00.000+0200" }, "zip_code" : "28009" }
{ "_id" : { "$oid" : "5b9058462f38434ab0d85ced" }, "user_day_code" : "780f5c8157efe8e6dca44dbd689817d4b126364fca917f0e668bad9e7bf96939", "idplug_base" : 1, "track" : { "type" : "FeatureCollection", "features" : [ { "geometry" : { "type" : "Point", "coordinates" : [ -3.69610249972222, 40.427829 ] }, "type" : "Feature", "properties" : { "var" : "28004,ES,Madrid,Madrid,PLAZA ALONSO MARTINEZ,Madrid", "speed" : 6.22, "secondsfromstart" : 200 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.69482799972222, 40.4282634997222 ] }, "type" : "Feature", "properties" : { "var" : "28010,ES,Madrid,Madrid,CALLE FERNANDO EL SANTO 4,Madrid", "speed" : 0, "secondsfromstart" : 140 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.69164359972222, 40.4280088 ] }, "type" : "Feature", "properties" : { "var" : "28010,ES,Madrid,Madrid,CALLE FERNANDO EL SANTO 20,Madrid", "speed" : 5.05, "secondsfromstart" : 80 } } ] }, "user_type" : 1, "idunplug_base" : 11, "travel_time" : 305, "idunplug_station" : 109, "ageRange" : 4, "idplug_station" : 58, "unplug_hourTime" : { "$date" : "2018-09-01T01:00:00.000+0200" }, "zip_code" : "28004" }
{ "_id" : { "$oid" : "5b9058462f38434ab0d85cee" }, "user_day_code" : "a225ab7b4b74954cd9fbe8cc2ec63390cd04e92cdd1a2fe1e58d42faea082b21", "idplug_base" : 1, "track" : { "type" : "FeatureCollection", "features" : [ { "geometry" : { "type" : "Point", "coordinates" : [ -3.72050759972222, 40.4277548 ] }, "type" : "Feature", "properties" : { "var" : "28008,ES,Madrid,Madrid,PASEO PINTOR ROSALES 49P,Madrid", "speed" : 0.86, "secondsfromstart" : 258 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.717881, 40.4274713 ] }, "type" : "Feature", "properties" : { "var" : "28008,ES,Madrid,Madrid,CALLE QUINTANA 17,Madrid", "speed" : 6.75, "secondsfromstart" : 199 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.7142441, 40.4297779997222 ] }, "type" : "Feature", "properties" : { "var" : "28015,ES,Madrid,Madrid,CALLE SERRANO JOVER 4D,Madrid", "speed" : 7.08, "secondsfromstart" : 139 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.71240559972222, 40.4341422997222 ] }, "type" : "Feature", "properties" : { "var" : "28015,ES,Madrid,Madrid,CALLE FERNANDO EL CATOLICO 47A,Madrid", "speed" : 5.25, "secondsfromstart" : 79 } }, { "geometry" : { "type" : "Point", "coordinates" : [ -3.7089558, 40.4340593 ] }, "type" : "Feature", "properties" : { "var" : "28015,ES,Madrid,Madrid,CALLE FERNANDO EL CATOLICO 21,Madrid", "speed" : 5.61, "secondsfromstart" : 19 } } ] }, "user_type" : 1, "idunplug_base" : 1, "travel_time" : 262, "idunplug_station" : 168, "ageRange" : 4, "idplug_station" : 120, "unplug_hourTime" : { "$date" : "2018-09-01T01:00:00.000+0200" }, "zip_code" : "28015" }

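Actual results:
Option 1: does not work; the dataframe stays nested.
Option 2: works, but it is a very convoluted approach.

Expected result: a flat dataframe containing every element of the original JSON.

Example of the expected flat dataframe: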

_id                      user_day_code                                                     idplug_base  track coordinates                  var                                            speed  secondsfromstart   user_type  idunplug_base ...
5b9058462f38434ab0d85ce9 420d9e220bd8816681162e15e9afcb1c69c5a756090728701083c5c0b23502f2  12           1     -3.7022001, 40.4052982997222 28012,ES,Madrid,Madrid,GTA EMBAJADORES,Madrid  0.33   351                1          26            ...
5b9058462f38434ab0d85ce9 420d9e220bd8816681162e15e9afcb1c69c5a756090728701083c5c0b23502f2  12           2      -3.698618, 40.4061700997222 28012,ES,Madrid,Madrid,RONDA ATOCHA 30,Madrid  6.36   291                1          26            ...

...

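Comments:

Looking at the JSON string, it is hard to tell what dataframe shape you expect. Please clarify your expected output.

Added an example of the expected output.

Answer 1: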

You have a JSON lines file. Read it in as a list of dictionaries, then call json_normalize. You will need to do some of the unnesting yourself.

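import pandas as pd
from pandas import json_normalize  # pandas < 1.0: from pandas.io.json import json_normalize

def update(a, b):
    # Merge b into a in place and return a (also works on Python 2.7).
    a.update(b)
    return a

l = pd.read_json('test 1.json', lines=True).to_dict('records')
# One dict per feature, merged with its parent record minus 'track'.
json_normalize([update(y, x) for x in l for y in x.pop('track')['features']])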

First, read the JSON lines file using pd.read_json with the lines=True argument. Then use to_dict(orient='records') to convert the dataframe back into a list of dictionaries.

l = pd.read_json('test 1.json', lines=True).to_dict('records')
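
If the round trip through a dataframe is not needed, the same list of dictionaries can probably be built with the standard library alone. This is a minimal sketch, assuming the file holds one valid JSON object per line; the helper name read_json_lines is made up for illustration and is not part of pandas.

```python
import json


def read_json_lines(path):
    """Return one dict per non-empty line of a JSON lines file."""
    records = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records
```

With this helper, l = read_json_lines('test 1.json') would stand in for the pd.read_json(...).to_dict('records') call.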

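Next, for each dictionary x in l, unnest the entries of x['track']['features'] and attach the rest of the record to each one as metadata.

For example,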

import copy

dct = copy.deepcopy(l[0])
x = dct.pop('track')['features'][0]
r = {**x, **dct}  # Python 3.5+; on Python 2.7 use the update() helper instead
# {'_id': {'$oid': '5b9058462f38434ab0d85cd3'},
#  'ageRange': 0,
#  'geometry': {'coordinates': [-3.7073786, 40.4237274997222], 'type': 'Point'},
#  'idplug_base': 5,
#  'idplug_station': 16,
#   ...
#  'user_day_code': 'ead1db07fa526e19fe237115d5516fbdc5acb99057b885e8f662a147990b3c4b',
#  'user_type': 1,
#  'zip_code': ''}
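
The single-record example extends to the whole file with a plain loop. The sketch below assumes every record has a track dict holding a features list, as in the sample data. flatten_tracks is a hypothetical helper name, and the two records are trimmed-down stand-ins for the real file (most fields omitted). It avoids the {**x, **dct} syntax, so it should also run on Python 2.7.

```python
import copy


def flatten_tracks(records):
    """Return one dict per feature, merged with its parent record's fields."""
    rows = []
    for record in records:
        meta = copy.deepcopy(record)       # leave the caller's data untouched
        features = meta.pop('track')['features']
        for feature in features:
            row = dict(feature)            # shallow copy of the feature
            row.update(meta)               # add the trip-level fields
            rows.append(row)
    return rows


# Trimmed-down stand-ins for two records of the sample file.
sample = [
    {'_id': {'$oid': '5b9058462f38434ab0d85cd3'}, 'idplug_base': 5,
     'track': {'type': 'FeatureCollection', 'features': [
         {'type': 'Feature', 'properties': {'speed': 1.75}},
         {'type': 'Feature', 'properties': {'speed': 4.63}}]}},
    {'_id': {'$oid': '5b9058462f38434ab0d85ce9'}, 'idplug_base': 12,
     'track': {'type': 'FeatureCollection', 'features': [
         {'type': 'Feature', 'properties': {'speed': 0.33}}]}},
]

rows = flatten_tracks(sample)
print(len(rows))  # 3: one row per feature
```

Each row still contains nested dicts such as properties and _id; those are what json_normalize expands into dotted column names.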

We build a list of these flattened sub-dictionaries, and json_normalize handles the rest:

json_normalize([r])

                   _id.$oid  ageRange   ...    user_type zip_code
0  5b9058462f38434ab0d85cd3         0   ...            1         

json_normalize([r]).iloc[0].T

_id.$oid                                                5b9058462f38434ab0d85cd3
ageRange                                                                       0
geometry.coordinates                              [-3.7073786, 40.4237274997222]
geometry.type                                                              Point
...
user_day_code                  ead1db07fa526e19fe237115d5516fbdc5acb99057b885...
user_type                                                                      1
zip_code                                                                        
Name: 0, dtype: object
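
As an alternative to unnesting by hand, json_normalize can walk into the nested list itself through its record_path and meta parameters. This is a different technique from the one in the answer, sketched here under the assumption of pandas 1.0 or later (where the function is exposed as pd.json_normalize). The records are trimmed-down stand-ins built from values in the sample file, and only top-level keys are listed in meta.

```python
import pandas as pd

# Trimmed-down stand-ins for two records of the sample file.
records = [
    {'idplug_base': 5, 'travel_time': 263,
     'track': {'type': 'FeatureCollection', 'features': [
         {'geometry': {'type': 'Point',
                       'coordinates': [-3.7073786, 40.4237274997222]},
          'type': 'Feature',
          'properties': {'speed': 1.75, 'secondsfromstart': 205}},
         {'geometry': {'type': 'Point',
                       'coordinates': [-3.709896, 40.4191897997222]},
          'type': 'Feature',
          'properties': {'speed': 4.63, 'secondsfromstart': 85}}]}},
    {'idplug_base': 12, 'travel_time': 382,
     'track': {'type': 'FeatureCollection', 'features': [
         {'geometry': {'type': 'Point',
                       'coordinates': [-3.7022001, 40.4052982997222]},
          'type': 'Feature',
          'properties': {'speed': 0.33, 'secondsfromstart': 351}}]}},
]

# record_path points at the list to explode into rows;
# meta names the top-level fields to repeat on every row.
flat = pd.json_normalize(
    records,
    record_path=['track', 'features'],
    meta=['idplug_base', 'travel_time'],
)
print(flat.columns.tolist())
```

The coordinates column still holds a two-element list per row, so splitting it into separate longitude and latitude columns would be an extra step.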

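Comments:

Tested, but it does not work. I get this error: File "", line 5 json_normalize([{**y, **x} for x in l for y in x.pop('track')['features']]) ^ SyntaxError: invalid syntax

When I tagged this question, I was looking for a solution in Python 2.7.x. Client requirement.

It works. Could you break the proposed code down and explain it? Basically, what does the argument passed to json_normalize mean?

@Carlos Edited in some examples that should be easier to follow.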
