Save and export dtypes information of a python pandas dataframe

Posted 2023-03-12


【Title】Save and export dtypes information of a python pandas dataframe 【Posted】2018-10-29 13:11:08 【Question】:

I have a pandas DataFrame named df. With df.dtypes I can print this on screen:

arrival_time      object
departure_time    object
drop_off_type      int64
extra             object
pickup_type        int64
stop_headsign     object
stop_id           object
stop_sequence      int64
trip_id           object
dtype: object

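I want to save this information so that I can compare it with other data, do type casting elsewhere, and so on. I want to save it to a local file and restore it in another program, somewhere the data itself cannot go. But I cannot figure out how. Below are the results of the conversions I tried.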

df.dtypes.to_dict()
{'arrival_time': dtype('O'),
 'departure_time': dtype('O'),
 'drop_off_type': dtype('int64'),
 'extra': dtype('O'),
 'pickup_type': dtype('int64'),
 'stop_headsign': dtype('O'),
 'stop_id': dtype('O'),
 'stop_sequence': dtype('int64'),
 'trip_id': dtype('O')}
----
df.dtypes.to_json()
'{"arrival_time":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"departure_time":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"drop_off_type":{"alignment":4,"byteorder":"=","descr":[["","<i8"]],"flags":0,"isalignedstruct":false,"isnative":true,"kind":"i","name":"int64","ndim":0,"num":9,"str":"<i8"},"extra":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"pickup_type":{"alignment":4,"byteorder":"=","descr":[["","<i8"]],"flags":0,"isalignedstruct":false,"isnative":true,"kind":"i","name":"int64","ndim":0,"num":9,"str":"<i8"},"stop_headsign":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"stop_id":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"},"stop_sequence":{"alignment":4,"byteorder":"=","descr":[["","<i8"]],"flags":0,"isalignedstruct":false,"isnative":true,"kind":"i","name":"int64","ndim":0,"num":9,"str":"<i8"},"trip_id":{"alignment":4,"byteorder":"|","descr":[["","|O"]],"flags":63,"isalignedstruct":false,"isnative":true,"kind":"O","name":"object","ndim":0,"num":17,"str":"|O"}}'
----
json.dumps( df.dtypes.to_dict() )
...
TypeError: dtype('O') is not JSON serializable

----
list(df.dtypes)
[dtype('O'),
 dtype('O'),
 dtype('int64'),
 dtype('O'),
 dtype('int64'),
 dtype('O'),
 dtype('O'),
 dtype('int64'),
 dtype('O')]

How can I save and export/archive the dtype information of a pandas DataFrame?


【Answer 1】:

pd.DataFrame.dtypes returns a pd.Series object. This means you can manipulate it like any regular Series in pandas:

import pandas as pd

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})

res = df.dtypes.to_frame('dtypes').reset_index()

print(res)

  index   dtypes
0     A   object
1     B  float64
2     C    int64
3     D     bool

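Output to csv/excel/pickle

You can then use any method normally used to store a DataFrame, such as to_csv, to_excel, to_pickle, etc. Note that distributing pickle files is not recommended, since the format is version-dependent.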
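As a minimal sketch of the CSV route, the snippet below writes the two-column frame and reads it back into a plain dict of dtype names. The in-memory buffer is only a stand-in for a real path such as 'dtypes.csv' (a hypothetical file name), and the name dtype_map is introduced here for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})
res = df.dtypes.to_frame('dtypes').reset_index()

# StringIO stands in for a file on disk; a path string would work the same way.
buf = io.StringIO()
res.to_csv(buf, index=False)  # dtype objects are written as their string names

# Read the two columns back and rebuild a {column name: dtype name} mapping.
buf.seek(0)
loaded = pd.read_csv(buf)
dtype_map = dict(zip(loaded['index'], loaded['dtypes']))
print(dtype_map)
```

Only the column names and dtype names are stored, so none of the actual data has to travel with the file.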

Output to json

If you want an easy way to store and load this as a dictionary, json is a popular format. As you found, you need to convert to str type first:

import json

# first create dictionary
d = res.set_index('index')['dtypes'].astype(str).to_dict()

with open('types.json', 'w') as f:
    json.dump(d, f)

with open('types.json', 'r') as f:
    data_types = json.load(f)

print(data_types)

{'A': 'object', 'B': 'float64', 'C': 'int64', 'D': 'bool'}
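The question also mentions comparing against other data and casting elsewhere. One way to use the loaded dictionary for that is sketched below. The frame named other and its values are made up for illustration, and data_types is written out inline so the snippet runs without types.json.

```python
import pandas as pd

# Same mapping as the one loaded from types.json above.
data_types = {'A': 'object', 'B': 'float64', 'C': 'int64', 'D': 'bool'}

# Hypothetical frame whose column types drifted from the saved ones.
other = pd.DataFrame({'A': ['x'], 'B': [2], 'C': [3.0], 'D': [1]})

# Compare: collect columns whose current dtype name differs from the saved one.
current = other.dtypes.astype(str).to_dict()
mismatched = {col: (current[col], saved)
              for col, saved in data_types.items()
              if current[col] != saved}
print(mismatched)

# Cast: DataFrame.astype accepts a {column: dtype name} mapping.
restored = other.astype(data_types)
print(restored.dtypes)
```

Since astype takes dtype names as strings, the JSON file can be used directly without rebuilding dtype objects first.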

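【Comments】:

Thanks! df.dtypes.to_frame('dtypes').reset_index() is what I was looking for: a way to make "intangible" information "tangible"! And json is exactly how I intended to store it. Also, thanks for showing a way that doesn't involve the actual data.

Thanks. This works. I thought there would be a more convenient way to do this, since it seems like a common use case.

For converting the dtypes Series to json, I think this is clearer: df.dtypes.apply(lambda x: x.name).to_dict(), as seen in this answer ***.com/questions/41087887/…

【Answer 2】: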

You can use the pickle format.

# save
df.to_pickle(file_name)

# load
df = pandas.read_pickle(file_name)

Here is the documentation
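The snippet above pickles the whole DataFrame. If only the type information should travel, a possible variant (not part of the answer as posted) is to pickle the dtypes Series on its own. The temporary path below is a hypothetical location chosen for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'dtypes.pkl')

# df.dtypes is a Series, so it has to_pickle like any other Series.
df.dtypes.to_pickle(path)

# Loading returns the Series of dtype objects, not just their names.
restored = pd.read_pickle(path)
print(restored)
```

This keeps the real dtype objects rather than strings, but the file is still subject to the usual pickle caveats about versions and trusted sources.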


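【Answer 3】:

I found myself putting the dtype information at the beginning of the CSV file. Reading it back before the DataFrame is trivial, which makes this rather nice.

Example DataFrame (shamelessly copied from @jpp's answer):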

df = pd.DataFrame({'A': [''], 'B': [1.0], 'C': [1], 'D': [True]})

To save, I do this:

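with open('test.csv', 'wt') as f:
    # first line: one dtype name per column, with a leading empty field for the index
    f.write(',' + ','.join(map(str, df.dtypes)) + '\n')
    # lineterminator is the spelling in pandas >= 1.5; older versions use line_terminator
    df.to_csv(f, lineterminator='\n')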

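I added the extra comma here for the index column because I wanted to write the index. In general you don't have to.

Reading is now 4 lines instead of 1, but arguably more precise.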

with open('test.csv', 'rt') as f:
    types = next(f).rstrip().split(',')[1:]
    columns = next(f).rstrip().split(',')[1:]
    test = pd.read_csv(f, dtype=dict(zip(columns, types)), index_col=0, names=columns)

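I ran into this problem with catalog searches of astronomical data, where many text fields were missing and were incorrectly loaded as floating-point NaN. An alternative is to set low_memory=False on read_csv, but that makes things more implicit rather than explicit.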
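To illustrate the missing-text-field problem, here is a small self-contained sketch. It is a variant of the approach above rather than the exact code: the index is not written, an in-memory buffer replaces test.csv, and the column names are invented for the example.

```python
import io

import pandas as pd

# 'note' is a text column that happens to be entirely missing.
df = pd.DataFrame({
    'name': ['a', 'b'],
    'note': pd.Series([None, None], dtype='object'),
    'n': [1, 2],
})

# Write a dtype line first, then the ordinary CSV (header + rows).
buf = io.StringIO()
buf.write(','.join(map(str, df.dtypes)) + '\n')
df.to_csv(buf, index=False)

# Naive read: skip the dtype line and let pandas infer the types.
buf.seek(0)
next(buf)
naive = pd.read_csv(buf)
print(naive['note'].dtype)

# Explicit read: consume the dtype line and the header, then pass both on.
buf.seek(0)
types = next(buf).strip().split(',')
columns = next(buf).strip().split(',')
restored = pd.read_csv(buf, names=columns, dtype=dict(zip(columns, types)))
print(restored['note'].dtype)
```

With inference, the all-empty column comes back as a float column of NaN. With the stored dtype line it stays an object column, which is the behaviour the answer is after.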

