在 Pandas 中更改列的数据类型

Posted 2020-11-10 xinet

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了在 Pandas 中更改列的数据类型相关的知识，希望对你有一定的参考价值。

import pandas as pd
import numpy as np
a = [[‘a‘, ‘1.2‘, ‘4.2‘], [‘b‘, ‘70‘, ‘0.03‘], [‘x‘, ‘5‘, ‘0‘]]
df = pd.DataFrame(a)

df.dtypes

0    object
1    object
2    object
dtype: object

数据框（data.frame）是最常用的数据结构，用于存储二维表（即关系表）的数据，每一列存储的数据类型必须相同，不同数据列的数据类型可以相同，也可以不同，但是每列的行数（长度）必须相同。数据框的每列都有唯一的名字，在已创建的数据框上，用户可以添加计算列。

1 创建 DataFrame 时指定类型

如果要创建一个 DataFrame，可以直接通过 dtype 参数指定类型：

 df = pd.DataFrame(data=np.arange(100).reshape((10,10)), dtype=np.int8) 
df.dtypes

0    int8
1    int8
2    int8
3    int8
4    int8
5    int8
6    int8
7    int8
8    int8
9    int8
dtype: object

2 对于 `Series`

s = pd.Series([‘1‘, ‘2‘, ‘4.7‘, ‘pandas‘, ‘10‘])
s

0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

使用 `to_numeric` 转为数值

默认情况下，它不能处理字母型的字符串‘pandas‘

pd.to_numeric(s) # or pd.to_numeric(s, errors=‘raise‘);

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()


ValueError: Unable to parse string "pandas"


During handling of the above exception, another exception occurred:


ValueError                                Traceback (most recent call last)

<ipython-input-24-12f1203e2645> in <module>()
----> 1 pd.to_numeric(s) # or pd.to_numeric(s, errors=‘raise‘);


C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\lib\site-packages\pandas\core\tools\numeric.py in to_numeric(arg, errors, downcast)
    131             coerce_numeric = False if errors in (‘ignore‘, ‘raise‘) else True
    132             values = lib.maybe_convert_numeric(values, set(),
--> 133                                                coerce_numeric=coerce_numeric)
    134 
    135     except Exception:


pandas/_libs/src/inference.pyx in pandas._libs.lib.maybe_convert_numeric()


ValueError: Unable to parse string "pandas" at position 3

可以将无效值强制转换为NaN，如下所示：

pd.to_numeric(s, errors=‘coerce‘)

0     1.0
1     2.0
2     4.7
3     NaN
4    10.0
dtype: float64

如果遇到无效值，第三个选项就是忽略该操作：

pd.to_numeric(s, errors=‘ignore‘)

0         1
1         2
2       4.7
3    pandas
4        10
dtype: object

3 对于多列或者整个 DataFrame

如果想要将这个操作应用到多个列，依次处理每一列是非常繁琐的，所以可以使用 DataFrame.apply 处理每一列。

a = [[‘a‘, ‘1.2‘, ‘4.2‘], [‘b‘, ‘70‘, ‘0.03‘], [‘x‘, ‘5‘, ‘0‘]]
df = pd.DataFrame(a, columns=[‘col1‘,‘col2‘,‘col3‘])
df

	col1	col2	col3
0	a	1.2	4.2
1	b	70	0.03
2	x	5	0

df[[‘col2‘,‘col3‘]] = df[[‘col2‘,‘col3‘]].apply(pd.to_numeric)
df.dtypes

col1     object
col2    float64
col3    float64
dtype: object

这里「col2」和「col3」根据需要具有 float64 类型

`df.apply(pd.to_numeric, errors=‘ignore‘)`

该函数将被应用于整个DataFrame，可以转换为数字类型的列将被转换，而不能(例如，它们包含非数字字符串或日期)的列将被单独保留。

另外 `pd.to_datetime` 和 `pd.to_timedelta` 可将数据转换为日期和时间戳。

软转换——类型自动推断

infer_objects() 方法，用于将具有对象数据类型的 DataFrame 的列转换为更具体的类型。

df = pd.DataFrame({‘a‘: [7, 1, 5], ‘b‘: [‘3‘,‘2‘,‘1‘]}, dtype=‘object‘)
df.dtypes

a    object
b    object
dtype: object

然后使用 infer_objects()，可以将列 ‘a‘ 的类型更改为 int64：

df = df.infer_objects()
df.dtypes

a     int64
b    object
dtype: object

`astype` 强制转换

如果试图强制将两列转换为整数类型，可以使用 df.astype(int)。

a = [[‘a‘, ‘1.2‘, ‘4.2‘], [‘b‘, ‘70‘, ‘0.03‘], [‘x‘, ‘5‘, ‘0‘]]
df = pd.DataFrame(a, columns=[‘one‘, ‘two‘, ‘three‘])
df.dtypes

one      object
two      object
three    object
dtype: object

df[[‘two‘, ‘three‘]] = df[[‘two‘, ‘three‘]].astype(float)
df.dtypes

one       object
two      float64
three    float64
dtype: object

以上是关于在 Pandas 中更改列的数据类型的主要内容，如果未能解决你的问题，请参考以下文章

在 Pandas 中更改列的数据类型

1 创建 DataFrame 时指定类型

2 对于 Series

使用 to_numeric 转为数值