Translation: Data Preprocessing with pandas and PyTorch

Posted 2022-03-12 AI架构师易筋

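So far we have introduced a variety of techniques for manipulating data that are already stored in tensors. To apply deep learning to real-world problems, we usually begin by preprocessing raw data rather than with data that has already been neatly prepared in the tensor format. Among the popular data analysis tools in Python, the pandas package is commonly used. Like many other extension packages in the vast Python ecosystem, pandas can work together with tensors. We will therefore briefly walk through the steps for preprocessing raw data with pandas and converting it into the tensor format. More data preprocessing techniques are covered in later chapters.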

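2.2.1 Reading the Dataset

As an example, we begin by creating an artificial dataset that is stored in a csv (comma-separated values) file, …/data/house_tiny.csv. Data stored in other formats can be processed in a similar way.

Below we write the dataset row by row into a csv file.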

import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')

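To load the raw dataset from the csv file we just created, we import the pandas package and call the read_csv function. The dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.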

# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
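
read_csv turned the literal string NA into NaN because NA is one of the markers pandas treats as missing by default. The following is a minimal sketch of how the missing entries could be counted per column. It assumes the same four rows held in an in-memory string (csv_text is a name introduced here for illustration) rather than in a file on disk.

```python
import io

import pandas as pd

# Same contents as house_tiny.csv, held in memory for illustration
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)

data = pd.read_csv(io.StringIO(csv_text))

# isna() marks missing cells; summing per column counts them
missing_per_column = data.isna().sum()
print(missing_per_column)
```

Counting the gaps first makes it easier to decide, column by column, whether imputation or deletion is the more sensible choice.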

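2.2.2 Handling Missing Data

Note that “NaN” entries are missing values. Typical methods for handling missing data include imputation and deletion, where imputation replaces missing values with substituted ones, while deletion ignores missing values. Here we will consider imputation.

By integer-location based indexing (iloc), we split data into inputs and outputs, where the former takes the first two columns and the latter keeps only the last column. For numerical values in inputs that are missing, we replace the “NaN” entries with the mean value of the same column.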

inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
print(inputs)
print(outputs)
# numeric_only=True skips the string column "Alley" when computing the means
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(inputs)

   NumRooms Alley
0       NaN  Pave
1       2.0   NaN
2       4.0   NaN
3       NaN   NaN
0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64
   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0   NaN
3       3.0   NaN
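
To make the difference between the two strategies concrete, here is a small sketch that applies both deletion and imputation to one numeric column. The rooms DataFrame is a hypothetical stand-in built for this example, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries
rooms = pd.DataFrame({"NumRooms": [np.nan, 2.0, 4.0, np.nan]})

# Deletion: drop every row that contains a missing value
deleted = rooms.dropna()

# Imputation: replace missing values with the mean of the observed ones
imputed = rooms.fillna(rooms.mean())

print(len(deleted))
print(imputed["NumRooms"].tolist())
```

Deletion halves this tiny dataset, which is one reason imputation is often preferred when examples are scarce.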

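For categorical or discrete values in inputs, we treat “NaN” as a category. Since the “Alley” column only takes two types of categorical values, “Pave” and “NaN”, pandas can automatically convert this column into two columns, “Alley_Pave” and “Alley_nan”. A row whose alley type is “Pave” sets the values of “Alley_Pave” and “Alley_nan” to 1 and 0. A row with a missing alley type sets them to 0 and 1.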

# dtype=int yields 0/1 indicator columns (newer pandas versions default to booleans)
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
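
The same encoding extends to columns with more than one observed category. The sketch below assumes a hypothetical second alley type, “Grvl”, which does not appear in the original dataset, and checks that each row ends up with exactly one indicator set.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two observed categories plus a missing value
alley = pd.DataFrame({"Alley": ["Pave", "Grvl", np.nan, "Pave"]})

# dummy_na=True adds an indicator column for missing entries;
# dtype=int requests 0/1 integers instead of booleans
encoded = pd.get_dummies(alley, dummy_na=True, dtype=int)

print(encoded)
# Every row belongs to exactly one category, so each row sums to 1
print(encoded.sum(axis=1).tolist())
```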

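2.2.3 Conversion to the Tensor Format

Now that all the entries in inputs and outputs are numerical, they can be converted to the tensor format. Once the data are in this format, they can be further manipulated with the tensor functionalities that we introduced in Section 2.1.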

import torch

X, y = torch.tensor(inputs.values), torch.tensor(outputs.values)
X, y

(tensor([[3., 1., 0.],
         [2., 0., 1.],
         [4., 0., 1.],
         [3., 0., 1.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
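
The float64 dtype of X is inherited from the NumPy array behind the DataFrame, whereas PyTorch's default floating-point dtype is float32. If float32 features are wanted, one option is to request the dtype during conversion. The sketch below uses a small hypothetical array (features) in place of inputs.values.

```python
import numpy as np
import torch

# Hypothetical stand-in for inputs.values after preprocessing
features = np.array([[3.0, 1.0, 0.0],
                     [2.0, 0.0, 1.0]])

X64 = torch.tensor(features)                       # keeps NumPy's float64
X32 = torch.tensor(features, dtype=torch.float32)  # casts during conversion

print(X64.dtype, X32.dtype)
```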

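2.2.4 Summary

Like many other extension packages in the vast Python ecosystem, pandas can work together with tensors.

Imputation and deletion can be used to handle missing data.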

2.2.5 Exercises

Create a raw dataset with more rows and columns.

Delete the column with the most missing values.

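print(data)
# Largest number of missing values found in any single column
m = data.isnull().sum(axis=0).max()
print(m)
# Keep only the columns with at least len(data) + 1 - m non-missing values,
# i.e. drop every column that has m missing values
data_dropmaxnan = data.dropna(axis=1, thresh=len(data) + 1 - m)
print(data_dropmaxnan)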

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0   NaN  178100
3       NaN   NaN  140000
3
   NumRooms   Price
0       NaN  127500
1       2.0  106000
2       4.0  178100
3       NaN  140000
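
An alternative sketch for the same exercise looks up the name of the worst column with idxmax and drops it by label. The DataFrame is rebuilt inline here so the example runs on its own, and worst_column and reduced are names introduced for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

# Label of the column with the largest count of missing values
worst_column = data.isna().sum().idxmax()

# Drop that single column by label
reduced = data.drop(columns=worst_column)

print(worst_column)
print(reduced.columns.tolist())
```

The two approaches differ when several columns tie for the most missing values: idxmax returns only the first such column, while the thresh-based version removes all of them.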

Convert the preprocessed dataset to the tensor format.

1. What is the best way to read PyTorch's source code? Please give me some tips.

Here are some official API documents that may be helpful.

2. How do I loop over a DataFrame's columns? I'm trying to use a loop to calculate data.isnull().sum().

There are a vast number of tutorials for pandas. You can just search online. Here is the official guide.
https://pandas.pydata.org/docs/user_guide/index.html#user-guide
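
For the second question, one possible approach is to iterate over data.columns and count the missing values in each column in turn. This is only a sketch on a small DataFrame rebuilt for the example, and missing_counts is a name introduced here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
})

# data.columns holds the column labels; index the DataFrame with each one
missing_counts = {}
for column in data.columns:
    missing_counts[column] = int(data[column].isnull().sum())

print(missing_counts)
```

The loop produces the same counts as the vectorized data.isnull().sum(), which is usually the shorter way to write it.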

References

https://d2l.ai/chapter_preliminaries/pandas.html
