数据清洗之 缺失值处理
Posted wx62c62b36cedf9
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了数据清洗之 缺失值处理相关的知识,希望对你有一定的参考价值。
缺失值处理
- 缺失值首先需要根据实际情况定义
- 可以采取直接删除法
- 有时候需要使用替换法或者插值法
- 常用的替换法有均值替换、前向、后向替换和常数替换
import pandas as pd
import numpy as np
import
os.getcwd()
D:\\\\Jupyter\\\\notebook\\\\Python数据清洗实战\\\\数据清洗之数据预处理
os.chdir(D:\\\\Jupyter\\\\notebook\\\\Python数据清洗实战\\\\数据)
df = pd.read_csv(MotorcycleData.csv, encoding=gbk, na_values=Na)
def f(x):
if $ in str(x):
x = str(x).strip($)
x = str(x).replace(,, )
else:
x = str(x).replace(,, )
return float(x)
df[Price] = df[Price].apply(f)
df[Mileage] = df[Mileage].apply(f)
# 计算缺失比例
df.apply(lambda x: sum(x.isnull())/len(x), axis=0)
Condition 0.000000
Condition_Desc 0.778994
Price 0.000000
Location 0.000267
Model_Year 0.000534
Mileage 0.003470
Exterior_Color 0.095422
Make 0.000534
Warranty 0.318297
Model 0.016415
Sub_Model 0.676231
Type 0.197785
Vehicle_Title 0.964233
OBO 0.008808
Feedback_Perc 0.117710
Watch_Count 0.530629
N_Reviews 0.000801
Seller_Status 0.083411
Vehicle_Tile 0.007207
Auction 0.002269
Buy_Now 0.031630
Bid_Count 0.707727
dtype: float64
df.head(3)
Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count | |
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
3 rows × 22 columns
# how = all, 只有当前行都是缺失值才删除
# how = any, 只要当前行有一个缺失值就删除
df.dropna(how = any, axis=0)
Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
0 rows × 22 columns
# subset 根据指定字段判断
# df.dropna(how=any, subset=[Condition, Price, Mileage])
# 缺失值使用0填补
df.fillna(0).head(5)
Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count | |
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | 0 | FALSE | 8.1 | 0 | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | 0 | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
2 | Used | 0 | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | 0 | FALSE | 100 | 0 | 136 | 0 | Clear | True | FALSE | 26.0 |
3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | 0 | Touring | ... | 0 | FALSE | 100 | 0 | 2920 | Dealer | Clear | True | FALSE | 11.0 |
数据清洗之 重复值处理 |