Data Cleaning: Handling Missing Values

Posted wx62c62b36cedf9



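Handling missing values

  • What counts as a missing value has to be defined first, based on the actual data
  • Rows with missing values can simply be deleted
  • Sometimes replacement or interpolation is the better choice
  • Common replacement methods are mean replacement, forward fill, backward fill, and constant replacement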
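The first point, defining what "missing" means for a given source, can be sketched with the na_values parameter of pd.read_csv. The CSV text and the "Unknown" marker below are made up for illustration; only the 'Na' marker comes from this article's own data.

```python
import io
import pandas as pd

# Hypothetical CSV text: "Na" and "Unknown" stand in for source-specific
# markers that should be treated as missing.
raw = io.StringIO(
    "Model,Price,Color\n"
    "Touring,11412,Black\n"
    "R-Series,Na,Unknown\n"
)

# na_values lists extra strings that pandas should parse as NaN,
# in addition to its built-in defaults such as "NA" and "NaN".
demo = pd.read_csv(raw, na_values=["Na", "Unknown"])

# Count the missing values per column after parsing
missing_counts = demo.isnull().sum()
print(missing_counts)
```

Declaring the markers at load time keeps numeric columns numeric, instead of leaving placeholder strings that have to be cleaned up later.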
import pandas as pd
import numpy as np
import os

os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据预处理'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
# Treat the string 'Na' as a missing value when reading the file
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')

# Strip the dollar sign and the thousands separators, then convert to float
def f(x):
    if '$' in str(x):
        x = str(x).strip('$')
        x = str(x).replace(',', '')
    else:
        x = str(x).replace(',', '')
    return float(x)

df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)
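As an alternative to the element-wise function f, the same cleanup can probably be expressed with vectorized string methods. The sketch below runs on a made-up Series rather than the real Price column, and it differs slightly from f: the regular expression removes "$" anywhere in the string, not only at the ends.

```python
import pandas as pd

# Hypothetical raw values in the same style as the Price/Mileage columns
raw_price = pd.Series(["$11,412", "17,200", "3872", None])

# Remove every "$" and "," from each string, then convert to numbers.
# errors="coerce" turns anything that still cannot be parsed into NaN.
cleaned = pd.to_numeric(
    raw_price.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(cleaned.tolist())
```

Missing entries stay missing after the conversion, so the missing-value statistics computed afterwards are not distorted.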
# Compute the fraction of missing values in each column
df.apply(lambda x: sum(x.isnull())/len(x), axis=0)
Condition 0.000000
Condition_Desc 0.778994
Price 0.000000
Location 0.000267
Model_Year 0.000534
Mileage 0.003470
Exterior_Color 0.095422
Make 0.000534
Warranty 0.318297
Model 0.016415
Sub_Model 0.676231
Type 0.197785
Vehicle_Title 0.964233
OBO 0.008808
Feedback_Perc 0.117710
Watch_Count 0.530629
N_Reviews 0.000801
Seller_Status 0.083411
Vehicle_Tile 0.007207
Auction 0.002269
Buy_Now 0.031630
Bid_Count 0.707727
dtype: float64
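The per-column missing ratio can also be written as isnull().mean(), since the mean of a boolean column is the share of True values. The sketch below uses a small made-up frame (the column names are borrowed from the dataset, the values are not) to check that both forms agree.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "Price": [11412.0, 17200.0, 3872.0, 6575.0],
    "Watch_Count": [np.nan, 17.0, np.nan, np.nan],
})

# Form used in the article: count the nulls and divide by the column length
ratio_apply = toy.apply(lambda x: sum(x.isnull()) / len(x), axis=0)

# Shorter equivalent: the mean of the boolean null mask
ratio_mean = toy.isnull().mean()

# Sorting puts the columns with the most missing data first
print(ratio_mean.sort_values(ascending=False))
```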
df.head(3)



  | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0
2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0

3 rows × 22 columns

# how='all': drop a row only if every value in it is missing
# how='any': drop a row as soon as it contains a single missing value
df.dropna(how='any', axis=0)



  | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count

0 rows × 22 columns
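With how='any' applied across all 22 columns, no rows survive here, so a narrower rule is usually more practical. The sketch below shows two such rules on a made-up frame: restricting the check to key columns with subset, and keeping rows that have a minimum number of non-missing values with thresh. The data and the choice of key columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "Condition": ["Used", "Used", None, "New"],
    "Price": [11412.0, np.nan, 3872.0, 6575.0],
    "Bid_Count": [28.0, np.nan, 26.0, np.nan],
})

# Strictest rule: any missing value removes the row
by_any = toy.dropna(how="any")

# Only look at the columns that must be present for the analysis
by_subset = toy.dropna(subset=["Condition", "Price"])

# Keep rows that have at least 2 non-missing values
by_thresh = toy.dropna(thresh=2)

print(len(by_any), len(by_subset), len(by_thresh))
```

None of these calls modifies toy in place; each returns a new DataFrame, just like the df.dropna call above.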

# subset: check for missing values only in the listed columns
# df.dropna(how='any', subset=['Condition', 'Price', 'Mileage'])
# Fill missing values with 0
df.fillna(0).head(5)



  | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | 0 | FALSE | 8.1 | 0 | 2427 | Private Seller | Clear | True | FALSE | 28.0
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | 0 | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0
2 | Used | 0 | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | 0 | FALSE | 100 | 0 | 136 | 0 | Clear | True | FALSE | 26.0
3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | 0 | Touring | ... | 0 | FALSE | 100 | 0 | 2920 | Dealer | Clear | True | FALSE | 11.0
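Filling with 0 is only one of the replacement methods listed at the top. The other ones (mean replacement, forward fill, backward fill) and linear interpolation can be sketched on a small made-up Series. Which method is appropriate depends on the column, so this is an illustration of the calls rather than a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy numeric series with gaps, for illustration only
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan])

# Mean replacement: the mean is computed from the observed values (25.0 here)
mean_filled = s.fillna(s.mean())

# Forward fill: carry the last observed value forward
forward_filled = s.ffill()

# Backward fill: use the next observed value; a trailing gap stays NaN
backward_filled = s.bfill()

# Constant replacement, as in the article's fillna(0)
constant_filled = s.fillna(0)

# Linear interpolation between neighbouring observed values
interpolated = s.interpolate()

print(pd.DataFrame({
    "mean": mean_filled,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "constant": constant_filled,
    "interpolate": interpolated,
}))
```

Forward and backward fill assume the row order is meaningful (for example a time series), while mean replacement keeps the column mean unchanged but shrinks its variance.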
