数据清洗之 重复值处理
Posted wx62c62b36cedf9
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了数据清洗之 重复值处理相关的知识,希望对你有一定的参考价值。
重复值处理
- 数据清洗一般先从重复值和缺失值开始处理
- 重复值一般采取删除法来处理
- 但有些重复值不能删除,例如订单明细数据或交易明细数据等
import pandas as pd
import numpy as np
import
os.getcwd()
D:\\\\Jupyter\\\\notebook\\\\Python数据清洗实战\\\\数据清洗之数据预处理
os.chdir(D:\\\\Jupyter\\\\notebook\\\\Python数据清洗实战\\\\数据)
df = pd.read_csv(MotorcycleData.csv, encoding=gbk, na_values=Na)
df.head(5)
Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count | |
0 | Used | mint!!! very low miles | $11,412 | McHenry, Illinois, United States | 2013.0 | 16,000 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 | Used | Perfect condition | $17,200 | Fort Recovery, Ohio, United States | 2016.0 | 60 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
2 | Used | NaN | $3,872 | Chicago, Illinois, United States | 1970.0 | 25,763 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
3 | Used | CLEAN TITLE READY TO RIDE HOME | $6,575 | Green Bay, Wisconsin, United States | 2009.0 | 33,142 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
4 | Used | NaN | $10,000 | West Bend, Wisconsin, United States | 2012.0 | 17,800 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |
5 rows × 22 columns
def f(x):
if $ in str(x):
x = str(x).strip($)
x = str(x).replace(,, )
else:
x = str(x).replace(,, )
return float(x)
df[Price] = df[Price].apply(f)
df[Mileage] = df[Mileage].apply(f)
df.head(5)
Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count | |
0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | 大数据项目2(数据挖掘之数据预处理相关概念) |