ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
【Posted】: 2017-11-11 01:18:25
【Question】:
I am working through the Loan Prediction practice problem and trying to fill in the missing values in my data. I got the data from here. To work through the problem, I followed this tutorial.
You can find the entire code I am using (the file is named model.py) and the data on GitHub here.
The DataFrame looks like this:
df[['Loan_ID', 'Self_Employed', 'Education', 'LoanAmount']].head(10)
Out:
Loan_ID Self_Employed Education LoanAmount
0 LP001002 No Graduate NaN
1 LP001003 No Graduate 128.0
2 LP001005 Yes Graduate 66.0
3 LP001006 No Not Graduate 120.0
4 LP001008 No Graduate 141.0
5 LP001011 Yes Graduate 267.0
6 LP001013 No Not Graduate 95.0
7 LP001014 No Graduate 158.0
8 LP001018 No Graduate 168.0
9 LP001020 No Graduate 349.0
After the last line below is executed (it corresponds to line 60 of model.py):
url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url)
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)
table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'],x['Education']]
# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
I get this error:
ValueError Traceback (most recent call last)
<ipython-input-40-5146e49c2460> in <module>()
----> 1 df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
2368 axis=axis, inplace=inplace,
2369 limit=limit, downcast=downcast,
-> 2370 **kwargs)
2371
2372 @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs)
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in fillna(self, value, method, axis, inplace, limit, downcast)
3264 else:
3265 raise ValueError("invalid fill value with a %s" %
-> 3266 type(value))
3267
3268 new_data = self._data.fillna(value=value, limit=limit,
ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
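How can I fill in the missing values without getting this error?
【Comments】:
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()] does not make sense. You are looking for nulls and trying to fill the nulls with nulls?
@ayhan I did it the way the tutorial does, and I thought it was supposed to fill the missing values with true
Sorry, it is trying to fill with df[df['LoanAmount'].isnull()].apply(fage, axis=1). Can you include the definition of the function fage and a small reproducible dataset?
@ayhan I have already given a link to the entire code I am using, but just in case, here it is: def fage(x): return table.loc[x['Self_Employed'], x['Education']]
@ayhan As for the dataset, it is also at the GitHub link in my question; the data is small, so you can download it from there.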
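【Answer 1】:
It seems the tutorial's author wanted to replace the NaN values with values from table. To do that, the data has to be aligned first: create a Series with unstack, and move the matching columns into the index with set_index.
First, remove the step that replaces NaN with the mean: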
url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas
#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)
table = df.pivot_table(values='LoanAmount',
index='Self_Employed',
columns='Education',
aggfunc=np.median)
print (table.unstack())
Education Self_Employed
Graduate No 130.0
Yes 157.5
Not Graduate No 113.0
Yes 130.0
dtype: float64
#check all values with NaN in LoanAmount column
print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate NaN
35 No Graduate NaN
63 No Graduate NaN
81 Yes Graduate NaN
95 No Graduate NaN
102 No Graduate NaN
103 No Graduate NaN
113 Yes Graduate NaN
127 No Graduate NaN
202 No Not Graduate NaN
284 No Graduate NaN
305 No Not Graduate NaN
322 No Not Graduate NaN
338 No Not Graduate NaN
387 No Not Graduate NaN
435 No Graduate NaN
437 No Graduate NaN
479 No Graduate NaN
524 No Graduate NaN
550 Yes Graduate NaN
551 No Not Graduate NaN
605 No Not Graduate NaN
# for checking: collect the index labels of the rows where LoanAmount is NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([ 0, 35, 63, 81, 95, 102, 103, 113, 127, 202, 284, 305, 322,
338, 387, 435, 437, 479, 524, 550, 551, 605],
dtype='int64')
# Replace missing values
df = df.set_index(['Education','Self_Employed'])
df['LoanAmount'].fillna(table.unstack(), inplace=True)
df = df.reset_index()
# check the output: show only the rows that were NaN before
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate 130.0
35 No Graduate 130.0
63 No Graduate 130.0
81 Yes Graduate 157.5
95 No Graduate 130.0
102 No Graduate 130.0
103 No Graduate 130.0
113 Yes Graduate 157.5
127 No Graduate 130.0
202 No Not Graduate 113.0
284 No Graduate 130.0
305 No Not Graduate 113.0
322 No Not Graduate 113.0
338 No Not Graduate 113.0
387 No Not Graduate 113.0
435 No Graduate 130.0
437 No Graduate 130.0
479 No Graduate 130.0
524 No Graduate 130.0
550 Yes Graduate 157.5
551 No Not Graduate 113.0
605 No Not Graduate 113.0
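For what it's worth, the same alignment can be sketched without moving columns into the index. This is only a minimal illustration on a made-up eight-row frame (the numbers are hypothetical, not taken from train.csv). It looks up each row's (Education, Self_Employed) pair in table.unstack() and assigns the result back instead of relying on inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loan data; the numbers are made up.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, np.nan, 140.0, 200.0, np.nan, 90.0, np.nan, 110.0],
})

table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

# Series indexed by (Education, Self_Employed), one median per group.
lookup = table.unstack()

# Build one lookup key per row, in the same level order as `lookup`.
keys = pd.MultiIndex.from_frame(df[["Education", "Self_Employed"]])

# Pull the group median for every row and re-attach the original row labels.
fill = pd.Series(lookup.reindex(keys).to_numpy(), index=df.index)

# Only the NaN positions are replaced; existing amounts are left alone.
df["LoanAmount"] = df["LoanAmount"].fillna(fill)
print(df)
```

The design choice here is to keep the original row index throughout, so no reset_index step is needed afterwards.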
Edit:
A better solution is to use groupby with apply and replace the NaN values with the median of each group:
url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas
#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No',inplace=True)
print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate NaN
35 No Graduate NaN
63 No Graduate NaN
81 Yes Graduate NaN
95 No Graduate NaN
102 No Graduate NaN
103 No Graduate NaN
113 Yes Graduate NaN
127 No Graduate NaN
202 No Not Graduate NaN
284 No Graduate NaN
305 No Not Graduate NaN
322 No Not Graduate NaN
338 No Not Graduate NaN
387 No Not Graduate NaN
435 No Graduate NaN
437 No Graduate NaN
479 No Graduate NaN
524 No Graduate NaN
550 Yes Graduate NaN
551 No Not Graduate NaN
605 No Not Graduate NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([ 0, 35, 63, 81, 95, 102, 103, 113, 127, 202, 284, 305, 322,
338, 387, 435, 437, 479, 524, 550, 551, 605],
dtype='int64')
# Replace missing values
df['LoanAmount'] = (df.groupby(['Education','Self_Employed'])['LoanAmount']
                      .apply(lambda x: x.fillna(x.median())))
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
Self_Employed Education LoanAmount
0 No Graduate 130.0
35 No Graduate 130.0
63 No Graduate 130.0
81 Yes Graduate 157.5
95 No Graduate 130.0
102 No Graduate 130.0
103 No Graduate 130.0
113 Yes Graduate 157.5
127 No Graduate 130.0
202 No Not Graduate 113.0
284 No Graduate 130.0
305 No Not Graduate 113.0
322 No Not Graduate 113.0
338 No Not Graduate 113.0
387 No Not Graduate 113.0
435 No Graduate 130.0
437 No Graduate 130.0
479 No Graduate 130.0
524 No Graduate 130.0
550 Yes Graduate 157.5
551 No Not Graduate 113.0
605 No Not Graduate 113.0
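A possible variant of the groupby idea, in case apply gives trouble: depending on the pandas version, groupby(...).apply(...) may return a result whose index includes the group keys, which does not assign back cleanly. The sketch below uses transform('median') instead, which returns a Series aligned to the original rows. The frame is a made-up example (hypothetical values, not train.csv), and this is an alternative I am suggesting, not what the answer above ran.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loan data; the numbers are made up.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, np.nan, 140.0, 200.0, np.nan, 90.0, np.nan, 110.0],
})

# One median per row, computed within that row's (Education, Self_Employed) group.
group_median = (
    df.groupby(["Education", "Self_Employed"])["LoanAmount"].transform("median")
)

# fillna aligns on the row index, so only the NaN rows receive their group median.
df["LoanAmount"] = df["LoanAmount"].fillna(group_median)
print(df)
```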
Edit:
There is one more problem:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The solution is to replace the NaNs:
df['Loan_Status'].fillna('No',inplace=True)
df['Credit_History'].fillna(0,inplace=True)
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df, predictor_var,outcome_var)
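Before fitting, it may help to check which of the columns passed to the model still contain NaN, since that is what the ValueError complains about. A minimal sketch, using a made-up four-row frame (hypothetical values) and the same predictor_var / outcome_var names as above:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the two columns used by the model are shown.
df = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0, np.nan],
    "Loan_Status": ["Y", "N", None, "Y"],
})
predictor_var = ["Credit_History"]
outcome_var = "Loan_Status"

# Count missing values per column among the columns the model will see.
nan_counts = df[predictor_var + [outcome_var]].isnull().sum()

# Show only the columns that still need filling.
print(nan_counts[nan_counts > 0])
```

Any column listed there would need a fillna (or the rows dropped) before it is passed to the classifier.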
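【Comments】:
Sure, but first do not forget to comment out this line: #df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
Do I also need to comment out the fage(x) function for the pivot table?
The fage function only returns values for 2 columns, but I think it is not necessary, so I omitted it.
Thanks, it works, but now I seem to be getting this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
Traceback (most recent call last): File "model.py", line 129, in
【Answer 2】:
This seems to work: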
df = pd.read_csv('01_scratch_train.csv') # work with original data #
df['Self_Employed'].fillna('No', inplace=True)
table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)
df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']]
def fage(x):
    return table.loc[x['Self_Employed'],x['Education']]
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']] # rechecking all values with NaN in LoanAmount column. No missing values.
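A slightly more defensive way to write the same row-wise fill, offered only as a sketch: it skips the fill when nothing is missing (so apply never runs on an empty selection) and assigns through df.loc rather than calling fillna(..., inplace=True) on a column selection, which recent pandas versions warn about. The frame below is made up (hypothetical values) and stands in for the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loan data; the numbers are made up.
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, np.nan, 140.0, 200.0, np.nan, 90.0, np.nan, 110.0],
})

table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

def fage(x):
    # Look up the group median for this row's Self_Employed / Education pair.
    return table.loc[x["Self_Employed"], x["Education"]]

mask = df["LoanAmount"].isnull()
if mask.any():
    # apply only sees rows that are actually missing, so it returns a Series.
    df.loc[mask, "LoanAmount"] = df.loc[mask].apply(fage, axis=1)

print(df)
```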
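【Comments】:
【Answer 3】:
I ran into the same problem. Here is the solution that worked for me. The problem is that you are trying to fill from an empty selection, because you have already run: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
After that, no NaN is left in LoanAmount, so selecting with df['LoanAmount'].isnull() yields an empty selection. That is why this line does not work: df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)
Try putting a # in front of this line: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True). The code should then work when you run it.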
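The diagnosis can be checked on a tiny frame. The sketch below uses made-up values (hypothetical, not train.csv): once the mean fill has removed every NaN, the isnull() selection has zero rows, and apply on that empty selection hands back a DataFrame rather than a Series, which is the type named in the error message.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the loan data; the numbers are made up.
df = pd.DataFrame({
    "Self_Employed": ["No", "Yes", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [100.0, np.nan, 80.0, 120.0],
})

# Step 1 from the question: the global mean fill removes every NaN.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

def fage(x):
    return table.loc[x["Self_Employed"], x["Education"]]

# Step 2: nothing is null any more, so this selection is empty.
missing_rows = df[df["LoanAmount"].isnull()]

# With no rows to call fage on, apply returns a copy of the empty frame.
fill_values = missing_rows.apply(fage, axis=1)

print(len(missing_rows))
print(type(fill_values).__name__)
```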