ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>

Posted: 2017-11-11 01:18:25

[Question]:

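I am working through the Loan Prediction practice problem and trying to fill the missing values in my data. I got the data from here. To solve the problem I followed this tutorial.

You can find the entire code I am using (file name model.py) and the data on GitHub here.

The DataFrame looks like this: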

df[['Loan_ID', 'Self_Employed', 'Education', 'LoanAmount']].head(10)
Out: 
    Loan_ID Self_Employed     Education  LoanAmount
0  LP001002            No      Graduate         NaN
1  LP001003            No      Graduate       128.0
2  LP001005           Yes      Graduate        66.0
3  LP001006            No  Not Graduate       120.0
4  LP001008            No      Graduate       141.0
5  LP001011           Yes      Graduate       267.0
6  LP001013            No  Not Graduate        95.0
7  LP001014            No      Graduate       158.0
8  LP001018            No      Graduate       168.0
9  LP001020            No      Graduate       349.0

After executing the last line of the following code (line 60 of model.py):

import numpy as np
import pandas as pd

url = 'https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'
df = pd.read_csv(url)
# Note: this fills every NaN in LoanAmount with the overall mean
df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)
df['Self_Employed'].fillna('No', inplace=True)

table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]
# Replace missing values
df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

I get this error:

ValueError                                Traceback (most recent call last)
<ipython-input-40-5146e49c2460> in <module>()
----> 1 df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
   2368                                           axis=axis, inplace=inplace,
   2369                                           limit=limit, downcast=downcast,
-> 2370                                           **kwargs)
   2371 
   2372     @Appender(generic._shared_docs['shift'] % _shared_doc_kwargs)

/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in fillna(self, value, method, axis, inplace, limit, downcast)
   3264                 else:
   3265                     raise ValueError("invalid fill value with a %s" %
-> 3266                                      type(value))
   3267 
   3268                 new_data = self._data.fillna(value=value, limit=limit,

ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>

How can I fill the missing values without getting this error?

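[Comments on the question]:

df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()] This doesn't make sense. You are looking up the nulls and trying to fill the nulls with nulls?

@ayhan I did it the way the tutorial does it; I thought it was supposed to fill the missing values with true

Sorry, it tries to fill with df[df['LoanAmount'].isnull()].apply(fage, axis=1). Can you include the definition of the function fage and a small reproducible dataset?

@ayhan I have given a link to the entire code I am using, but just in case, here it is: def fage(x): ...: return table.loc[x['Self_Employed'], x['Education']]

@ayhan As for the dataset, it is also at the GitHub link in the question; the data is small and you can download it from there.

[Answer 1]: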

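It seems the author of the tutorial wanted to replace the NaN values with the values from table.

To do that, you first need to create a Series with unstack and give df a matching index with set_index, so that the data align.

First, remove the line that replaces NaN with the mean: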

url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'

df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas

#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

df['Self_Employed'].fillna('No',inplace=True)

table = df.pivot_table(values='LoanAmount', 
                       index='Self_Employed', 
                       columns='Education', 
                       aggfunc=np.median)

print (table.unstack())
Education     Self_Employed
Graduate      No               130.0
              Yes              157.5
Not Graduate  No               113.0
              Yes              130.0
dtype: float64

#check all values with NaN in LoanAmount column
print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate         NaN
35             No      Graduate         NaN
63             No      Graduate         NaN
81            Yes      Graduate         NaN
95             No      Graduate         NaN
102            No      Graduate         NaN
103            No      Graduate         NaN
113           Yes      Graduate         NaN
127            No      Graduate         NaN
202            No  Not Graduate         NaN
284            No      Graduate         NaN
305            No  Not Graduate         NaN
322            No  Not Graduate         NaN
338            No  Not Graduate         NaN
387            No  Not Graduate         NaN
435            No      Graduate         NaN
437            No      Graduate         NaN
479            No      Graduate         NaN
524            No      Graduate         NaN
550           Yes      Graduate         NaN
551            No  Not Graduate         NaN
605            No  Not Graduate         NaN

# for checking later: get the index labels of all rows where LoanAmount is NaN
idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([  0,  35,  63,  81,  95, 102, 103, 113, 127, 202, 284, 305, 322,
            338, 387, 435, 437, 479, 524, 550, 551, 605],
           dtype='int64')

# Replace missing values
df = df.set_index(['Education','Self_Employed'])
df['LoanAmount'].fillna(table.unstack(), inplace=True)
df = df.reset_index()

#check output - filter only indexes where NaNs before
print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate       130.0
35             No      Graduate       130.0
63             No      Graduate       130.0
81            Yes      Graduate       157.5
95             No      Graduate       130.0
102            No      Graduate       130.0
103            No      Graduate       130.0
113           Yes      Graduate       157.5
127            No      Graduate       130.0
202            No  Not Graduate       113.0
284            No      Graduate       130.0
305            No  Not Graduate       113.0
322            No  Not Graduate       113.0
338            No  Not Graduate       113.0
387            No  Not Graduate       113.0
435            No      Graduate       130.0
437            No      Graduate       130.0
479            No      Graduate       130.0
524            No      Graduate       130.0
550           Yes      Graduate       157.5
551            No  Not Graduate       113.0
605            No  Not Graduate       113.0
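
The same alignment idea can be sketched without changing the index of df: look up each row's group median with a join and pass the resulting Series to fillna. This is only an illustration on made-up toy data; the names medians, joined and group_median are assumptions introduced here and do not come from the original answer or dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No", "No"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate",
                  "Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [np.nan, 120.0, 140.0, 150.0, np.nan, 110.0, np.nan],
})

# Median LoanAmount per (Self_Employed, Education) combination
table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

# Series indexed by (Education, Self_Employed); it needs a name to be joined
medians = table.unstack().rename("group_median")

# Attach each row's group median; the row index of df is preserved
joined = df.join(medians, on=["Education", "Self_Employed"])

# Fill with a Series aligned on the row index, assigning back instead of inplace
df["LoanAmount"] = df["LoanAmount"].fillna(joined["group_median"])
print(df)
```

Assigning the result back to the column avoids relying on inplace=True on a column selection, which may not update df in newer pandas versions.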

Edit:

A better solution is to use groupby with apply and replace the NaN values in each group with that group's median:

url='https://raw.githubusercontent.com/Aniruddh-SK/Loan-Prediction-Problem/master/train.csv'

df = pd.read_csv(url) #Reading the dataset in a dataframe using Pandas

#df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

df['Self_Employed'].fillna('No',inplace=True)


print (df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate         NaN
35             No      Graduate         NaN
63             No      Graduate         NaN
81            Yes      Graduate         NaN
95             No      Graduate         NaN
102            No      Graduate         NaN
103            No      Graduate         NaN
113           Yes      Graduate         NaN
127            No      Graduate         NaN
202            No  Not Graduate         NaN
284            No      Graduate         NaN
305            No  Not Graduate         NaN
322            No  Not Graduate         NaN
338            No  Not Graduate         NaN
387            No  Not Graduate         NaN
435            No      Graduate         NaN
437            No      Graduate         NaN
479            No      Graduate         NaN
524            No      Graduate         NaN
550           Yes      Graduate         NaN
551            No  Not Graduate         NaN
605            No  Not Graduate         NaN

idx = df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']].index
print (idx)
Int64Index([  0,  35,  63,  81,  95, 102, 103, 113, 127, 202, 284, 305, 322,
            338, 387, 435, 437, 479, 524, 550, 551, 605],
           dtype='int64')

# Replace missing values
# Parentheses are needed so the statement can continue on the next line
df['LoanAmount'] = (df.groupby(['Education','Self_Employed'])['LoanAmount']
                      .apply(lambda x: x.fillna(x.median())))

print (df.loc[df.index.isin(idx), ['Self_Employed','Education', 'LoanAmount']])
    Self_Employed     Education  LoanAmount
0              No      Graduate       130.0
35             No      Graduate       130.0
63             No      Graduate       130.0
81            Yes      Graduate       157.5
95             No      Graduate       130.0
102            No      Graduate       130.0
103            No      Graduate       130.0
113           Yes      Graduate       157.5
127            No      Graduate       130.0
202            No  Not Graduate       113.0
284            No      Graduate       130.0
305            No  Not Graduate       113.0
322            No  Not Graduate       113.0
338            No  Not Graduate       113.0
387            No  Not Graduate       113.0
435            No      Graduate       130.0
437            No      Graduate       130.0
479            No      Graduate       130.0
524            No      Graduate       130.0
550           Yes      Graduate       157.5
551            No  Not Graduate       113.0
605            No  Not Graduate       113.0
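
A variation on the groupby approach, sketched on made-up toy data: transform returns a Series with the same index as df, so it can be passed straight to fillna. This is an assumption about what is convenient in current pandas, where groupby(...).apply may add the group keys to the result index; it is not the code from the answer above, and group_median is a name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
df = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "No", "No"],
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate",
                  "Graduate", "Not Graduate", "Not Graduate"],
    "LoanAmount": [np.nan, 120.0, 140.0, 150.0, np.nan, 110.0, np.nan],
})

# Per-row median of the row's (Education, Self_Employed) group,
# indexed exactly like df
group_median = (df.groupby(["Education", "Self_Employed"])["LoanAmount"]
                  .transform("median"))

# Only the NaN entries are replaced; existing values are kept
df["LoanAmount"] = df["LoanAmount"].fillna(group_median)
print(df)
```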

Edit:

There is one more problem:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The solution is to replace the NaNs:

df['Loan_Status'].fillna('No',inplace=True)
df['Credit_History'].fillna(0,inplace=True) 

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']

classification_model(model, df, predictor_var,outcome_var)
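
Before fitting, it can help to check which of the columns passed to the model still contain NaN. A minimal sketch on a made-up toy frame: only the column names and the variables predictor_var and outcome_var are taken from the snippet above, while cols, nan_before and nan_after are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data comes from train.csv
df = pd.DataFrame({
    "Credit_History": [1.0, np.nan, 0.0, np.nan],
    "Loan_Status": ["Y", "N", "Y", "N"],
})
predictor_var = ["Credit_History"]
outcome_var = "Loan_Status"

# Count missing values in every column that will be passed to the model
cols = predictor_var + [outcome_var]
nan_before = df[cols].isnull().sum()
print(nan_before)

# Replace the NaNs, then count again
df["Credit_History"] = df["Credit_History"].fillna(0)
nan_after = df[cols].isnull().sum()
print(nan_after)
```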

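[Comments]:

Sure, but first don't forget to comment out #df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

Do I also need to comment out the fage(x) function for the pivot table?

The fage function only returns 2 columns, but I think it is not necessary, so I omitted it.

Thanks, it works, but now I seem to be getting this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64') Traceback (most recent call last): File "model.py", line 129, in classification_model(model, df, predictor_var, outcome_var) File "model.py", line 96, in classification_model model.fit(data[predictors], data[outcome])

[Answer 2]: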

This seems to work:

df = pd.read_csv('01_scratch_train.csv') # work with original data #

df['Self_Employed'].fillna('No', inplace=True)

table = df.pivot_table(values='LoanAmount', index='Self_Employed' ,columns='Education', aggfunc=np.median)

df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']]

def fage(x):
    return table.loc[x['Self_Employed'],x['Education']]


df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

df.loc[df['LoanAmount'].isnull(), ['Self_Employed','Education', 'LoanAmount']] # rechecking all values with NaN in LoanAmount column. No missing values.

[Answer 3]:

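I ran into the same problem. Here is the solution that worked for me. The problem is that you are trying to fill from an empty selection, because you have already run: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True)

So when you select with df['LoanAmount'].isnull(), the result is an empty selection. That is why this line of code does not work: df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1), inplace=True)

Try putting a # in front of this line: df['LoanAmount'].fillna(df['LoanAmount'].mean(), inplace=True) The code should work when you run it after that.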
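
The explanation above can be reproduced on made-up toy data. Once the mean fill has run, the isnull() selection has no rows, and apply with axis=1 on such an empty frame appears to hand back a DataFrame rather than a Series, which fillna then rejects as a fill value. The toy data and the names missing and result are assumptions for illustration, and the exact behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
df = pd.DataFrame({
    "Self_Employed": ["No", "Yes", "No", "Yes"],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    "LoanAmount": [100.0, np.nan, 80.0, 200.0],
})

table = df.pivot_table(values="LoanAmount", index="Self_Employed",
                       columns="Education", aggfunc="median")

def fage(x):
    return table.loc[x["Self_Employed"], x["Education"]]

# Filling with the overall mean first leaves no NaN behind
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

# The selection of rows with a missing LoanAmount is now empty
missing = df[df["LoanAmount"].isnull()]
print(len(missing))

# Applying fage row-wise to a frame without rows; inspect what comes back
result = missing.apply(fage, axis=1)
print(type(result))
```

If the mean fill is skipped, missing is not empty, apply returns one value per row as a Series, and fillna accepts it.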
