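How do I interpolate values column by column for empty rows that share the same row name, and then copy the interpolated data back into the original DataFrame?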
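I have a spreadsheet with statistics from the World Happiness Report 2019, which will later be used for visualization and a linear regression problem (this is a group project; my part is cleaning the data so that it contains as few null values as possible).
I am only interested in 2010 and later years. For some countries, the data for particular years is missing entirely (for example, Ethiopia is missing 2010 and 2011). I want to predict the missing parameters for those countries (Life Ladder and GDP per capita) by interpolation.
The file is available here: https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls
So far I have created a new DataFrame for each country and tried to interpolate within it (code below). Note that dropdata is a DataFrame I created by dropping countries with too little available information, such as Oman.
I also manually inserted rows into the original spreadsheet containing the country and year (for example, Ethiopia, 2011) with blank data values.
But the interpolation does not work at all. I keep seeing NaN values, and the new rows I inserted do not show up when I print the DataFrame.
Here is sample output.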
Country name Year Life Ladder Log GDP per capita Social support
Ethiopia 2012 4.561169 7.115237 0.658794
Ethiopia 2013 4.444827 7.189737 0.602482
Ethiopia 2014 4.506647 7.261595 0.640452
Ethiopia 2015 4.573155 7.335052 0.625597
Ethiopia 2016 4.297849 7.382929 0.718719
Ethiopia 2017 4.180315 7.455834 0.733540
Ethiopia 2018 4.379262 7.524517 0.740155
Healthy life expectancy at birth Freedom to make life choices
55.200001 0.776308
55.799999 0.706796
56.400002 0.693559
57.000000 0.802643
57.500000 0.744308
58.000000 0.717101
58.500000 0.740343
Generosity Perceptions of corruption
-0.036612 NaN
-0.000997 0.750478
0.086612 0.701800
0.118702 0.567027
0.045363 0.702881
0.007519 0.756899
0.043274 0.799466
The code I used:
country_list = dropdata['Country name']
for country in country_list:
    # Create a DataFrame for each country.
    countryDF = dropdata.loc[dropdata['Country name'] == country, :]
    # Keep the first 20 rows and the first 9 columns.
    countryDF2 = countryDF.iloc[0:20, 0:9]
    countryDF2.interpolate(method='values', axis=0, limit_direction='both', limit=3)
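Although I interpolate in both directions, NaN values remain. More importantly, I have to copy the interpolated values from each country's DataFrame back into the original DataFrame (dropdata) for all rows. Where do I start?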
Use GroupBy.apply with a custom function that selects the values by position only, but first add the missing rows with DataFrame.reindex and MultiIndex.from_product:
import numpy as np
import pandas as pd

df = pd.read_excel('Chapter2OnlineData.xls')
# Build every (country, year) pair over the full range of years.
mux = pd.MultiIndex.from_product([df['Country name'].unique(),
                                  np.arange(df['Year'].min(), df['Year'].max() + 1)],
                                 names=['Country name', 'Year'])
df = df.set_index(['Country name', 'Year']).reindex(mux).reset_index()
print (df[df['Country name'] == 'Algeria'].iloc[0:20, 0:9])
Country name Year Life Ladder Log GDP per capita Social support
28 Algeria 2005 NaN NaN NaN
29 Algeria 2006 NaN NaN NaN
30 Algeria 2007 NaN NaN NaN
31 Algeria 2008 NaN NaN NaN
32 Algeria 2009 NaN NaN NaN
33 Algeria 2010 5.463567 9.462701 NaN
34 Algeria 2011 5.317194 9.471962 0.810234
35 Algeria 2012 5.604596 9.485086 0.839397
36 Algeria 2013 NaN NaN NaN
37 Algeria 2014 6.354898 9.509210 0.818189
38 Algeria 2015 NaN NaN NaN
39 Algeria 2016 5.340854 9.541166 0.748588
40 Algeria 2017 5.248912 9.540639 0.806754
41 Algeria 2018 5.043086 9.557952 0.798651
Healthy life expectancy at birth Freedom to make life choices
28 NaN NaN
29 NaN NaN
30 NaN NaN
31 NaN NaN
32 NaN NaN
33 64.500000 0.592696
34 64.660004 0.529561
35 64.820000 0.586663
36 NaN NaN
37 65.139999 NaN
38 NaN NaN
39 65.500000 NaN
40 65.699997 0.436670
41 65.900002 0.583381
Generosity Perceptions of corruption
28 NaN NaN
29 NaN NaN
30 NaN NaN
31 NaN NaN
32 NaN NaN
33 -0.229078 0.618038
34 -0.204406 0.637982
35 -0.195859 0.690116
36 NaN NaN
37 NaN NaN
38 NaN NaN
39 NaN NaN
40 -0.191522 0.699774
41 -0.172413 0.758704
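To see the reindexing step in isolation, here is a minimal sketch on a made-up toy frame. The names `toy`, `full_index` and `toy_full`, and the values themselves, are assumptions for illustration only and are not taken from the report data.

```python
import numpy as np
import pandas as pd

# Toy data (made up): country "A" is missing 2011, country "B" is missing 2010.
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "Year": [2010, 2012, 2011, 2012],
    "Life Ladder": [4.0, 5.0, 6.0, 7.0],
})

# Every (country, year) pair over the full range of years.
full_index = pd.MultiIndex.from_product(
    [toy["Country name"].unique(),
     np.arange(toy["Year"].min(), toy["Year"].max() + 1)],
    names=["Country name", "Year"],
)

# Reindexing inserts the missing pairs as rows filled with NaN.
toy_full = (toy.set_index(["Country name", "Year"])
               .reindex(full_index)
               .reset_index())
print(toy_full)
```

The result has one row per country and year, so there is no need to add the blank rows to the spreadsheet by hand.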
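def f(x):
    # Interpolate within one country's rows; limit=3 fills at most
    # 3 consecutive NaNs in each direction.
    x.iloc[0:20, 0:9] = x.iloc[0:20, 0:9].interpolate(method='values',
                                                      axis=0,
                                                      limit_direction='both',
                                                      limit=3)
    return x

# group_keys=False keeps the original row index in the result.
df = df.groupby('Country name', group_keys=False).apply(f)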
print (df[df['Country name'] == 'Algeria'].iloc[0:20, 0:9])
Country name Year Life Ladder Log GDP per capita Social support
28 Algeria 2005 NaN NaN NaN
29 Algeria 2006 NaN NaN NaN
30 Algeria 2007 5.463567 9.462701 NaN
31 Algeria 2008 5.463567 9.462701 0.810234
32 Algeria 2009 5.463567 9.462701 0.810234
33 Algeria 2010 5.463567 9.462701 0.810234
34 Algeria 2011 5.317194 9.471962 0.810234
35 Algeria 2012 5.604596 9.485086 0.839397
36 Algeria 2013 5.979747 9.497148 0.828793
37 Algeria 2014 6.354898 9.509210 0.818189
38 Algeria 2015 5.847876 9.525188 0.783389
39 Algeria 2016 5.340854 9.541166 0.748588
40 Algeria 2017 5.248912 9.540639 0.806754
41 Algeria 2018 5.043086 9.557952 0.798651
Healthy life expectancy at birth Freedom to make life choices
28 NaN NaN
29 NaN NaN
30 64.500000 0.592696
31 64.500000 0.592696
32 64.500000 0.592696
33 64.500000 0.592696
34 64.660004 0.529561
35 64.820000 0.586663
36 64.980000 0.556665
37 65.139999 0.526666
38 65.320000 0.496668
39 65.500000 0.466669
40 65.699997 0.436670
41 65.900002 0.583381
Generosity Perceptions of corruption
28 NaN NaN
29 NaN NaN
30 -0.229078 0.618038
31 -0.229078 0.618038
32 -0.229078 0.618038
33 -0.229078 0.618038
34 -0.204406 0.637982
35 -0.195859 0.690116
36 -0.194991 0.692048
37 -0.194124 0.693979
38 -0.193257 0.695911
39 -0.192389 0.697843
40 -0.191522 0.699774
41 -0.172413 0.758704
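One likely reason the loop in the question seemed to do nothing is that `interpolate` returns a new object, and the result was never assigned. As a possible alternative to the apply-based function, the sketch below writes the per-country interpolation straight back into the original frame with `groupby(...).transform`. The toy data and the names `toy` and `value_cols` are assumptions for illustration, and it uses `method="linear"` instead of `method="values"`.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with gaps inside and at the edges of each country.
toy = pd.DataFrame({
    "Country name": ["A"] * 4 + ["B"] * 4,
    "Year": [2010, 2011, 2012, 2013] * 2,
    "Life Ladder": [4.0, np.nan, 5.0, np.nan, np.nan, 6.0, np.nan, 7.0],
})

# Hypothetical list of the numeric columns to fill.
value_cols = ["Life Ladder"]

# interpolate() returns a new object, so the result is assigned back.
# transform() keeps the original index, so the assignment aligns row by row.
toy[value_cols] = (
    toy.groupby("Country name")[value_cols]
       .transform(lambda s: s.interpolate(method="linear",
                                          limit_direction="both",
                                          limit=3))
)
print(toy)
```

Because the interpolation runs inside each group, values from one country never leak into another, and the filled values land directly in the original DataFrame with no separate copy-back step.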