Split pandas dataframe column list values to duplicate rows [duplicate]
【Posted】: 2019-12-28 06:10:33 【Question】: I have a dataframe that looks like this:
publication_title authors type ...
title 1 ['author1', 'author2', 'author3'] proceedings
title 2 ['author4', 'author5'] collections
title 3 ['author6', 'author7'] books
.
.
.
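What I want to do is take the 'authors' column and split the lists in it into rows, duplicating all the other columns. I also want to store the result in a new column named 'author' and keep the original column.
The following is exactly what I want to achieve: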
publication_title authors author type ...
title 1 ['author1', 'author2', 'author3'] author1 proceedings
title 1 ['author1', 'author2', 'author3'] author2 proceedings
title 1 ['author1', 'author2', 'author3'] author3 proceedings
title 2 ['author4', 'author5'] author4 collections
title 2 ['author4', 'author5'] author5 collections
title 3 ['author6', 'author7'] author6 books
title 3 ['author6', 'author7'] author7 books
.
.
.
I tried to achieve this with the explode method of pandas DataFrame, but I could not find a way to store the result in a new column.
Thanks for your help.
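For reference, here is a minimal sketch of the starting point described above. The sample values are illustrative stand-ins for the real data. It shows that calling explode directly on 'authors' replaces the lists in place instead of creating a separate 'author' column:

```python
import pandas as pd

# Illustrative sample data shaped like the frame in the question.
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [["author1", "author2", "author3"],
                ["author4", "author5"],
                ["author6", "author7"]],
    "type": ["proceedings", "collections", "books"],
})

# explode() works on the named column itself: each list element gets its
# own row, but the original lists are no longer present in the result.
exploded = df.explode("authors")

print(exploded)
```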
【Comments】:
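【Answer 1】: Since pandas 0.25.0 we have the explode method. First we copy the authors column and rename the copy at the same time using assign, then we explode this new column into rows, which duplicates the other columns: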
df.assign(author=df['authors']).explode('author')
Output
publication_title authors type author
0 title_1 [author1, author2, author3] proceedings author1
0 title_1 [author1, author2, author3] proceedings author2
0 title_1 [author1, author2, author3] proceedings author3
1 title_2 [author4, author5] collections author4
1 title_2 [author4, author5] collections author5
2 title_3 [author6, author7] books author6
2 title_3 [author6, author7] books author7
If you want to get rid of the duplicated index values, use reset_index:
df.assign(author=df['authors']).explode('author').reset_index(drop=True)
Output
publication_title authors type author
0 title_1 [author1, author2, author3] proceedings author1
1 title_1 [author1, author2, author3] proceedings author2
2 title_1 [author1, author2, author3] proceedings author3
3 title_2 [author4, author5] collections author4
4 title_2 [author4, author5] collections author5
5 title_3 [author6, author7] books author6
6 title_3 [author6, author7] books author7
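The two steps above can be put together as a self-contained sketch. The sample frame is an assumption reconstructed from the question, and the variable name result is only for illustration:

```python
import pandas as pd

# Sample data reconstructed from the question (values are illustrative).
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [["author1", "author2", "author3"],
                ["author4", "author5"],
                ["author6", "author7"]],
    "type": ["proceedings", "collections", "books"],
})

result = (
    df.assign(author=df["authors"])   # copy the list column under a new name
      .explode("author")              # one row per list element; other columns repeat
      .reset_index(drop=True)         # replace the repeated index with 0..n-1
)

print(result)
```

The original 'authors' column keeps its lists on every row, while 'author' holds one element per row. Note that explode turns an empty list into a single row with NaN, so such rows may need a dropna on 'author' afterwards.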
【Comments】:
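Thanks @Erfan, your solution is exactly what I wanted.【Answer 2】: You can first create a new DataFrame from the authors (in this answer the list column is named author):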
df2 = pd.DataFrame(df['author'].tolist(), index=df.index).stack()
Next we drop the second index level:
df2.index = df2.index.droplevel(1)
Next we can concatenate along the second axis:
>>> pd.concat([df, df2], axis=1)
title author type 0
0 title 1 [author1, author2, author3] proceedings author1
0 title 1 [author1, author2, author3] proceedings author2
0 title 1 [author1, author2, author3] proceedings author3
1 title 2 [author4, author5] collections author4
1 title 2 [author4, author5] collections author5
2 title 3 [author6, author7] books author6
2 title 3 [author6, author7] books author7
Or as a one-liner:
>>> pd.concat([df, pd.DataFrame(df['author'].tolist(), index=df.index).stack().reset_index(level=1, drop=True)], axis=1)
title author type 0
0 title 1 [author1, author2, author3] proceedings author1
0 title 1 [author1, author2, author3] proceedings author2
0 title 1 [author1, author2, author3] proceedings author3
1 title 2 [author4, author5] collections author4
1 title 2 [author4, author5] collections author5
2 title 3 [author6, author7] books author6
2 title 3 [author6, author7] books author7
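A self-contained variant of this approach, using the question's column names, might look like the sketch below. It departs from the code above in three labeled ways. It names the stacked Series 'author' so the new column is not called 0. It adds a dropna() after stack() as a precaution, because whether stack drops the padding values depends on the pandas version. And it uses DataFrame.join for the last step, a deliberate substitution because concatenating along axis=1 with repeated index labels may raise on some pandas versions.

```python
import pandas as pd

# Sample data reconstructed from the question (values are illustrative).
df = pd.DataFrame({
    "publication_title": ["title_1", "title_2", "title_3"],
    "authors": [["author1", "author2", "author3"],
                ["author4", "author5"],
                ["author6", "author7"]],
    "type": ["proceedings", "collections", "books"],
})

authors_long = (
    pd.DataFrame(df["authors"].tolist(), index=df.index)  # one column per list position
      .stack()                          # wide -> long, indexed by (row label, position)
      .dropna()                         # discard padding left by shorter lists
      .reset_index(level=1, drop=True)  # keep only the original row label
      .rename("author")                 # name the Series so the new column is 'author'
)

# Index-aligned left join: each original row repeats once per author.
result = df.join(authors_long)

print(result)
```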
【Comments】:
【Answer 3】: You have already found explode, which means you are almost there! Just merge the original data with the exploded data; see the code below:
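import pandas as pd

# data
df = pd.DataFrame({'publication_title': ['title_1', 'title_2', 'title_3'],
                   'authors': [['author1', 'author2', 'author3'], ['author4', 'author5'], ['author6', 'author7']],
                   'type': ['proceedings', 'collections', 'books']})

# explode the lists, rename the exploded column, then merge the original lists back in
# (merge joins on the shared columns: publication_title and type)
(df.explode(column='authors')
   .rename(columns={'authors': 'author'})
   .merge(df))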
【Comments】: