Split pandas dataframe column list values to duplicate rows [duplicate]

【Posted】: 2019-12-28 06:10:33

【Question】:

I have a dataframe that looks like this:

publication_title    authors                             type ...
title 1              ['author1', 'author2', 'author3']   proceedings
title 2              ['author4', 'author5']              collections
title 3              ['author6', 'author7']              books
.
.
. 

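What I want to do is take the 'authors' column and split the lists in it into separate rows, duplicating all the other columns. I also want to store the result in a new column named 'author' and keep the original column.

The following is exactly what I want to achieve: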

publication_title    authors                             author          type ...
title 1              ['author1', 'author2', 'author3']   author1         proceedings
title 1              ['author1', 'author2', 'author3']   author2         proceedings
title 1              ['author1', 'author2', 'author3']   author3         proceedings
title 2              ['author4', 'author5']              author4         collections
title 2              ['author4', 'author5']              author5         collections
title 3              ['author6', 'author7']              author6         books
title 3              ['author6', 'author7']              author7         books
.
.
. 

I tried to do this with the explode method of pandas DataFrame, but I could not find a way to store the result in a new column.

Thanks for your help.

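【Comments】:

【Answer 1】:

Since pandas 0.25.0 we have the explode method. First we copy the authors column under the new name author with assign, then we explode that new column into rows; the values of all other columns, including the original authors list, are repeated for each new row: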

df.assign(author=df['authors']).explode('author')

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
0           title_1  [author1, author2, author3]  proceedings  author2
0           title_1  [author1, author2, author3]  proceedings  author3
1           title_2           [author4, author5]  collections  author4
1           title_2           [author4, author5]  collections  author5
2           title_3           [author6, author7]        books  author6
2           title_3           [author6, author7]        books  author7
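
For readers who want to run this end to end, here is a minimal self-contained sketch. The sample frame is an assumption reconstructed from the table in the question (three rows, with the elided extra columns left out); only assign and explode come from the answer itself.

```python
import pandas as pd

# Sample data reconstructed from the question (assumption: only these three columns).
df = pd.DataFrame({
    'publication_title': ['title_1', 'title_2', 'title_3'],
    'authors': [['author1', 'author2', 'author3'],
                ['author4', 'author5'],
                ['author6', 'author7']],
    'type': ['proceedings', 'collections', 'books'],
})

# Copy 'authors' into a new 'author' column, then explode only the copy.
# The original 'authors' lists are repeated unchanged on every new row.
result = df.assign(author=df['authors']).explode('author')

print(result)
```

Because explode repeats the original row label for every list element, the index of result contains duplicates (0, 0, 0, 1, 1, 2, 2).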

If you want to get rid of the duplicated index values, use reset_index:

df.assign(author=df['authors']).explode('author').reset_index(drop=True)

Output

  publication_title                      authors         type   author
0           title_1  [author1, author2, author3]  proceedings  author1
1           title_1  [author1, author2, author3]  proceedings  author2
2           title_1  [author1, author2, author3]  proceedings  author3
3           title_2           [author4, author5]  collections  author4
4           title_2           [author4, author5]  collections  author5
5           title_3           [author6, author7]        books  author6
6           title_3           [author6, author7]        books  author7
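
As a possible alternative to the reset_index call, newer pandas versions (1.1.0 and later, not the 0.25.0 mentioned in the answer) accept an ignore_index argument on explode. The sketch below assumes such a version and reuses sample data reconstructed from the question.

```python
import pandas as pd

# Sample data reconstructed from the question (assumption).
df = pd.DataFrame({
    'publication_title': ['title_1', 'title_2', 'title_3'],
    'authors': [['author1', 'author2', 'author3'],
                ['author4', 'author5'],
                ['author6', 'author7']],
    'type': ['proceedings', 'collections', 'books'],
})

# ignore_index=True relabels the exploded rows 0..n-1 in one step
# (requires pandas >= 1.1.0).
result = (df.assign(author=df['authors'])
            .explode('author', ignore_index=True))

print(result)
```

On older versions, the reset_index(drop=True) form shown in the answer gives the same row labels.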

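【Comments】:

Thanks @Erfan, your solution is exactly what I was looking for.

【Answer 2】:

You can first build a new DataFrame from the author lists and stack it into a single Series (in this answer the list column is named author and the title column is named title):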

df2 = pd.DataFrame(df['author'].tolist(), index=df.index).stack()

Next we drop the second index level:

df2.index = df2.index.droplevel(1)

Next we can concatenate along the second axis (the columns):

>>> pd.concat([df, df2], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7

Or as a one-liner:

>>> pd.concat([df, pd.DataFrame(df['author'].tolist(), index=df.index).stack().reset_index(level=1, drop=True)], axis=1)
     title                       author         type        0
0  title 1  [author1, author2, author3]  proceedings  author1
0  title 1  [author1, author2, author3]  proceedings  author2
0  title 1  [author1, author2, author3]  proceedings  author3
1  title 2           [author4, author5]  collections  author4
1  title 2           [author4, author5]  collections  author5
2  title 3           [author6, author7]        books  author6
2  title 3           [author6, author7]        books  author7
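
The stacking approach can also be sketched as a self-contained script. Several details here are my assumptions rather than part of the answer: the column is named authors as in the question, the new column is named author instead of 0, an explicit dropna guards against padding values surviving the stack, and df.join replaces pd.concat, since concatenating along columns with a non-unique index may raise an error depending on the pandas version.

```python
import pandas as pd

# Sample data reconstructed from the question (assumption).
df = pd.DataFrame({
    'publication_title': ['title_1', 'title_2', 'title_3'],
    'authors': [['author1', 'author2', 'author3'],
                ['author4', 'author5'],
                ['author6', 'author7']],
    'type': ['proceedings', 'collections', 'books'],
})

# One column per list position; shorter lists are padded with missing values.
wide = pd.DataFrame(df['authors'].tolist(), index=df.index)

# Stack into a long Series and drop the padding explicitly, so the result
# does not depend on how stack() treats missing values.
stacked = wide.stack().dropna()

# Drop the inner index level (the list position) and name the Series.
stacked = stacked.reset_index(level=1, drop=True).rename('author')

# Join on the index: each original row is repeated once per author.
result = df.join(stacked)

print(result)
```

This route does not need explode, so it may be useful on pandas versions older than 0.25.0.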

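【Comments】:

【Answer 3】:

You have already found explode, which means you are almost there! Just merge the exploded data back with the original data; see the code below: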

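import pandas as pd

# data
df = pd.DataFrame({'publication_title': ['title_1', 'title_2', 'title_3'],
                   'authors': [['author1', 'author2', 'author3'], ['author4', 'author5'], ['author6', 'author7']],
                   'type': ['proceedings', 'collections', 'books']})

# explode the lists, rename the exploded column, then merge the original
# frame back in (on the shared columns) to restore the 'authors' lists
(df.explode(column='authors')
   .rename(columns={'authors': 'author'})
   .merge(df))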
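
Without an on argument, merge joins on every column name the two frames share, which here are publication_title and type. A minimal sketch that spells those keys out is given below; the explicit on list and the sample data are my assumptions, not part of the answer.

```python
import pandas as pd

# Sample data reconstructed from the question (assumption).
df = pd.DataFrame({
    'publication_title': ['title_1', 'title_2', 'title_3'],
    'authors': [['author1', 'author2', 'author3'],
                ['author4', 'author5'],
                ['author6', 'author7']],
    'type': ['proceedings', 'collections', 'books'],
})

# Explode the lists and rename the exploded column to 'author'.
exploded = df.explode('authors').rename(columns={'authors': 'author'})

# Merge the original frame back in on explicitly named key columns,
# which restores the untouched 'authors' lists next to each author.
result = exploded.merge(df, on=['publication_title', 'type'])

print(result)
```

Note that this relies on the key columns identifying each original row: if two publications shared both title and type, the merge would pair their rows with each other and produce extra rows.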

【Comments】:
