Python: Merge several columns of a dataframe without having duplicates of data
Posted: 2021-02-23 20:48:28
Question: Suppose I have this dataframe:
import pandas as pd

Name = ['Lolo', 'Mike', 'Tobias', 'Luke', 'Sam']
Age = [19, 34, 13, 45, 52]
Info_1 = ['Tall', 'Large', 'Small', 'Small', '']
Info_2 = ['New York', 'Paris', 'Lisbon', '', 'Berlin']
Info_3 = ['Tall', 'Paris', 'Hi', 'Small', 'Thanks']
Data = [123, 268, 76, 909, 87]
Sex = ['F', 'M', 'M', 'M', 'M']
# The constructor takes a dict that maps column names to lists of values
df = pd.DataFrame({'Name': Name, 'Age': Age, 'Info_1': Info_1, 'Info_2': Info_2, 'Info_3': Info_3, 'Data': Data, 'Sex': Sex})
print(df)
     Name  Age Info_1    Info_2  Info_3  Data Sex
0    Lolo   19   Tall  New York    Tall   123   F
1    Mike   34  Large     Paris   Paris   268   M
2  Tobias   13  Small    Lisbon      Hi    76   M
3    Luke   45  Small             Small   909   M
4     Sam   52           Berlin  Thanks    87   M
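I want to merge the data from four columns of this dataframe: Info_1, Info_2, Info_3 and Data. I want to merge them without repeating any value within a row. For row 0, for example, I don't want 'Tall' to appear twice. In the end I would like to get something like this: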
     Name  Age                Info Sex
0    Lolo   19   Tall New York 123   F
1    Mike   34     Large Paris 268   M
2  Tobias   13  Small Lisbon Hi 76   M
3    Luke   45           Small 909   M
4     Sam   52    Berlin Thanks 87   M
I tried this to merge the data:
df['period'] = df[['Info_1', 'Info_2', 'Info_3', 'Data']].agg('-'.join, axis=1)
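But I get an error because join expects strings. How can I merge the values of the Data column as well? And how can I check that I have not created duplicates?
Thanks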
Answer 1: Your Data column appears to be of int type. Convert it to a string first:
df['Data'] = df['Data'].astype(str)
df['period'] = (
    df[['Info_1', 'Info_2', 'Info_3', 'Data']]
    # per row: drop empty strings, keep the first occurrence of each value, join with spaces
    .apply(lambda x: ' '.join(x[x != ''].unique()), axis=1)
)
Output:
     Name  Age Info_1    Info_2  Info_3 Data Sex              period
0    Lolo   19   Tall  New York    Tall  123   F   Tall New York 123
1    Mike   34  Large     Paris   Paris  268   M     Large Paris 268
2  Tobias   13  Small    Lisbon      Hi   76   M  Small Lisbon Hi 76
3    Luke   45  Small             Small  909   M           Small 909
4     Sam   52           Berlin  Thanks   87   M    Berlin Thanks 87
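The output above still carries the four source columns, while the question asks for only Name, Age, Info and Sex. A minimal sketch of one way to get that shape, assuming the sample data from the question; the helper name `merge_unique` and the variables `cols`, `merged` and `result` are illustrative and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Lolo', 'Mike', 'Tobias', 'Luke', 'Sam'],
    'Age': [19, 34, 13, 45, 52],
    'Info_1': ['Tall', 'Large', 'Small', 'Small', ''],
    'Info_2': ['New York', 'Paris', 'Lisbon', '', 'Berlin'],
    'Info_3': ['Tall', 'Paris', 'Hi', 'Small', 'Thanks'],
    'Data': [123, 268, 76, 909, 87],
    'Sex': ['F', 'M', 'M', 'M', 'M'],
})

cols = ['Info_1', 'Info_2', 'Info_3', 'Data']  # columns to merge


def merge_unique(row):
    # Hypothetical helper: keep non-empty values, first occurrence only.
    # dict.fromkeys preserves insertion order, so the column order survives.
    values = [v for v in row if v != '']
    return ' '.join(dict.fromkeys(values))


# Cast to str on a copy of the selected columns, so df['Data'] stays int.
merged = df[cols].astype(str).apply(merge_unique, axis=1)

# Drop the source columns and place the merged column where Info_1 used to be.
result = df.drop(columns=cols)
result.insert(2, 'Info', merged)
print(result)
```

Because the cast happens on `df[cols]` rather than on `df` itself, the original Data column keeps its integer type, which may matter if it is used for arithmetic later.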
Answer 2: I think the simplest way may be to first concatenate all the fields you need, with a space between them:
df['Info'] = df.Info_1 + ' ' + df.Info_2 + ' ' + df.Info_3 + ' ' + df.Data.astype(str)
Then you can write a function that removes duplicate words from a string, like this:
def remove_dup_words(s):
    words = s.split(' ')
    unique_words = pd.Series(words).drop_duplicates().tolist()
    return ' '.join(unique_words)
And apply that function to the Info field:
df['Info'] = df.Info.apply(remove_dup_words)
All the code put together:
import pandas as pd

def remove_dup_words(s):
    # Split on single spaces and keep the first occurrence of each word
    words = s.split(' ')
    unique_words = pd.Series(words).drop_duplicates().tolist()
    return ' '.join(unique_words)

Name = ['Lolo', 'Mike', 'Tobias', 'Luke', 'Sam']
Age = [19, 34, 13, 45, 52]
Info_1 = ['Tall', 'Large', 'Small', 'Small', '']
Info_2 = ['New York', 'Paris', 'Lisbon', '', 'Berlin']
Info_3 = ['Tall', 'Paris', 'Hi', 'Small', 'Thanks']
Data = [123, 268, 76, 909, 87]
Sex = ['F', 'M', 'M', 'M', 'M']
df = pd.DataFrame({'Name': Name, 'Age': Age, 'Info_1': Info_1, 'Info_2': Info_2, 'Info_3': Info_3, 'Data': Data, 'Sex': Sex})
# Concatenate the four columns, then deduplicate the words in each row
df['Info'] = df.Info_1 + ' ' + df.Info_2 + ' ' + df.Info_3 + ' ' + df.Data.astype(str)
df['Info'] = df.Info.apply(remove_dup_words)
print(df)
     Name  Age Info_1    Info_2  Info_3  Data Sex                Info
0    Lolo   19   Tall  New York    Tall   123   F   Tall New York 123
1    Mike   34  Large     Paris   Paris   268   M     Large Paris 268
2  Tobias   13  Small    Lisbon      Hi    76   M  Small Lisbon Hi 76
3    Luke   45  Small             Small   909   M           Small 909
4     Sam   52           Berlin  Thanks    87   M    Berlin Thanks 87
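One caveat worth keeping in mind with this approach: splitting on single spaces deduplicates individual words rather than whole cell values, and an empty cell leaves an empty token behind, so stray spaces can remain in the result. The sketch below illustrates the difference on a made-up row (not taken from the question's data); `remove_dup_values` is a hypothetical alternative that deduplicates whole values, not something proposed in the answer:

```python
import pandas as pd


def remove_dup_words(s):
    # Function from the answer above: deduplicates individual words.
    words = s.split(' ')
    unique_words = pd.Series(words).drop_duplicates().tolist()
    return ' '.join(unique_words)


def remove_dup_values(values):
    # Hypothetical alternative: deduplicate whole cell values, skip empty ones.
    return ' '.join(dict.fromkeys(v for v in values if v != ''))


# Made-up row in which two different values share the word 'New'.
row = ['New York', '', 'New Jersey', '909']

word_level = remove_dup_words(' '.join(row))
value_level = remove_dup_values(row)

print(repr(word_level))   # 'New York  Jersey 909' (second 'New' lost, double space kept)
print(repr(value_level))  # 'New York New Jersey 909'
```

For the question's sample data the two approaches differ only in whitespace (rows 3 and 4 contain an empty cell), so which one fits depends on whether multi-word values can share words.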