Processing a list of strings to remove duplicates and add corresponding values
【Title】: processing a list of strings to remove duplicates and add corresponding value
【Posted】: 2018-03-28 22:38:56
【Question】: I have a CSV with about 10k (10,000) rows that look like this:
1: ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277']
...
N: ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70']
I need to merge the duplicated names and sum their values, for example:
['Andhra Pradesh-133', 'Meetai-5781'] // 5781 = 1358 + 2146 + 2277
Can anyone suggest a fast way to do this?
【Comments】:
【Answer 1】: Use a list comprehension with groupby:
from itertools import groupby
import pandas as pd

df = pd.DataFrame({'a': [['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'],
                         ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70']]})
data = []
for x in df['a']:
    # split each 'name-value' string into [name, value]
    b = [a.split('-') for a in x]
    # groupby merges consecutive items with the same name; sum their values
    L = [k + '-' + str(sum(int(j) for i, j in g))
         for k, g in groupby(b, key=lambda x: x[0])]
    data.append(L)
print (data)
[['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']]
df['b'] = data
print (df)
                                                   a  \
0  [Andhra Pradesh-133, Meetai-1358, Meetai-2146,...
1    [Andhra Pradesh-20, Rajasthan-60, Rajasthan-70]

                                    b
0   [Andhra Pradesh-133, Meetai-5781]
1  [Andhra Pradesh-20, Rajasthan-130]
Edit:
A solution for when the input is a file:
from itertools import groupby

data = []
with open('file.csv') as f:
    for line in f:
        # strip new-line characters, quotes and the closing ], then take the part after [
        items = line.strip('\r\n" ]').split('[')[1]
        # split on commas, remove quotes and whitespace
        items = [item.strip("' ") for item in items.split(',')]
        # split each 'name-value' string into [name, value]
        items = [a.split('-') for a in items]
        # sum the values of consecutive items with the same name
        items = [k + '-' + str(sum(int(j) for i, j in g))
                 for k, g in groupby(items, key=lambda x: x[0])]
        data.append(items)
print (data)
[['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']]
Edit:
You need to split on the first occurrence of [ only, and then strip any remaining [ and ] characters:
from itertools import groupby

data = []
with open('file.csv') as f:
    for line in f:
        # strip new-line characters, then split on the first [ only and take what follows
        items = line.strip('\r\n" ]').split('[', 1)[1]
        # split on commas, remove quotes, brackets and whitespace
        items = [item.strip("'[] ") for item in items.split(',')]
        # split each 'name-value' string into [name, value]
        items = [a.split('-') for a in items]
        print (items)
        # sum the values of consecutive items with the same name
        items = [k + '-' + str(sum(int(j) for i, j in g))
                 for k, g in groupby(items, key=lambda x: x[0])]
        data.append(items)
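As an alternative to stripping characters by hand, the bracketed part of each line can be parsed as a Python list literal. This is only a sketch of a different technique, not the answer's method: it assumes every line contains exactly one well-formed list such as ['a-1', 'b-2'], and the helper name parse_line and the in-memory sample standing in for file.csv are hypothetical.

```python
import ast
import io

def parse_line(line):
    # Hypothetical helper: take the text from the first '[' to the last ']'
    # and let ast.literal_eval turn it into a real list of strings.
    start = line.index('[')
    end = line.rindex(']') + 1
    return ast.literal_eval(line[start:end])

# In-memory stand-in for file.csv (assumed layout: one quoted list per line).
sample = io.StringIO(
    "\"['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277']\"\n"
    "\"['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70']\"\n"
)

# Skip any line that has no list on it at all.
parsed = [parse_line(line) for line in sample if '[' in line]
print(parsed)
```

If the real file uses a different quoting or layout, literal_eval will raise an error instead of silently producing odd strings, which can make malformed rows easier to spot.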
【Comments】:
I have a small doubt here. If I take x=[['panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51'], groupby does not seem to work. With b = [a.split('-') for a in x], the loop for k, g in groupby(b, key=lambda x: x[0]): does not group by 'Uttar Pradesh', even though the strings are exactly the same. Can you help me see what is missing?
I think the problem is the double [[. I edited the answer.
Apologies for the typo; x=['panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51'] is the list I want to process.
You need sorted, because groupby only merges consecutive items that share a key:
x = ['panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51']
items = [a.split('-') for a in x]
a = [k + '-' + str(sum(int(j) for i, j in g)) for k, g in groupby(sorted(items), key=lambda x: x[0])]
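The effect of sorting can be seen by running the same grouping on the unsorted and the sorted list. This is a small sketch using the list from the comment above; the helper name sum_groups is made up for illustration.

```python
from itertools import groupby

x = ['panjim-20', 'Uttar Pradesh-23185', 'Gujurat-1013', 'Uttar Pradesh-51']
items = [a.split('-') for a in x]

def sum_groups(pairs):
    # groupby starts a new group every time the key changes,
    # so equal names are merged only when they are adjacent.
    return [k + '-' + str(sum(int(v) for _, v in g))
            for k, g in groupby(pairs, key=lambda p: p[0])]

unsorted_result = sum_groups(items)
sorted_result = sum_groups(sorted(items, key=lambda p: p[0]))

print(unsorted_result)  # the two 'Uttar Pradesh' entries stay separate
print(sorted_result)    # ['Gujurat-1013', 'Uttar Pradesh-23236', 'panjim-20']
```

Note that sorting changes the order of the names in the output; if the original order matters, a dictionary keyed by name is one way to avoid the sort.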
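【Answer 2】:
I would create a dictionary for each row. Parse each string into its name and number, either by splitting or with a regular expression. The name, e.g. 'Andhra Pradesh', is the key and the value is an int. Add each number to the value of the dict entry selected by its name.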
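The dictionary approach described above could look roughly like this. It is a minimal sketch, not the answerer's own code: the function name merge_row is hypothetical, and splitting on the last hyphen with rsplit is an assumption made so that a name containing a hyphen would still parse.

```python
def merge_row(row):
    # One dictionary per row: name -> running total.
    totals = {}
    for entry in row:
        # Assumption: the number always follows the last '-' in the string.
        name, value = entry.rsplit('-', 1)
        totals[name] = totals.get(name, 0) + int(value)
    # Dicts keep insertion order (Python 3.7+), so names stay in first-seen order.
    return [name + '-' + str(total) for name, total in totals.items()]

rows = [
    ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'],
    ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70'],
]
merged = [merge_row(row) for row in rows]
print(merged)
# [['Andhra Pradesh-133', 'Meetai-5781'], ['Andhra Pradesh-20', 'Rajasthan-130']]
```

Unlike groupby, this does not require equal names to be adjacent, so no sorting is needed.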
【Comments】:
【Answer 3】: In pandas you can do:
In [3475]: L = ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277']
In [3476]: s = (pd.DataFrame(x.split('-') for x in L)
.assign(v=lambda x: x[1].astype(int))
.groupby(0)['v'].sum())
In [3478]: (s.index + '-' + s.values.astype(str)).tolist()
Out[3478]: ['Andhra Pradesh-133', 'Meetai-5781']
Details
In [3480]: pd.DataFrame(x.split('-') for x in L)
Out[3480]:
0 1
0 Andhra Pradesh 133
1 Meetai 1358
2 Meetai 2146
3 Meetai 2277
Column 1 is of type str, so we assign a new column v of type int:
In [3481]: pd.DataFrame(x.split('-') for x in L).assign(v=lambda x: x[1].astype(int))
Out[3481]:
0 1 v
0 Andhra Pradesh 133 133
1 Meetai 1358 1358
2 Meetai 2146 2146
3 Meetai 2277 2277
In [3479]: s
Out[3479]:
0
Andhra Pradesh 133
Meetai 5781
Name: v, dtype: int32
【Comments】:
【Answer 4】: Not sure whether this is the fastest way, but it works for me:
data = [
    ['Andhra Pradesh-133', 'Meetai-1358', 'Meetai-2146', 'Meetai-2277'],
    ['Andhra Pradesh-20', 'Rajasthan-60', 'Rajasthan-70']
]
# one dictionary for all rows, so totals accumulate across rows
values = {}
for row in data:
    for x in row:
        tokens = x.split('-')
        values[tokens[0]] = int(tokens[1]) if tokens[0] not in values else values[tokens[0]] + int(tokens[1])
out = [x + '-' + str(y) for x, y in values.items()]
print(out)  # prints: ['Andhra Pradesh-153', 'Meetai-5781', 'Rajasthan-130']
【Comments】: