Python - Find duplicates in a list of dictionaries and group them
[Title]: Python - Find duplicates in a list of dictionaries and group them
[Posted]: 2013-10-01 09:52:20
[Question]: I'm not a programmer and I'm new to Python. I have a list of dicts that comes from a JSON file:
# JSON file (film.json)
[{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["19,00"]},
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fiction"], "price": ["20,00"]},
{"year": ["2003"], "director": ["Tarantino"], "film": ["Kill Bill vol.1"], "price": ["10,00"]},
{"year": ["2003"], "director": ["Wachowski"], "film": ["The Matrix Reloaded"], "price": ["9,99"]},
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fyction"], "price": ["15,00"]},
{"year": ["1994"], "director": ["E. de Souza"], "film": ["Street Fighter"], "price": ["2,00"]},
{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["20,00"]},
{"year": ["1982"], "director": ["Ridley Scott"], "film": ["Blade Runner"], "price": ["19,99"]}]
I can import the JSON file with:
import json
json_file = open('film.json')
f = json.load(json_file)
But after that I can't work out how to find the duplicate entries in f and group them by film title.
This is what I want to achieve:
## result grouped by 'film'
#group 1
{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["19,00"]}
{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["20,00"]}
#group 2
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fiction"], "price": ["20,00"]}
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fyction"], "price": ["15,00"]}
#group X
...
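Or, better:
new_dict = {'group1':[[],[],...] , 'group2':[[],[],...] , 'groupX':[...]}
At the moment I'm experimenting with nested for loops, but without any luck.
Thanks.
Note: "Pulp Fyction" is a deliberate error, for a future implementation using fuzzy string matching; for now I just need a "duplicate grouper".
Note 2: using Python 2.x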
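[Comments]:
What are you grouping on? The title alone? Title + director + year?
docs.python.org/2/library/itertools.html#itertools.groupby
Why not name your groups after the film?
@wim I'm grouping the whole dict row (title, director, year, price) by the value in the 'film' key. So yes, the title only.
[Answer 1]: As your data is not sorted, use a collections.defaultdict() object to materialize a list for each new key, then key by film title: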
from collections import defaultdict
grouped = defaultdict(list)
for film in f:
    grouped[film['film'][0]].append(film)
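The film['film'][0] value is used as the key to group the films. If you want more sophisticated title grouping, you have to create a canonical version of that key.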
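As a minimal sketch of such a canonical key, assuming titles should match regardless of case and of extra whitespace (the helper name canonical_title and the inline sample data are made up for illustration, not taken from the question):

```python
from collections import defaultdict

def canonical_title(title):
    # Hypothetical normalisation: lowercase and collapse runs of whitespace.
    return ' '.join(title.lower().split())

# Small inline sample in the same shape as film.json (assumed data).
films = [
    {"film": ["The Matrix"], "price": ["19,00"]},
    {"film": ["the  matrix "], "price": ["20,00"]},
    {"film": ["Blade Runner"], "price": ["19,99"]},
]

grouped_canonical = defaultdict(list)
for film in films:
    # Group on the normalised title instead of the raw string.
    grouped_canonical[canonical_title(film['film'][0])].append(film)
```

Both spellings of "The Matrix" land under the key 'the matrix'; the original titles remain available inside each grouped dict.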
Demo:
>>> from collections import defaultdict
>>> import json
>>> with open('film.json') as film_file:
...     f = json.load(film_file)
...
>>> grouped = defaultdict(list)
>>> for film in f:
...     grouped[film['film'][0]].append(film)
...
>>> grouped
defaultdict(<type 'list'>, {u'Street Fighter': [{u'director': [u'E. de Souza'], u'price': [u'2,00'], u'film': [u'Street Fighter'], u'year': [u'1994']}], u'Pulp Fiction': [{u'director': [u'Tarantino'], u'price': [u'20,00'], u'film': [u'Pulp Fiction'], u'year': [u'1994']}], u'Pulp Fyction': [{u'director': [u'Tarantino'], u'price': [u'15,00'], u'film': [u'Pulp Fyction'], u'year': [u'1994']}], u'The Matrix': [{u'director': [u'Wachowski'], u'price': [u'19,00'], u'film': [u'The Matrix'], u'year': [u'1999']}, {u'director': [u'Wachowski'], u'price': [u'20,00'], u'film': [u'The Matrix'], u'year': [u'1999']}], u'Blade Runner': [{u'director': [u'Ridley Scott'], u'price': [u'19,99'], u'film': [u'Blade Runner'], u'year': [u'1982']}], u'Kill Bill vol.1': [{u'director': [u'Tarantino'], u'price': [u'10,00'], u'film': [u'Kill Bill vol.1'], u'year': [u'2003']}], u'The Matrix Reloaded': [{u'director': [u'Wachowski'], u'price': [u'9,99'], u'film': [u'The Matrix Reloaded'], u'year': [u'2003']}]})
>>> from pprint import pprint
>>> pprint(dict(grouped))
{u'Blade Runner': [{u'director': [u'Ridley Scott'],
                    u'film': [u'Blade Runner'],
                    u'price': [u'19,99'],
                    u'year': [u'1982']}],
 u'Kill Bill vol.1': [{u'director': [u'Tarantino'],
                       u'film': [u'Kill Bill vol.1'],
                       u'price': [u'10,00'],
                       u'year': [u'2003']}],
 u'Pulp Fiction': [{u'director': [u'Tarantino'],
                    u'film': [u'Pulp Fiction'],
                    u'price': [u'20,00'],
                    u'year': [u'1994']}],
 u'Pulp Fyction': [{u'director': [u'Tarantino'],
                    u'film': [u'Pulp Fyction'],
                    u'price': [u'15,00'],
                    u'year': [u'1994']}],
 u'Street Fighter': [{u'director': [u'E. de Souza'],
                      u'film': [u'Street Fighter'],
                      u'price': [u'2,00'],
                      u'year': [u'1994']}],
 u'The Matrix': [{u'director': [u'Wachowski'],
                  u'film': [u'The Matrix'],
                  u'price': [u'19,00'],
                  u'year': [u'1999']},
                 {u'director': [u'Wachowski'],
                  u'film': [u'The Matrix'],
                  u'price': [u'20,00'],
                  u'year': [u'1999']}],
 u'The Matrix Reloaded': [{u'director': [u'Wachowski'],
                           u'film': [u'The Matrix Reloaded'],
                           u'price': [u'9,99'],
                           u'year': [u'2003']}]}
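The grouping above keeps every title, including those that occur only once. If only the actual duplicates are wanted, one possible follow-up is to keep the groups with more than one entry. This is a sketch on a small inline sample rather than the full film.json; the name duplicates is an assumption for illustration:

```python
from collections import defaultdict

# Inline sample in the same shape as film.json (assumed data).
f = [
    {"film": ["The Matrix"], "price": ["19,00"]},
    {"film": ["Pulp Fiction"], "price": ["20,00"]},
    {"film": ["The Matrix"], "price": ["20,00"]},
]

grouped = defaultdict(list)
for film in f:
    grouped[film['film'][0]].append(film)

# Keep only the titles that appear more than once.
duplicates = {title: entries
              for title, entries in grouped.items()
              if len(entries) > 1}
```

Here duplicates holds a single key, 'The Matrix', with both of its entries in input order.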
Grouping the films by their Soundex code is just as simple:
from itertools import groupby, islice

# Soundex digit groups: position in the tuple (1-based) is the digit.
_codes = ('bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r')
_sounds = {c: str(i) for i, code in enumerate(_codes, 1) for c in code}
# Vowels map to None so they separate repeated consonant codes.
_sounds.update(dict.fromkeys('aeiouy'))

def soundex(word, _sounds=_sounds):
    grouped = groupby(_sounds[c] for c in word.lower() if c in _sounds)
    if _sounds.get(word[0].lower()):
        next(grouped)  # remove first group.
    sdx = ''.join([k for k, g in islice((g for g in grouped if g[0]), 3)])
    return word[0].upper() + format(sdx, '<03')

grouped_by_soundex = defaultdict(list)
for film in f:
    grouped_by_soundex[soundex(film['film'][0])].append(film)
This results in:
>>> pprint(dict(grouped_by_soundex))
{u'B436': [{u'director': [u'Ridley Scott'],
            u'film': [u'Blade Runner'],
            u'price': [u'19,99'],
            u'year': [u'1982']}],
 u'K414': [{u'director': [u'Tarantino'],
            u'film': [u'Kill Bill vol.1'],
            u'price': [u'10,00'],
            u'year': [u'2003']}],
 u'P412': [{u'director': [u'Tarantino'],
            u'film': [u'Pulp Fiction'],
            u'price': [u'20,00'],
            u'year': [u'1994']},
           {u'director': [u'Tarantino'],
            u'film': [u'Pulp Fyction'],
            u'price': [u'15,00'],
            u'year': [u'1994']}],
 u'S363': [{u'director': [u'E. de Souza'],
            u'film': [u'Street Fighter'],
            u'price': [u'2,00'],
            u'year': [u'1994']}],
 u'T536': [{u'director': [u'Wachowski'],
            u'film': [u'The Matrix'],
            u'price': [u'19,00'],
            u'year': [u'1999']},
           {u'director': [u'Wachowski'],
            u'film': [u'The Matrix Reloaded'],
            u'price': [u'9,99'],
            u'year': [u'2003']},
           {u'director': [u'Wachowski'],
            u'film': [u'The Matrix'],
            u'price': [u'20,00'],
            u'year': [u'1999']}]}
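[Answer 2]: If this were a one-off and I were in a hurry, this is how I would do it. In this example, assume your list of dicts is called lod, and that the film title is always a list containing exactly one item: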
new_dict = {k: [d for d in lod if d.get('film')[0] == k] for k in set(d.get('film')[0] for d in lod)}
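To make it more readable and to explain what it is doing, here is the same thing broken out into steps, again with the list of dicts called lod: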
# get all the unique film names
# note: the [0] is because the title is a list, and set doesn't work with lists,
# so we're just taking the first item for this example.
films = set(d.get('film')[0] for d in lod)
# create a dictionary
new_dict = {}
# iterate over the unique film names
for k in films:
    # make a list of all the films that match the name we're on
    filmswiththisname = [d for d in lod if d.get('film')[0] == k]
    # add the list of films to the new dictionary with the film name as the key.
    new_dict[k] = filmswiththisname
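The loop above rescans lod once for every unique title, which is fine for a one-off but grows quickly on long lists. As a hedged alternative, the same dictionary can be built in a single pass with dict.setdefault. The sample lod below is assumed data in the same shape as the question's:

```python
# Inline sample in the same shape as film.json (assumed data).
lod = [
    {"film": ["The Matrix"], "price": ["19,00"]},
    {"film": ["Blade Runner"], "price": ["19,99"]},
    {"film": ["The Matrix"], "price": ["20,00"]},
]

new_dict = {}
for d in lod:
    # setdefault returns the existing list for this title,
    # or stores and returns a new empty list the first time it is seen.
    new_dict.setdefault(d['film'][0], []).append(d)
```

The result has the same keys and values as the comprehension version, and the entries under each title keep their original order.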