Python/CSV unique rows with unique values per row in a column
Posted: 2016-09-16 13:49:10
Question:
I have this dataset (the data is fictional):
cat sample.csv
id,fname,lname,education,gradyear,attributes
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,mit,2003,qa
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,harvard,2007,"test|admin,test"
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,harvard,2007,"test|admin,test"
"6F9619FF-8B86-D011-B42D-00C04FC964FF",john,smith,ft,2012,NULL
"6F9619FF-8B86-D011-B42D-00C04FC964F1",john,doe,htw,2000,dev
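When I run this script, it parses the CSV, groups rows that share the same id, and joins the values of each column with "|" whenever a group has more than one row:
parse-csv.py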
import itertools
from itertools import groupby
import csv
import pprint
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='sql dump parser')
    parser.add_argument('-i','--input', help='input file', required=True)
    parser.add_argument('-o','--output', help='output file', required=True)
    args = parser.parse_args()
    inputf = args.input
    outputf = args.output
    t = csv.reader(open(inputf, 'rb'))
    t = list(t)

    def join_rows(rows):
        return [(e[0] if i < 1 else '|'.join(e)) for (i, e) in enumerate(zip(*rows))]

    myfile = open(outputf, 'wb')
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL, lineterminator='\n')
    for name, rows in groupby(t, lambda x:x[0]):
        wr.writerow(join_rows(rows))
        #print join_rows(rows)
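To illustrate the joining step on its own, here is a minimal Python 3 sketch (not the script itself). The two rows come from sample.csv, but the ids are shortened to "6F96" for readability, which is an assumption made only for this example. zip(*rows) transposes the group so that each tuple holds one column; the first column is kept once and every other column is joined with "|".

```python
def join_rows(rows):
    # zip(*rows) turns a list of rows into a sequence of columns.
    return [(col[0] if i < 1 else '|'.join(col))
            for i, col in enumerate(zip(*rows))]

# Two rows of one group; "6F96" is a shortened, illustrative id.
group = [
    ["6F96", "john", "smith", "mit", "2003", "qa"],
    ["6F96", "john", "smith", "harvard", "2007", "test|admin,test"],
]

joined = join_rows(group)
print(joined)
```

Note that repeated values such as "john|john" survive this step, which is why a second, de-duplicating pass is needed.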
Then a second script makes sure that each column contains only unique values, delimited by "|":
unique.py
import csv
import sys
from collections import OrderedDict
import argparse

csv.field_size_limit(sys.maxsize)

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='sql dump parser - unique')
    parser.add_argument('-i','--input', help='input file', required=True)
    parser.add_argument('-o','--output', help='output file', required=True)
    args = parser.parse_args()
    inputf = args.input
    outputf = args.output
    with open(inputf) as fin, open(outputf, 'wb') as fout:
        csvin = csv.DictReader(fin)
        csvout = csv.DictWriter(fout, fieldnames=csvin.fieldnames, quoting=csv.QUOTE_ALL,lineterminator='\n')
        csvout.writeheader()
        for row in csvin:
            for k, v in row.items():
                row[k] = '|'.join(OrderedDict.fromkeys(v.split('|')))
            csvout.writerow(row)
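The de-duplication step can be looked at in isolation. The sketch below is Python 3 and wraps the same OrderedDict.fromkeys expression in a helper named unique_joined, a hypothetical name used only for this example. OrderedDict.fromkeys keeps the first occurrence of each key in its original order, so the joined result contains each value once.

```python
from collections import OrderedDict

def unique_joined(value, sep='|'):
    # Split on the separator, drop repeats while keeping first-seen order,
    # then join the remaining values again.
    return sep.join(OrderedDict.fromkeys(value.split(sep)))

names = unique_joined('john|john|john')
schools = unique_joined('mit|harvard|harvard|ft')
print(names)
print(schools)
```

This only removes duplicates inside a single cell. It cannot merge two separate rows that share an id, so it depends on the first script having grouped the rows correctly.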
It works for sample.csv.
Output:
$ python parse-csv.py -i sample.csv -o sample-out.csv
$ python unique.py -i sample-out.csv -o sample-final.csv
$ cat sample-final.csv
"id","fname","lname","education","gradyear","attributes"
"6F9619FF-8B86-D011-B42D-00C04FC964FF","john","smith","mit|harvard|ft","2003|2007|2012","qa|test|admin,test|NULL"
"6F9619FF-8B86-D011-B42D-00C04FC964F1","john","doe","htw","2000","dev"
But when I do the same for this file (the dataset is fictional):
sample2.csv
id,lastname,firstname,middlename,address1,address2,city,zipcode,city2,zipcode2,emailaddress,website
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J",NULL,NULL,NULL,NULL,NULL,NULL,"",NULL
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.",NULL,NULL,NULL,NULL,NULL,NULL,NULL,"mait@yahoo.com",NULL
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J",NULL,NULL,NULL,NULL,NULL,NULL,"",NULL
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.",NULL,NULL,NULL,NULL,NULL,NULL,NULL,"mait@yahoo.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul",NULL,"","","","",NULL,NULL,"psd@gmail.com",NULL
The output is:
$ python parse-csv.py -i sample2.csv -o sample2-out.csv
$ python unique.py -i sample2-out.csv -o sample2-final.csv
$ cat sample2-final.csv
"id","lastname","firstname","middlename","address1","address2","city","zipcode","city2","zipcode2","emailaddress","website"
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL"
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J","NULL","NULL","NULL","NULL","NULL","NULL","","NULL"
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.","NULL","NULL","NULL","NULL","NULL","NULL","NULL","mait@yahoo.com","NULL"
"E387F3C1-F6E9-40DD-86AB-A7149C67F61C","Technical Support","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL","NULL"
"648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J","NULL","NULL","NULL","NULL","NULL","NULL","","NULL"
"A94FAD4E-27DB-48FE-B89E-C37B408C5DD5","Mait","A.V.","NULL","NULL","NULL","NULL","NULL","NULL","NULL","mait@yahoo.com","NULL"
"FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul","NULL","","","","","NULL","NULL","psd@gmail.com","NULL"
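Why doesn't it produce unique rows and columns here, as it does for sample.csv?
Does anyone have any ideas?
Thanks in advance! I have been chewing on this for a long time...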
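Comments:
You have exactly identical "FDFCA22A-EE19-4997-B892-90B2006FE328","Drago","Paul" rows. Shouldn't they be merged into one row, since the data is identical?
Yes, that one works, but why doesn't it work for the other rows, e.g. "648EEB5D-0586-444A-B86F-4EB2446BBC93","Palm","Samuel","J"? I'm confused... it works for sample.csv.
The same goes for the Palm Samuel rows...
Yes, if a person has many rows, the output should contain a single row for that person, with only unique values in each column.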
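Answer 1:
Your first file is sorted by id, while the second one is not. itertools.groupby only groups consecutive rows that share the same key, so rows with the same id that are not adjacent end up in separate groups. Please see this discussion.
You just need this: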
t = list(t)
t[1:] = sorted(t[1:])
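A small Python 3 sketch of why the sort matters. The ids are shortened, illustrative stand-ins for the first six data rows of sample2.csv, in the same order. Without sorting, groupby yields one group per run of equal keys, so every id shows up twice; after sorting, each id forms a single group.

```python
from itertools import groupby

# Shortened ids in the order they appear in sample2.csv (illustrative values).
ids = ["E387", "648E", "A94F", "E387", "648E", "A94F"]

# groupby starts a new group every time the key changes.
unsorted_groups = [(key, len(list(rows))) for key, rows in groupby(ids)]
sorted_groups = [(key, len(list(rows))) for key, rows in groupby(sorted(ids))]

print(unsorted_groups)  # six groups of one row each
print(sorted_groups)    # three groups of two rows each
```

The "Drago" rows in sample2.csv were merged correctly only because they happen to be adjacent in the file.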
Comments:
No problem, glad it helped!
Answer 2:
Here is my simple solution to your problem (as I understand it), using a dictionary:
import csv

with open("sample2.csv") as f:
    t = list(csv.reader(f))

def parsecsv(data):
    # Assumes that the first column is the unique id, that the first
    # row contains the column titles, and that all rows have the same # of columns
    L = len(data[0])
    csvDict = {}
    for entry in data[1:]:  # build a dict csvDict to represent data (skip the title row)
        if entry[0] in csvDict:  # already have entry so add to it...
            for i in range(L - 1):  # loop through the non-id columns
                if csvDict[entry[0]][i] != 'NULL':  # check if data exists in column
                    # note: "not in" is a substring test on the joined string
                    if (entry[i + 1] not in csvDict[entry[0]][i]) and (entry[i + 1] != 'NULL'):
                        csvDict[entry[0]][i] += '|' + entry[i + 1]
                else:
                    csvDict[entry[0]][i] = entry[i + 1]
        else:
            csvDict[entry[0]] = [None]*(L - 1)
            for i in range(L - 1):  # loop through the non-id columns
                csvDict[entry[0]][i] = entry[i + 1]
    return csvDict

out = parsecsv(t)
for entry in out:
    print(entry + ' = ' + str(out[entry]))
This should work regardless of whether the dataset is sorted.
Let me know if it helps!
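For comparison, here is a Python 3 sketch of the same dictionary idea that does the grouping and the de-duplication in one pass, so no sorting is needed. It is a variant written for illustration, not the code from either answer: merge_rows is a hypothetical helper name, the sample rows are a shortened excerpt in the style of sample2.csv, and values that already contain "|" are treated as whole values rather than split.

```python
import csv
import io
from collections import OrderedDict

def merge_rows(rows, sep='|'):
    """Merge rows sharing the first column; keep unique values per column."""
    merged = OrderedDict()  # id -> list with one ordered set of values per column
    for row in rows:
        columns = merged.setdefault(row[0], [OrderedDict() for _ in row[1:]])
        for seen, value in zip(columns, row[1:]):
            seen[value] = None  # dict keys act as an insertion-ordered set
    return [[key] + [sep.join(seen) for seen in columns]
            for key, columns in merged.items()]

# Unsorted input with repeated ids (shortened, illustrative data).
sample = io.StringIO(
    'id,lastname,firstname\n'
    '"E387","Technical Support",NULL\n'
    '"648E","Palm","Samuel"\n'
    '"E387","Technical Support",NULL\n'
    '"648E","Palm","Samuel"\n'
)
reader = csv.reader(sample)
header = next(reader)        # keep the title row out of the merge
result = merge_rows(reader)
for row in result:
    print(row)
```

Each id appears once in the result, in the order it was first seen, which is the behaviour the question asks for even when the input is not sorted.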