Python: compare 3 columns in 2 csv files and output if they are equal
【Posted】: 2018-03-06 14:54:19
【Question】: So I have two CSV files that I am trying to compare to get the results for the matching items. The first file, hosts.csv, looks like this:
Path Filename Size Signature
C:\ a.txt 14kb 012345
D:\ b.txt 99kb 678910
C:\ c.txt 44kb 111213
The second file, masterlist.csv, looks like this:
Filename Signature
b.txt 678910
x.txt 111213
b.txt 777777
c.txt 999999
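As you can see, the rows do not line up, and masterlist.csv is always larger than hosts.csv. The only part I want to search on is the Signature column. I know this would look something like:
hosts[3] == masterlist[1]
I am looking for a solution that gives me something like the following (basically the hosts.csv file with a new RESULTS column):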
Path Filename Size Signature RESULTS
C:\ a.txt 14kb 012345 NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
I have searched the posts and found something similar here, but I don't quite understand it, as I am still learning Python.
EDIT: Using Python 3.5
【Answer 1】: I always prefer pandas DataFrames for this kind of thing, because pandas offers a wide range of functions for storing and editing .csv files.
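import pandas as pd

# read_csv with index_col=0 replaces DataFrame.from_csv, which was removed in pandas 1.0
df = pd.read_csv('1.csv', index_col=0)
df2 = pd.read_csv('2.csv', index_col=0)
df['result'] = '0'  # string placeholder, so the column holds strings throughout
for i in range(len(df['signature'])):  # range instead of the Python 2 xrange
    for j in range(len(df2['signature'])):
        if df['signature'][i] == df2['signature'][j]:
            # record every row of 2.csv that carries this signature
            df.loc[i, 'result'] = "found in '2.csv' at row " + str(
                df2.signature[df2.signature == df2['signature'][j]].index.tolist())
            break
df.to_csv('out.csv')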
Here 1.csv is hosts.csv and 2.csv is masterlist.csv, and the whole output is saved as out.csv. The output looks like this:
path filename signature result
0 C:\ a.txt 12345 0
1 D:\ b.txt 678910 found in '2.csv' at row [0]
2 C:\ c.txt 111213 found in '2.csv' at row [1, 4]
My .csv files look like this.
First: 1.csv
path filename signature
0 C:\ a.txt 12345
1 D:\ b.txt 678910
2 C:\ c.txt 111213
Second: 2.csv
filename signature
0 b.txt 678910
1 x.txt 111213
2 b.txt 777777
3 c.txt 999999
4 b.txt 111213
This way I can see whether a signature occurs more than once in 2.csv, and record where each occurrence can be found.
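The nested loop compares every host row with every master row. As a sketch of the same idea with a single pass over 2.csv, the signature positions can be collected in a dictionary first and then looked up once per host row. The helper name `annotate_hosts`, the "not found" label, and the inline sample frames (which mirror 1.csv and 2.csv) are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd


def annotate_hosts(hosts, master):
    """Return a copy of `hosts` with a 'result' column (hypothetical helper)."""
    # Build signature -> list of row positions in `master` in one pass.
    rows_by_signature = {}
    for pos, sig in enumerate(master["signature"]):
        rows_by_signature.setdefault(sig, []).append(pos)

    def describe(sig):
        rows = rows_by_signature.get(sig)
        if rows:
            return "found in '2.csv' at row {}".format(rows)
        return "not found"

    out = hosts.copy()
    out["result"] = out["signature"].map(describe)
    return out


# Inline stand-ins for 1.csv and 2.csv (assumed sample data).
hosts = pd.DataFrame({
    "path": ["C:\\", "D:\\", "C:\\"],
    "filename": ["a.txt", "b.txt", "c.txt"],
    "signature": [12345, 678910, 111213],
})
master = pd.DataFrame({
    "filename": ["b.txt", "x.txt", "b.txt", "c.txt", "b.txt"],
    "signature": [678910, 111213, 777777, 999999, 111213],
})

annotated = annotate_hosts(hosts, master)
print(annotated)
```

With real files, the two frames would come from `pd.read_csv` instead of being built inline.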
【Answer 2】: A solution using csv.DictReader and csv.DictWriter objects:
import csv

with open('hosts.csv', 'r') as hosts, open('masterlist.csv', 'r') as mlist, \
        open('result.csv', 'w', newline='') as res:
    host_reader = csv.DictReader(hosts, delimiter=' ', skipinitialspace=True)
    mlist_reader = csv.DictReader(mlist, delimiter=' ', skipinitialspace=True)
    writer = csv.DictWriter(res, fieldnames=host_reader.fieldnames + ['Result'], delimiter='\t')
    # map each signature to its data-row number in masterlist.csv
    mlist_data = {r['Signature']: mlist_reader.line_num - 1 for r in mlist_reader}
    fmt = '{0}FOUND in masterlist{1}'  # preparing output format for `Result` field
    writer.writeheader()  # writing header
    for r in host_reader:
        if r['Signature'] in mlist_data:
            r['Result'] = fmt.format("", " (row " + str(mlist_data[r['Signature']]) + ")")
        else:
            r['Result'] = fmt.format("NOT ", "")
        writer.writerow(r)
Contents of result.csv:
Path Filename Size Signature Result
C:\ a.txt 14kb 012345 NOT FOUND in masterlist
D:\ b.txt 99kb 678910 FOUND in masterlist (row 1)
C:\ c.txt 44kb 111213 FOUND in masterlist (row 2)
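Because mlist_data keeps a single line number per signature, a signature that appears several times in the master list is reported only at its last row. If every occurrence matters, one possible variation (a sketch, not the answerer's code) collects the row numbers in a list. The function name `annotate`, the in-memory `io.StringIO` files, and the extra duplicate line at the end of the sample master list are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def annotate(hosts_file, master_file, out_file):
    """Copy host rows to out_file, adding a Result column (hypothetical helper)."""
    host_reader = csv.DictReader(hosts_file, delimiter=' ', skipinitialspace=True)
    master_reader = csv.DictReader(master_file, delimiter=' ', skipinitialspace=True)

    # signature -> every 1-based data row where it appears in the master list
    rows_by_signature = defaultdict(list)
    for row_number, row in enumerate(master_reader, start=1):
        rows_by_signature[row['Signature']].append(row_number)

    writer = csv.DictWriter(out_file, fieldnames=host_reader.fieldnames + ['Result'],
                            delimiter='\t', lineterminator='\n')
    writer.writeheader()
    for row in host_reader:
        rows = rows_by_signature.get(row['Signature'])
        if rows:
            row['Result'] = 'FOUND in masterlist (row {})'.format(
                ', '.join(str(n) for n in rows))
        else:
            row['Result'] = 'NOT FOUND in masterlist'
        writer.writerow(row)


# Sample data from the question, plus one assumed duplicate signature (last line).
hosts_text = ("Path Filename Size Signature\n"
              "C:\\ a.txt 14kb 012345\n"
              "D:\\ b.txt 99kb 678910\n"
              "C:\\ c.txt 44kb 111213\n")
master_text = ("Filename Signature\n"
               "b.txt 678910\n"
               "x.txt 111213\n"
               "b.txt 777777\n"
               "c.txt 999999\n"
               "b.txt 111213\n")

out = io.StringIO()
annotate(io.StringIO(hosts_text), io.StringIO(master_text), out)
print(out.getvalue())
```

The signatures stay strings here, so a leading zero such as the one in 012345 is preserved in the output.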
【Answer 3】: You can try this:
import csv

# Assumes both files are comma-separated and start with a header row.
with open('masterlist.csv', newline='') as f:
    masterlist = list(csv.reader(f))[1:]  # skip the header row
with open('hosts.csv', newline='') as f:
    host = list(csv.reader(f))[1:]  # skip the header row
# zip(*masterlist) turns the rows into columns, one tuple per column name
masterlist_dict = {a: b for a, b in zip(["Filename", "Signature"], zip(*masterlist))}
final_result = [["Path", "Filename", "Size", "Signature", "RESULTS"]] + [
    [path, filename, size, signature,
     "FOUND (row {})".format(masterlist_dict["Signature"].index(signature) + 1)]  # 1-based row
    if signature in masterlist_dict["Signature"]
    else [path, filename, size, signature, "NOT FOUND"]
    for path, filename, size, signature in host]
with open("new_host.csv", 'w', newline='') as f:
    csv.writer(f).writerows(final_result)
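The lookup in this answer needs the Signature values as one sequence, that is, a column rather than a list of rows. A small sketch of how `zip(*rows)` regroups rows into columns, using the sample master list from the question as inline data (the variable names `rows`, `columns`, and `position` are illustrative):

```python
# Data rows of masterlist.csv from the question, without the header.
rows = [
    ["b.txt", "678910"],
    ["x.txt", "111213"],
    ["b.txt", "777777"],
    ["c.txt", "999999"],
]

# zip(*rows) regroups the values column by column; pairing the result with
# the header names gives one tuple of values per column.
columns = dict(zip(["Filename", "Signature"], zip(*rows)))

# tuple.index returns the first 0-based position, so add 1 for a 1-based row.
position = columns["Signature"].index("111213") + 1

print(columns["Signature"])
print(position)
```

Note that `index` reports only the first match; a signature that occurs more than once would need a different lookup.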