Removing duplicate rows from a CSV file with a Python script

Posted 2021-05-05


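Goal

I downloaded a CSV file from Hotmail, but it contains many duplicates. The duplicates are exact copies of whole rows, and I don't know why my phone created them.

I want to get rid of the duplicates.

Approach

Write a Python script to remove the duplicates.

Technical specifications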


Windows XP SP 3
Python 2.7
CSV file with 400 contacts

Answer

Update: 2016

If you are happy to use the useful external library more_itertools:

from more_itertools import unique_everseen
with open('1.csv','r') as f, open('2.csv','w') as out_file:
    out_file.writelines(unique_everseen(f))

A more efficient version of @IcyFlame's solution:

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue # skip duplicate

        seen.add(line)
        out_file.write(line)

To edit the same file in place, you can use this:

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line in seen: continue # skip duplicate

    seen.add(line)
    print line, # standard output is now redirected to the file
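
If you run Python 3 rather than the Python 2.7 listed in the question, the print statement above is a syntax error. A minimal sketch of the same in-place approach for Python 3 follows; the function name dedupe_in_place is hypothetical and chosen only for illustration.

```python
import fileinput


def dedupe_in_place(path):
    """Remove duplicate lines from the file at `path`, keeping first occurrences."""
    seen = set()  # lines already written back to the file
    # While inplace=True, anything printed to standard output goes into the file.
    with fileinput.input(files=[path], inplace=True) as f:
        for line in f:
            if line not in seen:
                seen.add(line)
                # The line already ends with its own newline, so suppress print's.
                print(line, end="")
```

Calling dedupe_in_place('1.csv') rewrites the file. If you want to keep the original, fileinput.input also accepts a backup argument (for example backup='.bak').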

Another answer

You can use Pandas to remove duplicates efficiently:

import pandas as pd
file_name = "my_file_with_dupes.csv"
file_name_output = "my_file_without_dupes.csv"

df = pd.read_csv(file_name, sep=",")  # use sep="\t" for a tab-separated file

# Notes:
# - the `subset=None` means that every column is used 
#    to determine if two rows are different; to change that specify
#    the columns as an array
# - the `inplace=True` means that the data structure is changed and
#   the duplicate rows are gone  
df.drop_duplicates(subset=None, inplace=True)

# Write the results to a different file; index=False keeps pandas from
# adding its row index as an extra first column
df.to_csv(file_name_output, index=False)

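Another answer

You can use the following script:

Preconditions:

1.csv is the file that contains the duplicates.
2.csv is the output file; once the script has run, it will contain no duplicates.

Code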



inFile = open('1.csv','r')
outFile = open('2.csv','w')

listLines = []  # lines already written to the output file

for line in inFile:
    if line in listLines:
        continue  # duplicate: skip it
    else:
        outFile.write(line)
        listLines.append(line)

outFile.close()
inFile.close()

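Algorithm explanation

Here is what the script does:

It opens the file that contains the duplicates in read mode.
It then loops over the file until the end, checking whether each line has already been encountered.
If the line has been encountered, it is not written to the output file.
If it has not, it is written to the output file and added to the list of lines already encountered.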
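
The steps above compare raw text lines, so a record whose quoted field spans several lines would be compared piece by piece rather than as one row. As a hedged sketch, the same steps can be applied to parsed rows with the standard csv module. It assumes Python 3, swaps the list for a set to get faster lookups, and the names unique_rows and dedupe_csv are hypothetical.

```python
import csv


def unique_rows(rows):
    """Yield each row the first time it appears, preserving input order."""
    seen = set()
    for row in rows:
        key = tuple(row)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            yield row


def dedupe_csv(in_stream, out_stream):
    """Copy CSV rows from in_stream to out_stream, dropping exact duplicate rows."""
    writer = csv.writer(out_stream)
    writer.writerows(unique_rows(csv.reader(in_stream)))
```

With real files, open both with newline='' as the csv documentation recommends, for example open('1.csv', newline='') for reading and open('2.csv', 'w', newline='') for writing, then pass them to dedupe_csv. Note that the output is re-serialized by csv.writer, so quoting may differ cosmetically from the input.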

Another answer

A more efficient version of @jamylak's solution (one fewer instruction):

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line not in seen:
            seen.add(line)
            out_file.write(line)

To edit the same file in place, you can use this:

import fileinput
seen = set() # set for fast O(1) amortized lookup
for line in fileinput.FileInput('1.csv', inplace=1):
    if line not in seen:
        seen.add(line)
        print line, # standard output is now redirected to the file
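
One caveat with the line-based versions: they compare each line together with its line ending, so if the last line of the file has no trailing newline, it will not match an otherwise identical earlier line. A small sketch that compares lines without their line ending is shown here; it assumes Python 3, and the name unique_lines is hypothetical.

```python
def unique_lines(lines):
    """Yield each distinct line once, ignoring differences in line endings."""
    seen = set()
    for line in lines:
        key = line.rstrip("\r\n")  # compare the content only
        if key not in seen:
            seen.add(key)
            yield key + "\n"  # every output line gets a newline


def dedupe_file(in_path, out_path):
    """Write the distinct lines of in_path to out_path, in original order."""
    with open(in_path, "r") as in_file, open(out_path, "w") as out_file:
        out_file.writelines(unique_lines(in_file))
```

Under these assumptions, dedupe_file('1.csv', '2.csv') plays the same role as the loop above, with the difference that every output line ends with a newline.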
