Python: removing duplicated groups of text lines

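I know how to remove duplicate lines and duplicate characters from text, but I am trying to do something more complicated in Python 3. My text files may or may not contain groups of lines that are repeated within the file. I want to write a Python utility that finds these duplicated blocks of lines and removes all but the first occurrence.

For example, suppose file1 contains the following data: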

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.

Now is the time
for all good men
to come to the aid of their party.

Now is the time
for all good men
to come to the aid of their party.

That's all, folks.

I would like the following to be the result of the transformation:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.




That's all, folks.

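I would also like this to work when a duplicated group of lines starts somewhere other than the beginning of the file. Suppose file2 looks like this: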

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
All around
the mulberry bush
the monkey chased the weasel.
... and this is another phrase.

All around
the mulberry bush
the monkey chased the weasel.

End

For file2, this should be the result of the transformation:

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
... and this is another phrase.


End

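To be clear, the groups of lines that might be duplicated are not known before the utility runs. The algorithm must identify these duplicated groups itself.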
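
One way to discover repeated groups without knowing them in advance is to index every window of consecutive lines and keep the windows that occur more than once. The following is only a minimal sketch of that idea for a fixed window size; the name `repeated_windows` and its `size` parameter are hypothetical and are not part of the original question or answer.

```python
from collections import defaultdict


def repeated_windows(lines, size=2):
    """Map each run of `size` consecutive lines that occurs more than
    once to the list of indices where it starts."""
    positions = defaultdict(list)
    for i in range(len(lines) - size + 1):
        # A tuple of lines is hashable, so it can serve as a dict key.
        positions[tuple(lines[i:i + size])].append(i)
    # Keep only the windows seen at more than one position.
    return {window: starts for window, starts in positions.items()
            if len(starts) > 1}
```

This only reports candidates of one fixed size; finding the largest group still requires growing or comparing the candidates.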

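I'm sure that with enough work and enough time I could eventually come up with the algorithm I'm looking for. But I'm hoping someone has already solved this problem and published the result somewhere. I have searched and found nothing, but perhaps I overlooked something.

Addendum: I need to add some clarification. The group of lines must be the largest duplicated group, and each group must contain at least 2 lines.

For example, suppose file3 looks like this: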

line1 line1 line1
line2 line2 line2
line3 line3 line3

other stuff

line1 line1 line1
line3 line3 line3
line2 line2 line2

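In this case, the desired algorithm would not remove any lines, because no sequence of two or more consecutive lines occurs twice.

As another example, take file4: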

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

The result I am looking for is this:

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

In other words, because the 4-line group ("line1 ... line2 ... line3 ... line4 ...") is the largest duplicated group, it is the only one that gets removed.

I can always repeat the process until the file stops changing if I also want to remove the smaller duplicated groups.
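
Repeating a pass until nothing changes can be sketched as a small fixed-point loop. This is only an illustration: `apply_until_stable`, `transform`, and `max_passes` are hypothetical names, and `transform` stands for any function that takes the text and returns it with one duplicated group removed.

```python
def apply_until_stable(transform, text, max_passes=1000):
    """Call `transform` on the text repeatedly until the result stops
    changing, or until `max_passes` passes have been made."""
    for _ in range(max_passes):
        new_text = transform(text)
        if new_text == text:
            # A pass that changes nothing means we are done.
            return text
        text = new_text
    # Safety net in case the transform never settles.
    return text
```

The `max_passes` limit is a guard against a transform that keeps changing its input forever; a well-behaved removal pass only ever shortens the text, so it should stop long before that.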

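Answer

I came up with the following solution. There may still be edge cases that it doesn't handle, and it may not be the most efficient approach, but at least in my preliminary testing it seems to work.

This repost fixes some bugs in the version I originally submitted.

Any suggestions for improvement are welcome.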

# Remove all but the first occurrence of the longest                                                                            
# duplicated group of lines from a block of text.
# In this utility, a "group" of lines is considered
# to be two or more consecutive lines.                                                                             
#                                                                                                                               
# Much of this code has been shamelessly stolen from                                                                            
# https://programmingpraxis.com/2010/12/14/longest-duplicated-substring/                                                        

import sys

from itertools import starmap, takewhile, tee
from operator import eq, truth

# imap and izip no longer exist in python3 itertools.                                                                           
# These are simply equivalent to map and zip in python3.                                                                        
try:
    # python2 ...
    from itertools import imap
except ImportError:
    # python3 ...
    imap = map
try:
    # python2 ...
    from itertools import izip
except ImportError:
    # python3 ...
    izip = zip

def remove_longest_dup_line_group(text):
    if not text:
        return ''
    # Unlike in the original code, here we're dealing                                                                           
    # with groups of whole lines instead of strings                                                                              
    # (groups of characters). So we split the incoming                                                                          
    # data into a list of lines, and we then apply the                                                                          
    # algorithm to these lines, treating a line in the
    # same way that the original algorithm treats an
    # individual character.                                                                                                       
    lines = text.split('\n')
    ld = longest_duplicate(lines)
    if not ld:
        return text
    tokens = text.split(ld)
    if len(tokens) < 1:
        # Defensive programming: this shouldn't ever happen,                                                                    
        # but just in case ...                                                                                                  
        return text
    return '{}{}{}'.format(tokens[0], ld, ''.join(tokens[1:]))

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return izip(a,b)

def prefix(a, b):
    count = sum(takewhile(truth, imap(eq, a, b)))
    if count < 2:
        # Blocks must consist of more than one line.
        return ''
    else:
        return '{}\n'.format('\n'.join(a[:count]))

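def longest_duplicate(s):
    # Sort all suffixes of the line list so that suffixes sharing a
    # common prefix end up next to each other, then compare neighbours.
    suffixes = (s[n:] for n in range(len(s)))
    candidates = list(starmap(prefix, pairwise(sorted(suffixes))))
    # With fewer than two lines there is nothing to compare, and
    # max() would raise ValueError on an empty sequence.
    # Note: key=len ranks candidates by character count, not line count.
    return max(candidates, key=len) if candidates else ''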

if __name__ == '__main__':
    text = sys.stdin.read()
    if text:
        # Use sys.stdout.write instead of print to
        # avoid adding an extra newline at the end.
        sys.stdout.write(remove_longest_dup_line_group(text))
    sys.exit(0)
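
The solution above removes the duplicated group by splitting the text on the group's string, so a match can in principle begin in the middle of a line. As a possible alternative, and not the method used in the answer, the sketch below works on line indices only. It uses a simple quadratic dynamic-programming comparison rather than suffix sorting, and the names `longest_duplicate_run`, `remove_longest_dup_group`, and `min_len` are hypothetical. Blank lines are treated like any other line, so it may handle blank separators differently from the sample outputs in the question.

```python
def longest_duplicate_run(lines, min_len=2):
    """Return (length, start) of the longest run of consecutive lines
    that occurs at least twice without overlapping, or (0, -1)."""
    n = len(lines)
    best_len, best_start = 0, -1
    # prev[j]: length of the common run ending at lines[i-2], lines[j-2].
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            if lines[i - 1] == lines[j - 1]:
                # Cap at j - i so the two occurrences cannot overlap.
                cur[j] = min(prev[j - 1] + 1, j - i)
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_start = i - cur[j]
        prev = cur
    if best_len < min_len:
        return 0, -1
    return best_len, best_start


def remove_longest_dup_group(lines, min_len=2):
    """Return a new list with every occurrence of the longest
    duplicated group removed except the first one."""
    length, start = longest_duplicate_run(lines, min_len)
    if length == 0:
        return list(lines)
    group = lines[start:start + length]
    result, i, seen = [], 0, False
    while i < len(lines):
        if lines[i:i + length] == group:
            if not seen:
                # Keep the first occurrence only.
                result.extend(group)
                seen = True
            i += length
        else:
            result.append(lines[i])
            i += 1
    return result
```

To use it on a file, the text could be split with `text.splitlines()`, passed to `remove_longest_dup_group`, and joined again with `'\n'`. The comparison is quadratic in the number of lines, so this sketch is only suitable for small to medium files.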
