Python: removing repeated groups of text lines

Question

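I know how to remove duplicate lines and duplicate characters from text, but I'm trying to do something more complex in Python 3. My text files may or may not contain groups of lines that are repeated within the same file. I want to write a Python utility that finds these repeated blocks of lines and removes all but the first occurrence of each.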

For example, suppose file1 contains the following data:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.

Now is the time
for all good men
to come to the aid of their party.

Now is the time
for all good men
to come to the aid of their party.

That's all, folks.

I would like this to be the result of the transformation:

Now is the time
for all good men
to come to the aid of their party.

This is some other stuff.

And this is even different stuff.




That's all, folks.

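I also want this to work when the repeated group of lines first appears somewhere other than the beginning of the file. Suppose file2 looks like this: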

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
All around
the mulberry bush
the monkey chased the weasel.
... and this is another phrase.

All around
the mulberry bush
the monkey chased the weasel.

End

For file2, this should be the result of the transformation:

This is some text.

This is some other text,
as is this.

All around
the mulberry bush
the monkey chased the weasel.

Here is some more random stuff.
... and this is another phrase.


End

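To be clear, the groups of lines that might be repeated are not known before the utility runs. The algorithm has to identify the repeated groups on its own.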

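I'm sure that with enough work and enough time I could eventually come up with the algorithm I'm looking for. But I'm hoping someone has already solved this problem and published the result somewhere. I've been searching and haven't found anything, but maybe I've overlooked something.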

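ADDENDUM: I need to clarify further. The groups of lines must be the largest possible groups, and each group must contain at least 2 lines.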

For example, suppose file3 looks like this:

line1 line1 line1
line2 line2 line2
line3 line3 line3

other stuff

line1 line1 line1
line3 line3 line3
line2 line2 line2

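In this case, the desired algorithm would not remove any lines, because no group of 2 or more consecutive lines appears twice in the same order.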

Another example, file4:

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

The result I'm looking for is this:

abc def ghi
jkl mno pqr

line1 line1 line1
line2 line2 line2
line3 line3 line3
abc def ghi
line1 line1 line1
line2 line2 line2
line3 line3 line3
line4 line4 line4
qwerty
asdfghj
line1 line1 line1
line2 line2 line2
line3 line3 line3
lkjhgfd
line2 line2 line2
line3 line3 line3
line4 line4 line4
wxyz

In other words, since the 4-line group ("line1 ... line2 ... line3 ... line4 ...") is the largest repeated group, it is the only one that gets removed.
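One possible reading of this requirement, as a rough sketch rather than a finished solution: find the longest run of consecutive lines that occurs at least twice without overlapping, keep its first occurrence, and drop every later copy. The function names `longest_repeated_block` and `remove_later_copies` are hypothetical. The sketch also assumes that blank lines never count as part of a group (which is what the expected output for file2 suggests) and that lines are compared exactly, so it is best fed with `text.splitlines()`. It runs in quadratic time, which may be too slow for very large files.

```python
def longest_repeated_block(lines, min_len=2):
    """Return (start, length) of the first occurrence of the longest block of
    consecutive non-blank lines that occurs again later, or None if no block
    of at least min_len lines is repeated."""
    n = len(lines)
    best_len, best_start = 0, -1
    prev = [0] * (n + 1)  # run lengths computed in the previous outer iteration
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            # Assumption: blank (whitespace-only) lines never belong to a block.
            if lines[i - 1] == lines[j - 1] and lines[i - 1].strip():
                # Extend the run ending one line earlier; cap it at j - i so
                # the two occurrences cannot overlap.
                cur[j] = min(prev[j - 1] + 1, j - i)
                if cur[j] > best_len:
                    best_len, best_start = cur[j], i - cur[j]
        prev = cur
    if best_len < min_len:
        return None
    return best_start, best_len


def remove_later_copies(lines, min_len=2):
    """Return a new list in which every copy of the longest repeated block,
    except the first one, has been removed."""
    found = longest_repeated_block(lines, min_len)
    if found is None:
        return list(lines)
    start, length = found
    block = lines[start:start + length]
    out = list(lines[:start + length])  # everything up to and including the first copy
    i = start + length
    while i < len(lines):
        if lines[i:i + length] == block:
            i += length  # skip a later copy of the block
        else:
            out.append(lines[i])
            i += 1
    return out
```

With this sketch, something like `remove_later_copies(open("file4").read().splitlines())` would be one way to try it on the sample data; when two different blocks tie for the longest length, only the one whose first occurrence comes earliest is handled.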

If I also want the smaller repeated groups removed, I can always repeat the process until the file stops changing.
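The "repeat until nothing changes" idea can be sketched as a small driver loop. The helper name `repeat_until_stable` and its `one_pass` parameter are hypothetical: `one_pass` stands for any function that takes a list of lines and returns a new list with one round of repeated groups removed. The loop only terminates if each pass either shortens the list or leaves it unchanged.

```python
def repeat_until_stable(one_pass, lines):
    """Apply one_pass to the lines over and over until a pass changes nothing,
    then return the final list of lines."""
    current = list(lines)
    while True:
        result = list(one_pass(current))
        if result == current:
            # A full pass removed nothing, so a fixed point has been reached.
            return result
        current = result
```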