Find all lines in a text file that have at least two words in common (Bash)

Posted 2023-03-15

【Title】: Find all lines in a text file that have at least two words in common (Bash)
【Posted】: 2016-02-09 05:42:21
【Question】:

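I have several large text files produced by different people. Each file contains a list of titles, one per line. Every title is worded differently, but the files are supposed to refer to the same (unknown) set of items.

Because the formatting and wording differ, I am trying to generate a shorter file of likely matches for manual review. I am new to Bash and have tried several commands to compare each line with the titles that share two or more keywords with it. The comparison should be case-insensitive, and only words longer than 4 characters should count as keywords, so that articles and the like are excluded.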

Example:

Input text file #1

Investigating Amusing King : Expl and/in the Proletariat
Managing Self-Confident Legacy: The Harlem Renaissance and/in the Abject
Inventing Sarcastic Silence: The Harlem Renaissance and/in the Invader
Inventing Random Ethos: The Harlem Renaissance and/in the Marginalized
Loss: Supplementing Transgressive Production and Assimilation

Input text file #2

Loss: Judging Foolhardy Historicism and Homosexuality
Loss: Developping Homophobic Textuality and Outrage
Loss: Supplement of transgressive production
Loss: Questioning Diligent Verbiage and Mythos
Me Against You: Transgressing Easygoing Materialism and Dialectic

Output text file

File #1-->Loss: Supplementing Transgressive Production and Assimilation
File #2-->Loss: Supplement of transgressive production

So far I have been able to weed out the duplicates whose entries are exactly identical...

cat FILE_num*.txt | sort | uniq -d > berbatim_duplicates.txt

...and some others that share the same annotation in brackets

  cat FILE_num*.txt | sort | cut -d "" -f2 | cut -d "" -f1 | uniq -d > same_annotations.txt

One command that looked promising relied on regular expressions, but I could not get it to work.

Thanks in advance.

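【Comments】:

I don't think this problem is a good fit for bash, and certainly not for a one-liner! Consider a scripting language such as Python, which makes it easier to keep track of the lines in each file.

OK, could you give me an example or some pointers to get started? Thanks.

There have to be two keywords in common, but in your example "Supplement" == "Supplementing".

@Labo I think the common words are Transgressive and Production.

What if the keywords Transgressive and Production appear in more than one line?

【Answer 1】: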

In Python 3:

from sys import argv
from re import sub

def getWordSet(line):
    # Drop [bracketed] and (parenthesised) annotations and basic punctuation.
    words = sub(r'\[.*\]|\(.*\)|[.,!?:]', '', line).split()
    # Keep lowercased words longer than 4 characters as keywords.
    return {word.lower() for word in words if len(word) > 4}

def compare(file1, file2):
    lines1 = file1.splitlines()
    lines2 = file2.splitlines()
    # Compute the keyword sets of the second file once, not once per line of the first.
    sets2 = [getWordSet(line) for line in lines2]
    for line1 in lines1:
        set1 = getWordSet(line1)
        for line2, set2 in zip(lines2, sets2):
            # Report the pair when at least two keywords are shared.
            if len(set1 & set2) > 1:
                print("File #1-->", line1, sep='')
                print("File #2-->", line2, sep='')

if __name__ == '__main__':
    with open(argv[1]) as file1, open(argv[2]) as file2:
        compare(file1.read(), file2.read())

This gives the expected output: it prints each pair of matching lines from the two files.
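To see why the sample pair from the question is reported, the keyword sets of the two titles can be worked out by hand. The sketch below is only an illustration: `keywords` is a hypothetical helper, not part of the script, and it assumes the same rules (strip basic punctuation, lowercase, keep words longer than 4 characters).

```python
import re

def keywords(line):
    # Strip basic punctuation, lowercase, keep words longer than 4 characters.
    cleaned = re.sub(r'[.,!?:]', '', line)
    return {w.lower() for w in cleaned.split() if len(w) > 4}

a = keywords("Loss: Supplementing Transgressive Production and Assimilation")
b = keywords("Loss: Supplement of transgressive production")

# Words shared by both titles after filtering.
common = a & b
print(sorted(common))  # ['production', 'transgressive']
```

"Loss" has only 4 characters, so it is filtered out, and "supplement" and "supplementing" are different strings, so they do not count as a match. The pair is reported only because of the two remaining shared keywords.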

Save this script to a file. I will call it script.py, but you can name it anything you like. You can run it with

python3 script.py file1 file2

You can even define an alias:

alias comp="python3 script.py"

and then

comp file1 file2

I have included the features from the discussion below.
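One of those features is ignoring annotations between brackets. A minimal sketch of that step in isolation follows; `strip_annotations` and the sample title are made up for illustration. This variant uses non-greedy `.*?`, which is an assumption on my part: a greedy `.*` would also remove the text between two separate annotations on the same line.

```python
import re

def strip_annotations(line):
    # Remove [...] and (...) annotations, then basic punctuation.
    # Non-greedy .*? stops at the first closing bracket it finds.
    return re.sub(r'\[.*?\]|\(.*?\)|[.,!?:]', '', line)

title = "Loss (draft): Transgressive [v2] Production"
print(strip_annotations(title).split())  # ['Loss', 'Transgressive', 'Production']
```

If titles never carry more than one annotation per line, the greedy and non-greedy forms behave the same.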

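【Discussion】:

Thanks Labo, but it gives me an error: File "find duplicates.py", line 16 print("File #1-->",line1,sep='') ^ SyntaxError: invalid syntax

OK, I have zero experience with Python, so after looking into it a bit I changed your print to print("File #1--> %s" % line1) and it works fine. Thanks!

That's because you are using Python 2. If that's okay with you, I'll add command-line support.

After some testing, the only problem so far is that it should now ignore any string between brackets. So I will try to remove them from the files using bash. Thanks a lot!

Do you have an example of a string between brackets?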
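The SyntaxError discussed above appears because `print` is a statement in Python 2 and does not accept `sep=`. One option is to put `from __future__ import print_function` at the top of the script. Another, sketched below, follows the commenter's workaround and builds the whole line before printing it; `format_match` is a hypothetical helper name, not part of the answer.

```python
def format_match(file_no, line):
    # Build the full output line first so print receives a single argument;
    # print("...") with one parenthesised string works on Python 2 and 3.
    return "File #%d-->%s" % (file_no, line)

print(format_match(1, "Loss: Supplementing Transgressive Production and Assimilation"))
print(format_match(2, "Loss: Supplement of transgressive production"))
```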
