Python: Counting words from a directory of txt files and writing word counts to a separate txt file

Posted: 2021-06-25 23:28:08

Question:

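New to Python. I'm trying to count words in a directory of text files and write the output to a separate text file. However, I want to specify conditions: if the word count is > 0, I'd like to write the count and file path to one file; if the count is == 0, I'd like to write the count and file path to a separate file. Below is the code I have so far. I think I'm close, but I'm stuck on how to handle the conditions and the separate files. Thanks.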

import sys
import os
from collections import Counter
import glob

stdoutOrigin=sys.stdout 
sys.stdout = open("log.txt", "w")
              
def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join("path", '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key,val in words.items():
                #print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)
            
                
                


def print_summary(filepath, words):
    for key,val in sorted(words.items()):
        print(filepath)
        if val > 0:
            print('{0}:\t{1}'.format(
            key,
            val))
        







filepath = sys.argv[1]
keys = ["x", "y"]
words = dict.fromkeys(keys,0)

count_words_in_dir(filepath, words, action=print_summary)

sys.stdout.close()
sys.stdout=stdoutOrigin

Comments:

stdoutOrigin=sys.stdout; sys.stdout = open("log.txt", "w") Why on earth are you monkey-patching this instead of using a file logger? Or just open the output file and write to it directly!

Answer 1:

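I strongly suggest that you not repurpose stdout for writing data to a file as part of the normal course of your program. I also wonder how you could ever have a word "count

The main problem with your code is this line: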

for filepath in glob.iglob(os.path.join("path", '*.txt')):

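I'm pretty sure the string constant "path" doesn't belong there. I think you want the directory from filepath there (inside the function, that value arrives as the dirpath parameter). I think this problem would keep your code from working.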
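To see why the literal matters, here is a small sketch, assuming a throwaway temporary directory that holds one hypothetical file named data.txt. The helper list_txt_files is my own name, not part of the question's code. A pattern built from the literal "path" only matches files inside a subdirectory called path under the current working directory, while a pattern built from the variable matches files in the directory you actually passed in.

```python
import glob
import os
import tempfile


def list_txt_files(dirpath):
    """Return the sorted .txt files directly inside dirpath (illustrative helper)."""
    return sorted(glob.glob(os.path.join(dirpath, "*.txt")))


# Hypothetical setup: a temporary directory holding a single text file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "data.txt"), "w") as f:
        f.write("country friend country\n")

    # What the question's code builds: it ignores the directory argument.
    literal_pattern = os.path.join("path", "*.txt")

    # What was intended: a pattern rooted at the directory that was passed in.
    found_names = [os.path.basename(p) for p in list_txt_files(tmp)]

    print(literal_pattern)
    print(found_names)
```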

Here is a version of your code in which I fixed these problems and added logic to write to two different output files depending on the count:

import sys
import os
import glob

out1 = open("/tmp/so/seen.txt", "w")
out2 = open("/tmp/so/missing.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key, val in words.items():
                # print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)


def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

filepath = sys.argv[1]
keys = ["country", "friend", "turnip"]
words = dict.fromkeys(keys, 0)

count_words_in_dir(filepath, words, action=print_summary)

out1.close()
out2.close()

Result:

File seen.txt:

/Users/steve/tmp/so/dir/data2.txt
friend: 1
/Users/steve/tmp/so/dir/data.txt
country: 2
/Users/steve/tmp/so/dir/data.txt
friend: 1

File missing.txt:

/Users/steve/tmp/so/dir/data2.txt
country: 0
/Users/steve/tmp/so/dir/data2.txt
turnip: 0
/Users/steve/tmp/so/dir/data.txt
turnip: 0

(Please forgive me for using somewhat more interesting search terms than yours.)
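As a variation on the code above, here is a minimal sketch that opens both output files in a with statement, so they are closed even if an error occurs partway through. The function name write_counts and its parameters are my own, chosen for illustration. The counting still uses str.count, which counts substring occurrences, as in the code above.

```python
import glob
import os


def write_counts(dirpath, keys, seen_path, missing_path):
    """Count each key in every .txt file directly inside dirpath.

    Entries with a count > 0 go to seen_path, entries with a count == 0
    go to missing_path. The name and signature are illustrative assumptions.
    """
    with open(seen_path, "w") as seen, open(missing_path, "w") as missing:
        for filepath in sorted(glob.glob(os.path.join(dirpath, "*.txt"))):
            with open(filepath) as f:
                data = f.read()
            for key in sorted(keys):
                count = data.count(key)  # substring count, as in the code above
                out = seen if count > 0 else missing
                print(filepath, file=out)
                print("{0}: {1}".format(key, count), file=out)
```

It could be called as write_counts(sys.argv[1], ["country", "friend", "turnip"], "seen.txt", "missing.txt"). Keeping the two output files outside the scanned directory avoids having them picked up by the *.txt pattern.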

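Comments:

Great. Thanks. It was supposed to be == 0, but I sure did a fine job with my editing, and the same goes for the "path" constant. But this works great. Thanks for the very helpful comments.

Answer 2: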

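Hi, I hope I understood your question correctly. This code counts how many distinct words each of your files contains and, depending on the condition, does what you want.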

import os
all_words = {}


def count(file_path):
    with open(file_path, "r") as f:
        # for better performance it is a good idea to go line by line through file
        for line in f:
            # singles out all the words by splitting the line on whitespace
            # (split() with no argument also drops the trailing newline)
            words = line.split()
            # and checks if word already exists in all_words dictionary...
            for word in words:
                # normalize: strip commas and periods, lowercase
                key = word.replace(",", "").replace(".", "").lower()
                try:
                    # ...if it does increment number of repetitions
                    all_words[key] += 1
                except KeyError:
                    # ...if it doesn't create it and give it number of repetitions 1
                    all_words[key] = 1


if __name__ == '__main__':

    # for every text file in your current directory count how many words it has
    for file in os.listdir("."):
        if file.endswith(".txt"):
            all_words = {}
            count(file)
            n = len(all_words)
            # depending on the number of words do something
            if n > 0:
                with open("count1.txt", "a") as f:
                    f.write(file + "\n" + str(n) + "\n")
            else:
                with open("count2.txt", "a") as f:
                    f.write(file + "\n" + str(n) + "\n")

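If you want to count the same word multiple times, you can sum all the values in the dictionary, or you can eliminate the try-except block and count every word there.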
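To illustrate the first suggestion, here is a minimal sketch, assuming all_words has the same shape as in the code above (each word mapped to its number of occurrences). The sample contents are made up. len gives the number of distinct words, while summing the values gives the total number of words, repeats included.

```python
# Made-up sample in the same shape as all_words above: word -> occurrences.
all_words = {"country": 2, "friend": 1, "turnip": 3}

distinct_words = len(all_words)        # each word counted once
total_words = sum(all_words.values())  # every occurrence counted

print(distinct_words, total_words)  # prints: 3 6
```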

Comments:

Thanks. I'll try this.
