How can I do a python keyword search and word counter within all csv files in a directory and write to a single csv? [closed]
Posted: 2021-07-21 01:03:13

I'm new to python and trying to get familiar with some libraries. I'm not sure how to upload a csv to SO, but this script works with any csv; just replace "SwitchedProviders_TopicModel".
My goal is to loop through all of the csvs in a directory (C:\Users\jj\Desktop\autotranscribe) and write the output of my python script, one row per file, to a single csv.
Say, for example, that I had these csv files in the folder above:
'1003391793_1003391784_01bc7e411408166f7c5468f0.csv'
'1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv'
'1003478130_1003478103_8eef05b0820cf0ffe9a9882d.csv'
I would like my python application (below) to run a word counter for every csv in the folder/directory and write the output to a dataframe like this:
csvname pre existing exclusions limitations fourteen
1003391793_1003391784_01bc7e411408166f7c5468f0.csv 1 2 0 1
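As a sketch of that target layout (the file name and counts are just the example numbers above), a nested dict of per-file keyword counts converts directly with pandas:

```python
import pandas as pd

# Per-file keyword counts; file name and numbers come from the example above
counts = {
    "1003391793_1003391784_01bc7e411408166f7c5468f0.csv": {
        "pre existing": 1, "exclusions": 2, "limitations": 0, "fourteen": 1,
    },
}

# One row per csv, one column per keyword; a file missing a keyword gets 0
df = pd.DataFrame.from_dict(counts, orient="index").fillna(0).astype(int)
df.index.name = "csvname"
df.to_csv("keyword_counts.csv")  # the single combined csv
```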
My script:
import pandas as pd
from collections import defaultdict

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    line_number = 0
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open("SwitchedProviders_TopicModel.csv", 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            line_number += 1
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append((string_to_search, line_number, line.rstrip()))
    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)

matched_lines, count = search_multiple_strings_in_file('SwitchedProviders_TopicModel.csv', ['pre existing ', 'exclusions', 'limitations', 'fourteen'])

df = pd.DataFrame.from_dict(count, orient='index').reset_index()
df.columns = ['Word', 'Count']
print(df)
How can I do this? I'm only looking for a counter of specific words, such as "fourteen" as you can see in my script, not a counter of every word.
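For counting only a fixed keyword list rather than every word, a minimal sketch on a made-up transcript line:

```python
# Hypothetical transcript line and the keyword list from the question
text = "the policy has exclusions and limitations exclusions apply after fourteen days"
keywords = ["pre existing", "exclusions", "limitations", "fourteen"]

# str.count also matches multi-word phrases like "pre existing",
# which splitting into tokens and tallying every word would miss
counts = {kw: text.count(kw) for kw in keywords}
# counts == {'pre existing': 0, 'exclusions': 2, 'limitations': 1, 'fourteen': 1}
```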
Sample data from one of the csvs (credit to user Umar.H):
df = pd.read_csv('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv')
print(df.head(10).to_dict())
{'transcript': {0: 'hi thanks for calling ACCA this is many speaking could have the pleasure speaking with ', 1: 'so ', 2: 'hi ', 3: 'I have the pleasure speaking with my name is B. as in boy E. V. D. N. ', 4: 'thanks yes and I think I have your account pulled up could you please verify your email ', 5: "sure is yeah it's on _ 00 ", 6: 'I T. O.com ', 7: 'thank you how can I help ', 8: 'all right I mean I do have an insurance with you guys I just want to cancel the insurance ', 9: 'sure I can help with that what was the reason for cancellation '}, 'confidence': {0: 0.73, 1: 0.18, 2: 0.88, 3: 0.72, 4: 0.83, 5: 0.76, 6: 0.83, 7: 0.98, 8: 0.89, 9: 0.95}, 'from': {0: 1.69, 1: 1.83, 2: 2.06, 3: 2.13, 4: 2.36, 5: 2.98, 6: 3.17, 7: 3.65, 8: 3.78, 9: 3.93}, 'to': {0: 1.83, 1: 2.06, 2: 2.13, 3: 2.36, 4: 2.98, 5: 3.17, 6: 3.65, 7: 3.78, 8: 3.93, 9: 4.14}, 'speaker': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}, 'Negative': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.116, 9: 0.0}, 'Neutral': {0: 0.694, 1: 1.0, 2: 1.0, 3: 0.802, 4: 0.603, 5: 0.471, 6: 1.0, 7: 0.366, 8: 0.809, 9: 0.643}, 'Positive': {0: 0.306, 1: 0.0, 2: 0.0, 3: 0.198, 4: 0.397, 5: 0.529, 6: 0.0, 7: 0.634, 8: 0.075, 9: 0.357}, 'compound': {0: 0.765, 1: 0.0, 2: 0.0, 3: 0.5719, 4: 0.7845, 5: 0.5423, 6: 0.0, 7: 0.6369, 8: -0.1779, 9: 0.6124}}
Comments:
Can you add a sample of the file you're parsing? It's not 100% clear what you're searching for.
Hey @Umar.H, thanks for the reply. Not sure how to attach a csv file here, fairly new to SO. Let me know and I'll do that.
You can add the first 10 rows of the fields by pasting them as text, or you can load the file into a pandas dataframe with df = pd.read_csv(your_file), then run print(df.head(10).to_dict()) and paste the output here so we can reproduce your file.
Hi @Umar.H, I did just as you mentioned. Please see the edited question.
@Umar.H yes, the word counter should only look for words in the transcript column.
Answer 1:
Steps:
- Define the input path.
- Collect all the CSV files.
- Count the keywords in each file.
- Create one result dict, keyed by file name, holding each file's counter dict.
- Finally, convert the resulting dict to a dataframe and transpose it (filling NaN values with 0 if needed).
import string
from collections import Counter, defaultdict
from pathlib import Path

import pandas as pd

inp_dir = Path(r'C:/Users/jj/Desktop/Bulk_Wav_Completed')  # input dir

def search_multiple_strings_in_file(file_name, list_of_strings):
    """Get line from the file along with line numbers, which contains any string from the list"""
    list_of_results = []
    count = defaultdict(lambda: 0)
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line_number, line in enumerate(read_obj, start=1):
            # For each line, check if line contains any string from the list of strings
            for string_to_search in list_of_strings:
                if string_to_search in line:
                    count[string_to_search] += line.count(string_to_search)
                    # If any string is found in line, then append that line along with line number in list
                    list_of_results.append(
                        (string_to_search, line_number, line.rstrip()))
    # Return list of tuples containing matched string, line numbers and lines where string is found
    return list_of_results, dict(count)

result = {}
for csv_file in inp_dir.glob('**/*.csv'):
    print(csv_file)  # for debugging
    matched_lines, count = search_multiple_strings_in_file(csv_file, ['nation', 'nation wide', 'trupanion', 'pet plan', 'best', 'embrace', 'healthy paws', 'pet first', 'pet partners', 'lemon',
                                                                      'AKC', 'akc', 'kennel club', 'club', 'american kennel', 'american', 'lemonade',
                                                                      'kennel', 'figo', 'companion protect', 'true companion',
                                                                      'true panion', 'trusted pals', 'partners', 'lemonade', 'partner',
                                                                      'wagmo', 'vagmo', 'bivvy', 'bivy', 'bee', '4paws', 'paws', 'pet best',
                                                                      'pets best', 'pet best'])
    print(count)  # for debugging
    result[csv_file.name] = count

df = pd.DataFrame(result).T.fillna(0).astype(int)
Output:
exclusions limitations pre existing
1.csv 1 3 1
2.csv 1 3 1
Comments:
I'm only looking for the certain words I mentioned in my python script, so I don't need a word counter for every word in all the csvs.
@JayTaggert I've updated my answer to filter for the required words.
All the word counts show up as 0, and only two words show up, limitations and exclusions, instead of 4.
@JayTaggert try it once more now. I've used your own function to evaluate the counts.
The output now has 0 columns, and the index shows all my csvs. Unfortunately there are no word count values either.

Answer 2:
Since you've tagged pandas, we can use .str.extractall to search for the words and row numbers.
You can extend the function and add some error handling (e.g., what happens if transcript doesn't exist in a given csv file).
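A sketch of what that error handling might look like (the helper name is hypothetical; it assumes the same transcript column and keyword pattern used in this answer, and returns None for files it cannot process):

```python
import pandas as pd

def safe_extract_counts(csv_path, key_words):
    """Return a {keyword: count} dict for one file, or None if the file
    is unreadable or has no transcript column. Hypothetical helper."""
    try:
        df = pd.read_csv(csv_path)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        return None
    if "transcript" not in df.columns:
        return None
    # Same alternation pattern as the answer's extractall call
    pattern = f"({'|'.join(key_words)})"
    matches = df["transcript"].astype(str).str.extractall(pattern)
    return matches[0].value_counts().to_dict()
```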
from pathlib import Path

import pandas as pd

def get_files_to_parse(start_dir: str) -> list:
    files = [f for f in Path(start_dir).glob('*.csv')]
    return files

def search_multiple_files(list_of_paths: list, key_words: list) -> pd.DataFrame:
    dfs = []
    for file in list_of_paths:
        df = pd.read_csv(file)
        word_df = df['transcript'].str.extractall(f"({'|'.join(key_words)})")\
                                  .droplevel(1, 0)\
                                  .reset_index()\
                                  .rename(columns={'index': f"{file.parent}_{file.stem}"})\
                                  .set_index(0).T
        dfs.append(word_df)
    return pd.concat(dfs)
Usage:
Using your sample dataframe (I added a couple of keywords from your list):
files = get_files_to_parse('target\dir\folder')
[WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c.csv'),
WindowsPath('1003478130_1003478103_8eef05b0820cf0ffe9a9754c_copy.csv')]
search_multiple_files(files,['pre existing', 'exclusions','limitations','fourteen'])
Comments:
Hi @Umar, after I run the last line, search_multiple_files, I get an error: File "C:\Users\jj\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\concat.py", line 516, in get_result, indexers[ax] = obj_labels.get_indexer(new_labels); File "C:\Users\jj\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\index\base.py", line 3172, in get_indexer; InvalidIndexError: Reindexing only valid with uniquely valued Index objects
@JayTaggert most likely you have duplicate file names (in different directories, I assume?)
Yes, I do. Is there a way around this?
@JayTaggert not tested, but prepending the parent to the file name should work.
Hi Umar, thanks for the reply. I'm new to python and not sure what you mean by adding the word "parent" before the filename. If you could edit your answer, I'd really appreciate it.
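A sketch of that "prepend the parent" suggestion (the paths here are made up): building each label from the parent folder name plus the file stem keeps the labels unique, so pd.concat no longer sees duplicate column names:

```python
from pathlib import Path

# Hypothetical paths that share a filename but sit in different parent directories
files = [Path("batch_a/1003478130.csv"), Path("batch_b/1003478130.csv")]

# Prefix each stem with its parent folder to make every label unique
labels = [f"{f.parent.name}_{f.stem}" for f in files]
# labels == ['batch_a_1003478130', 'batch_b_1003478130']
assert len(set(labels)) == len(labels)  # unique, so concat won't raise
```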