I have R code to extract information from one document. How do I loop that for all the documents in my folder?

【Posted】: 2022-01-13 13:10:23

【Question】:

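I have a folder of txt files from which I want to extract specific text and arrange it into separate columns of a new data frame. I wrote the code for one file, but I can't seem to turn it into a loop that runs over all the documents in my folder.

Here is my code for one txt file: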

library(dplyr)    # mutate, filter, rename, %>%
library(tidyr)    # fill, unnest
library(stringr)  # str_replace_all

# `text` is assumed to be a data frame with a `text` column holding the file contents
clean_text <- as.data.frame(strsplit(text$text, '\\*'), col.names = "text") %>%
  # normalise whitespace and rejoin hyphenated line breaks
  mutate(text = str_replace_all(text, "\n", " "),
         text = str_replace_all(text, "- ", ""),
         text = str_replace_all(text, "^\\s", "")) %>%
  filter(text != " ") %>%
  # rows starting with a digit are paragraphs; the rest are category headings
  mutate(paragraphs = ifelse(grepl("^[[:digit:]]", text), text, NA)) %>%
  rename(category = text) %>%
  mutate(category = ifelse(grepl("^[[:digit:]]", category), NA, category)) %>%
  fill(category) %>%
  filter(!is.na(paragraphs)) %>%
  # split on paragraph numbers such as "12." and on the PDF link text
  mutate(paragraphs = strsplit(paragraphs, '^[[:digit:]]{1,3}\\.|\\t\\s[[:digit:]]{1,3}\\.')) %>%
  unnest(paragraphs) %>%
  mutate(paragraphs = strsplit(paragraphs, 'Download as PDF')) %>%
  unnest(paragraphs) %>%
  mutate(paragraphs = str_replace_all(paragraphs, "\t", "")) %>%
  mutate(paragraphs = ifelse(grepl("javascript", paragraphs), "", paragraphs)) %>%
  mutate(paragraphs = str_replace_all(paragraphs, "^\\s+", "")) %>%
  filter(paragraphs != "")

How do I turn this into a loop? I realize there are similar questions, but none of the solutions worked for me. Thanks in advance for your help!

【Answer 1】:

Put your code into a function:

extract_info = function(file) {
  ## Add the code you need to read the text from the file
  ## Something like
  ## text <- readLines(file)
  ## or whatever you are using to read in the file
  clean_text <- as.data.frame(strsplit(text$text, '\\*'), col.names = "text") %>%
    mutate(text = str_replace_all(text, "\n", " "),
           text = str_replace_all(text, "- ", ""),
           text = str_replace_all(text, "^\\s", "")) %>%
    filter(text != " ") %>%
    mutate(paragraphs = ifelse(grepl("^[[:digit:]]", text), text, NA)) %>%
    rename(category = text) %>%
    mutate(category = ifelse(grepl("^[[:digit:]]", category), NA, category)) %>%
    fill(category) %>%
    filter(!is.na(paragraphs)) %>%
    mutate(paragraphs = strsplit(paragraphs, '^[[:digit:]]{1,3}\\.|\\t\\s[[:digit:]]{1,3}\\.')) %>%
    unnest(paragraphs) %>%
    mutate(paragraphs = strsplit(paragraphs, 'Download as PDF')) %>%
    unnest(paragraphs) %>%
    mutate(paragraphs = str_replace_all(paragraphs, "\t", "")) %>%
    mutate(paragraphs = ifelse(grepl("javascript", paragraphs), "", paragraphs)) %>%
    mutate(paragraphs = str_replace_all(paragraphs, "^\\s+", "")) %>%
    filter(paragraphs != "")

  ## Return the cleaned data frame so the caller can see and collect it
  clean_text
}

Test your function to make sure it works on one file:

extract_info("your_file_name.txt")
## does the result work and look right? 
## work on your function until it does

Get a list of all the files you want to run it on:

my_files = list.files()
## by default this will give you all the files in your working directory
## use the `pattern` argument if you only want files that follow
## a certain naming convention
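
As a minimal sketch of the `pattern` argument mentioned above: the folder and file names below are hypothetical and are created in a temporary directory purely for the demonstration. `full.names = TRUE` is an optional extra that returns paths including the folder, which is useful when the files are not in the working directory.

```r
# Hypothetical folder with mixed file types, created only for this demo
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.txt", "b.txt", "notes.csv")))

# Keep only names ending in ".txt"; full.names = TRUE prepends the folder
txt_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
basename(txt_files)
```

The pattern is a regular expression, so the dot is escaped and `$` anchors the match to the end of the file name.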

Apply your function to those files:

results = lapply(my_files, extract_info)
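
`lapply` returns a list with one element per file. If a single data frame is wanted, as the question describes, the pieces can be stacked afterwards. The sketch below is one possible way to do that in base R; `extract_info_demo` is a hypothetical stand-in for `extract_info`, and the two demo files are made up for illustration.

```r
# Hypothetical demo files in a temporary folder
demo_dir <- file.path(tempdir(), "combine_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("only line", file.path(demo_dir, "b.txt"))

# Stand-in for extract_info(): returns one row per line of the file
extract_info_demo <- function(file) {
  data.frame(paragraphs = readLines(file), stringsAsFactors = FALSE)
}

my_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
results <- lapply(my_files, extract_info_demo)
names(results) <- basename(my_files)

# Add a column recording the source file, then stack the data frames
results <- Map(
  function(df, name) data.frame(file = name, df, stringsAsFactors = FALSE),
  results, names(results)
)
all_results <- do.call(rbind, unname(results))
rownames(all_results) <- NULL
```

With dplyr loaded, `bind_rows(results, .id = "file")` applied to the named list from `lapply` should give a similar result in one call.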

【Answer 2】:

Instead of a loop I used lapply, which behaves the same way as a loop here:

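library(readr)  # provides read_file()

## my_path must end with "/" because paste() below joins it to the file name as-is
my_path <- "C:/Users/SAID ABIDI/Desktop/test/"
my_a <- list.files(path = my_path)

## read the x-th file in the folder as a single string
my_function <- function(x) {
  read_file(paste(my_path, my_a[x], sep = ""))
}

my_var <- lapply(seq_along(my_a), my_function)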

Does this help?

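【Comments】:

Hi, so I tried your method, but it returns: Error: '/Users/m.iero/accessioncommitments/text_filesafghanistan_commitments.txt' does not exist. That's odd, because I gave the path as the whole folder ('/Users/m.iero/accessioncommitments/text_files), not just one txt file. Does it work for you?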
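
The path in that error message has no `/` between `text_files` and the file name, which suggests the folder was given without a trailing slash and then joined with `paste(..., sep = "")`. The sketch below reproduces that using the folder from the comment; the file name is inferred from the error message, and the variable names are hypothetical.

```r
folder <- "/Users/m.iero/accessioncommitments/text_files"  # no trailing slash
file_name <- "afghanistan_commitments.txt"                 # inferred from the error

# paste() with sep = "" runs the folder and file name together
joined_with_paste <- paste(folder, file_name, sep = "")

# file.path() inserts the "/" separator between its arguments
joined_with_file_path <- file.path(folder, file_name)

joined_with_paste
joined_with_file_path
```

Either adding the trailing slash to the folder, switching to `file.path()`, or calling `list.files(folder, full.names = TRUE)` should avoid the problem.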
