Use R to convert PDF files to text files for text mining

Posted 2023-02-19

【Title】: Use R to convert PDF files to text files for text mining 【Posted】: 2014-02-22 02:47:49 【Question】:

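I have nearly one thousand PDF journal articles in a folder. I need to do text mining on the abstracts of all the articles in that folder. Right now I am doing the following: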

dest <- "~/A1.pdf"

# set path to pdftotxt.exe and convert pdf to text
exe <- "C:/Program Files (x86)/xpdfbin-win-3.03/bin32/pdftotext.exe"
system(paste("\"", exe, "\" \"", dest, "\"", sep = ""), wait = F)

# get txt-file name and open it
filetxt <- sub(".pdf", ".txt", dest)
shell.exec(filetxt)

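With this, I convert one PDF file to one .txt file, then copy the abstract into another .txt file and compile the results manually. This is tedious.

How can I read every article in the folder and convert each one into a .txt file that contains only that article's abstract? It should be possible by keeping only the content between ABSTRACT and INTRODUCTION in each article, but I have not been able to do it. Any help is appreciated.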

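【Comments】:

This is not really an R question. You need a utility to extract text from PDF documents, which is not what R was designed for. I'm voting to close based on the fact that this is an implicit request for such a tool.

Not exactly an R question, but Ben's reply was very helpful to me. Thanks.

Possible duplicate of How to export pdf form fields to xml automatically

【Answer 1】: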

Yes, as IShouldBuyABoat points out, this is not really an R question, but it is something R can do with only minor contortions...

Use R to convert PDF files to txt files...

# folder with 1000s of PDFs
dest <- "C:\\Users\\Desktop"

# make a vector of PDF file names
# (the anchored pattern matches only names that end in .pdf)
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)

# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
# wait = TRUE so every txt file exists before the next step lists them
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
             paste0('"', i, '"')), wait = TRUE))

Extract only the abstracts from the txt files...

# if you just want the abstracts, we can use regex to extract that part of
# each txt file. Assumes that the abstract is always between the words 'Abstract'
# and 'Introduction'
mytxtfiles <- list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)
abstracts <- lapply(mytxtfiles, function(i) {
  # quote = "" stops scan() from treating apostrophes as string delimiters
  j <- paste0(scan(i, what = character(), quote = ""), collapse = " ")
  regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))
})
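
Before running the pattern over a thousand files, it can help to try it on a short string and look at what comes back. The following is a minimal sketch; the sample text and the variable names `abstract` and `no_match` are made up for illustration.

```r
# Made-up sample text standing in for one converted article
j <- "Title Abstract This paper studies X. Introduction Prior work on X ..."

# (?<=Abstract) and (?=Introduction) are zero-width, so the two marker
# words themselves are not part of the match; .*? stops at the first
# 'Introduction' it meets
m <- regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))

# regmatches() returns a list with one character vector per input string
abstract <- trimws(m[[1]])
print(abstract)

# A text without both markers yields character(0) rather than an error
no_match <- regmatches("no markers here",
                       gregexpr("(?<=Abstract).*?(?=Introduction)",
                                "no markers here", perl = TRUE))[[1]]
print(length(no_match))
```

The pattern is case-sensitive, so articles that print the headings as ABSTRACT and INTRODUCTION would likely need the marker words changed or a `(?i)` prefix added to the pattern.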

Write the abstracts to separate txt files...

# write abstracts as txt files
# (or use them in the list for whatever you want to do next)
lapply(seq_along(abstracts), function(i) {
  write.table(abstracts[i],
              file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})

Now you are ready to do some text mining on the abstracts.
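
As one possible first step, a word-frequency count needs nothing beyond base R. This is only a sketch: `abstract_text` holds made-up strings standing in for the extracted abstracts, and the tokenizer keeps ASCII letters only.

```r
# Hypothetical abstracts; in practice these would come from the 'abstracts' list
abstract_text <- c("Text mining of PDF articles with R.",
                   "Mining text from articles: an R workflow.")

# Lower-case, split on anything that is not an ASCII letter, drop empty strings
words <- unlist(strsplit(tolower(abstract_text), "[^a-z]+"))
words <- words[nzchar(words)]

# Count each word and sort from most to least frequent
word_freq <- sort(table(words), decreasing = TRUE)
print(head(word_freq, 5))
```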

【Comments】:

Thanks a lot. This is what I was striving for. Thanks again.

Is "pdftotext.exe" software that we need to install?

【Answer 2】:

We can use the pdftools library

library(pdftools)
# you can use a url or a path
pdf_url <- "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf"

# `pdf_text` returns a character vector with one element per page
list_output <- pdftools::pdf_text(pdf_url)

# you get an element by page
length(list_output) # 5 elements for a 5 page pdf

# let's print the 5th
cat(list_output[[5]])
# Index
# pdf_attachments (pdf_info), 2
# pdf_convert (pdf_render_page), 3
# pdf_fonts (pdf_info), 2
# pdf_info, 2, 3
# pdf_render_page, 2, 3
# pdf_text, 2
# pdf_text (pdf_info), 2
# pdf_toc (pdf_info), 2
# pdftools (pdf_info), 2
# poppler_config (pdf_render_page), 3
# render (pdf_render_page), 3
# suppressMessages, 2
# 5

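To extract the abstracts from the articles, the OP chose to take the content between Abstract and Introduction.

We will take a list of CRAN PDFs and extract the author(s) as the text between Author and Maintainer (I handpicked a few that have a compatible format).

To do this, we loop over our list of URLs, extract the content, collapse all the pages into one text per PDF, and then pull out the relevant information with a regex.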

urls <- c(pdftools = "https://cran.r-project.org/web/packages/pdftools/pdftools.pdf",
          Rcpp     = "https://cran.r-project.org/web/packages/Rcpp/Rcpp.pdf",
          jpeg     = "https://cran.r-project.org/web/packages/jpeg/jpeg.pdf")

lapply(urls, function(url) {
  list_output <- pdftools::pdf_text(url)
  text_output <- gsub('(\\s|\r|\n)+', ' ', paste(unlist(list_output), collapse = " "))
  trimws(regmatches(text_output, gregexpr("(?<=Author).*?(?=Maintainer)", text_output, perl = TRUE))[[1]][1])
})

# $pdftools
# [1] "Jeroen Ooms"
# 
# $Rcpp
# [1] "Dirk Eddelbuettel, Romain Francois, JJ Allaire, Kevin Ushey, Qiang Kou, Nathan Russell, Douglas Bates and John Chambers"
# 
# $jpeg
# [1] "Simon Urbanek <Simon.Urbanek@r-project.org>"
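
The same approach can be pointed at the OP's folder of articles. The following is a sketch under assumptions rather than a tested pipeline: `extract_between` and `pdf_abstract` are hypothetical helper names, the marker words are assumed to contain no regex metacharacters, and every article is assumed to use the headings Abstract and Introduction.

```r
# Helper (hypothetical name): text between two marker words, or NA if absent.
# 'start' and 'end' are assumed to be plain words without regex metacharacters.
extract_between <- function(text, start, end) {
  pattern <- paste0("(?<=", start, ").*?(?=", end, ")")
  hit <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(hit) == 0) NA_character_ else trimws(hit)
}

# Hypothetical wrapper: read one PDF with pdftools and return its abstract
pdf_abstract <- function(path) {
  pages <- pdftools::pdf_text(path)                 # one string per page
  text  <- gsub("\\s+", " ", paste(pages, collapse = " "))
  extract_between(text, "Abstract", "Introduction")
}

# Usage, assuming 'dest' is the folder of PDFs and pdftools is installed:
# pdfs <- list.files(dest, pattern = "\\.pdf$", full.names = TRUE)
# abstracts <- vapply(pdfs, pdf_abstract, character(1))
```

Returning `NA` for articles without both markers keeps the result the same length as the file list, so the files that need manual attention are easy to spot.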

【Comments】:
