Extract data from a pdf into a table [closed]

Posted 2023-02-14

[Title]: Extract data from a pdf into a table [closed] [Published]: 2022-01-18 18:31:11 [Question]:

[Image: pdf example]

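I want to extract species information from a large pdf file (example in the image) into a table, with one row per species and the metadata as columns. Is there a way to do this in Python or R?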

[Answer 1]:

Another approach is simply to use the pdftools library.

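My solution has two parts:

1. Put each paragraph (one species) into one row of a data.frame.
2. Separate the text of each row into meta.data columns.

Part 1: build a data frame with one species per row: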

library(pdftools)  # pdf_text()
library(stringr)   # str_split()

# get the path of the pdf:
file_name <- "species_info.pdf"
# read the text in the pdf (pdf_text() returns one string per page):
species.raw.text <- pdf_text(pdf = file_name, opw = "", upw = "")
# split each page at blank lines, so that each part corresponds to 1 species,
# and flatten the per-page list into a single character vector
species.raw.text <- unlist(str_split(species.raw.text, "\n\n"))
# build a data.frame with one column, raw.text, i.e. each row = 1 species
species.df <- data.frame(raw.text = species.raw.text, stringsAsFactors = FALSE)
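To check the splitting step without a pdf at hand, here is a minimal base-R sketch. The `pages` vector is a hypothetical stand-in for what `pdf_text()` returns (one string per page), and the species text in it is placeholder data, not taken from the real document.

```r
# Hypothetical stand-in for the output of pdf_text(): one string per page.
pages <- c(
  "Species A (Author 1900)\nGulf of Suez: -\nGulf of Aqaba: Israel.\n\nSpecies B (Author 1901)\nGulf of Suez: Egypt.\nGulf of Aqaba: -\n",
  "Species C (Author 1902)\nGulf of Suez: -\nGulf of Aqaba: -\n"
)

# Split every page at blank lines and flatten the per-page list into one vector.
blocks <- unlist(strsplit(pages, "\n\n", fixed = TRUE))

# Trim surrounding whitespace and drop blocks that end up empty.
blocks <- trimws(blocks)
blocks <- blocks[nzchar(blocks)]

# One row per block, i.e. one row per species.
species.df <- data.frame(raw.text = blocks, stringsAsFactors = FALSE)
print(nrow(species.df))
```

Note that a species paragraph that continues across a page break would end up as two blocks with this approach, so the row count is worth checking against the pdf.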

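Part 2: extract the information from the raw text into columns:

For this I used the separate() function (it is provided by tidyr) together with the dplyr pipe. I assumed every species has the same kinds of information, i.e.

- species name
- Gulf of Suez:
- Gulf of Aqaba:
- Red Sea main basin:
- General distribution:
- Remark:

I suggest this code to get what you want: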

library(dplyr)
library(tidyr)  # separate() comes from tidyr
# remove the `\n`
species.df$raw.text <- gsub("\n", " ", species.df$raw.text)
# get the meta.data
species.df <- species.df %>% 
  separate(
    col = raw.text, sep = "Gulf of Suez:", 
    into = c("species.name", "rest")) %>%
  separate(
    col = rest, sep = "Gulf of Aqaba:", 
    into = c("Gulf.of.Suez", "rest")) %>%
  separate(
    col = rest, sep = "Red Sea main basin:", 
    into = c("Gulf.of.Aqaba", "rest")) %>%
  separate(
    col = rest, sep = "General distribution:", 
    into = c("Red.Sea.main.basin", "rest")) %>%
  separate(
    col = rest, sep = "Remark:", fill = "right",
    into = c("General.distribution", "Remark"))

species.name	Gulf.of.Suez	Gulf.of.Aqaba	Red.Sea.main.basin	General.distribution	Remark
Carcharhinus albimarginatus (Rüppell 1837)	-	Israel (Baranes 2013).	Egypt (Rüppell 1837, as Carcharias albimarginatus), Sudan (Ninni 1931), Saudi Arabia (Spaet & Berumen 2015).	Red Sea, Indo-Pacific: East Africa east to Panama.	NA
Carcharhinus altimus (Springer 1950)	-	Egypt (Baranes & Ben-Tuvia 1978a), Israel (Baranes & Golani 1993).	Saudi Arabia (Spaet & Berumen 2015).	Circumglobal in tropical and warm temperate seas.	NA
Carcharhinus amboinensis (Müller & Henle 1839)	-	-	Saudi Arabia (Spaet & Berumen 2015).	Circumglobal in tropical and warm temperate seas, but not eastern Pacific.	NA
Carcharhinus brevipinna (Müller & Henle 1839)	Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna).	-	Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna and Carcharhinus maculipinnis), Saudi Arabia (Spaet & Berumen 2015).	Circumglobal in tropical and warm temperate seas, but not in the eastern Pacific.	Not a Lessepsian migrant as previously reported by Ben-Tuvia (1966) (see Golani et al. 2002).
Carcharhinus falciformis (Müller & Henle 1839)	-	-	Egypt (Gohar & Mazhar 1964, as Carcharhinus menisorrah), Saudi Arabia (Klausewitz 1959a, as Carcharhinus menisorrah; Spaet & Berumen 2015).	Circumglobal in tropical seas.	NA
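The chain of separate() calls can also be written as a loop over the field labels. The following is a minimal base-R sketch of that idea, not the answer's own method. `split_fields()` is a hypothetical helper introduced only for illustration, and the sample string is the Carcharhinus altimus entry from the table above.

```r
# Field labels in the order they appear in each species block.
labels <- c("Gulf of Suez:", "Gulf of Aqaba:", "Red Sea main basin:",
            "General distribution:", "Remark:")

# split_fields() is a hypothetical helper, not part of any package.
# It cuts `text` at each label and returns a named character vector;
# fields whose label is missing stay NA.
split_fields <- function(text, labels) {
  field.names <- c("species.name", make.names(sub(":$", "", labels)))
  out <- setNames(rep(NA_character_, length(field.names)), field.names)
  rest <- text
  slot <- 1  # index of the field that the start of `rest` belongs to
  for (i in seq_along(labels)) {
    pos <- as.integer(regexpr(labels[i], rest, fixed = TRUE))
    if (pos == -1) next  # label not present, e.g. no Remark
    out[slot] <- trimws(substr(rest, 1, pos - 1))
    rest <- substr(rest, pos + nchar(labels[i]), nchar(rest))
    slot <- i + 1
  }
  out[slot] <- trimws(rest)
  out
}

raw <- paste(
  "Carcharhinus altimus (Springer 1950) Gulf of Suez: -",
  "Gulf of Aqaba: Egypt (Baranes & Ben-Tuvia 1978a), Israel (Baranes & Golani 1993).",
  "Red Sea main basin: Saudi Arabia (Spaet & Berumen 2015).",
  "General distribution: Circumglobal in tropical and warm temperate seas."
)

# One row per input string, one column per field.
fields <- as.data.frame(
  do.call(rbind, lapply(raw, split_fields, labels = labels)),
  stringsAsFactors = FALSE
)
print(fields$species.name)
```

The loop form makes it easier to add or reorder labels, at the cost of assuming that the labels always occur in the listed order.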

[Comments]:

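Thank you for your help. The document is organised by family name (in all caps; I added another image to the original post). Do you know how I could handle this?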
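One possible way to handle the family headings, sketched in base R under a loud assumption: each family name arrives as its own block consisting only of upper-case letters, and every species block belongs to the most recent heading above it. The layout of the real pdf may differ, and the block contents below are hypothetical placeholders.

```r
# Hypothetical blocks as they might come out of Part 1; the family headings
# and species lines are placeholders, not data from the real pdf.
blocks <- c("FAMILYONE",
            "Species a (Author 1900) Gulf of Suez: - Gulf of Aqaba: -",
            "Species b (Author 1901) Gulf of Suez: - Gulf of Aqaba: -",
            "FAMILYTWO",
            "Species c (Author 1902) Gulf of Suez: - Gulf of Aqaba: -")

# A block is treated as a family heading if it contains only upper-case letters.
is.family <- grepl("^[[:upper:]]+$", trimws(blocks))

# cumsum() numbers each block by how many headings have been seen so far;
# blocks before the first heading get 0 and therefore an NA family.
group <- cumsum(is.family)
family <- c(NA_character_, trimws(blocks[is.family]))[group + 1]

# Keep only the species blocks, each tagged with its family.
species.df <- data.frame(family   = family[!is.family],
                         raw.text = blocks[!is.family],
                         stringsAsFactors = FALSE)
print(species.df$family)
```

The `family` column can then be carried through Part 2 unchanged, since separate() only touches the `raw.text` column.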
