Extract data from a pdf into a table [closed]
【Posted】: 2022-01-18 18:31:11
【Question】: (image: pdf example)
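I want to extract species information from a large pdf file (example in the image) into a list, with each species as a row and the metadata as columns. Is there a way to do this in python or R?
【Comments】: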
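【Answer 1】: Another approach is simply to use the pdftools library.
My solution has two parts:
- Put each paragraph (one species) into one row of a data.frame
- Separate the text information into meta.data columns
Part 1: put the information for 1 species in each row of the data frame: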
library(pdftools)  # provides pdf_text()
library(stringr)   # provides str_split()
# get the path of the pdf:
file_name <- "species_info.pdf"
# read the text in the pdf (one character string per page):
species.raw.text <- pdf_text(pdf = file_name, opw = "", upw = "")
# split the text into parts on blank lines, each corresponding to 1 species;
# str_split() returns one vector per page, so flatten them with unlist():
species.raw.text <- unlist(str_split(species.raw.text, "\n\n"))
# convert the character vector into a data.frame, i.e. each row = 1 species
species.df <- as.data.frame(species.raw.text)
# change the column name to raw.text
colnames(species.df) <- c("raw.text")
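To see what the splitting step does without having the pdf at hand, here is a minimal sketch on made-up text. The two `pages` strings are invented stand-ins for what pdf_text() returns (one string per page), and base R's strsplit() is used in place of str_split() so the snippet runs without extra packages. It assumes, as the answer does, that species are separated by a blank line.

```r
# Hypothetical stand-in for the output of pdf_text(): one string per page.
pages <- c(
  "Species A (Author 1900)\nGulf of Suez: -\nGulf of Aqaba: -\n\nSpecies B (Author 1901)\nGulf of Suez: -",
  "Species C (Author 1902)\nGulf of Suez: -\n"
)

# Split each page on blank lines; the result is a list with one vector per page.
parts <- strsplit(pages, "\n\n", fixed = TRUE)

# Flatten the per-page vectors into one character vector and drop empty entries.
paragraphs <- unlist(parts)
paragraphs <- paragraphs[nzchar(trimws(paragraphs))]

# One row per paragraph, i.e. one row per species.
species.df <- data.frame(raw.text = paragraphs, stringsAsFactors = FALSE)
nrow(species.df)
```

Note that a species whose text runs across a page break would end up in two rows with this approach, so it may be worth checking the rows around each page boundary.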
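Part 2: extract the information from the raw text into columns:
For this, I used the dplyr library (for the %>% pipe) and the separate() function, which comes from the tidyr package. I assume that every species has the same kinds of information, i.e. the species name followed by the labelled fields that the code splits on (Gulf of Suez, Gulf of Aqaba, Red Sea main basin, General distribution and, where present, Remark).
I suggest this code to get what you want: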
library(dplyr)  # provides the %>% pipe
library(tidyr)  # provides separate()
# replace the line breaks (`\n`) with spaces
species.df$raw.text <- gsub("\n", " ", species.df$raw.text)
# split the text at each field label to get the meta.data columns
species.df <- species.df %>%
  separate(
    col = raw.text, sep = "Gulf of Suez:",
    into = c("species.name", "rest")) %>%
  separate(
    col = rest, sep = "Gulf of Aqaba:",
    into = c("Gulf.of.Suez", "rest")) %>%
  separate(
    col = rest, sep = "Red Sea main basin:",
    into = c("Gulf.of.Aqaba", "rest")) %>%
  separate(
    col = rest, sep = "General distribution:",
    into = c("Red.Sea.main.basin", "rest")) %>%
  # `fill = "right"` gives NA in Remark when a species has no remark
  separate(
    col = rest, sep = "Remark:", fill = "right",
    into = c("General.distribution", "Remark"))
species.name | Gulf.of.Suez | Gulf.of.Aqaba | Red.Sea.main.basin | General.distribution | Remark |
---|---|---|---|---|---|
Carcharhinus albimarginatus (Rüppell 1837) | - | Israel (Baranes 2013). | Egypt (Rüppell 1837, as Carcharias albimarginatus), Sudan (Ninni 1931), Saudi Arabia (Spaet & Berumen 2015). | Red Sea, Indo-Pacific: East Africa east to Panama. | NA |
Carcharhinus altimus (Springer 1950) | - | Egypt (Baranes & Ben-Tuvia 1978a), Israel (Baranes & Golani 1993). | Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas. | NA |
Carcharhinus amboinensis (Müller & Henle 1839) | - | - | Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas, but not eastern Pacific. | NA |
Carcharhinus brevipinna (Müller & Henle 1839) | Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna). | - | Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna and Carcharhinus maculipinnis), Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas, but not in the eastern Pacific. | Not a Lessepsian migrant as previously reported by Ben-Tuvia (1966) (see Golani et al. 2002). |
Carcharhinus falciformis (Müller & Henle 1839) | - | - | Egypt (Gohar & Mazhar 1964, as Carcharhinus menisorrah), Saudi Arabia (Klausewitz 1959a, as Carcharhinus menisorrah; Spaet & Berumen 2015). | Circumglobal in tropical seas. | NA |
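The chain of separate() calls works best when every label is present in each paragraph. As a complement, here is a minimal base R sketch that looks up the position of each label instead, so a missing label simply yields NA and the values come out trimmed. split_fields() is a hypothetical helper, not part of any package, and the sample record is reassembled from the Carcharhinus altimus row of the table above. It still assumes that the labels which are present occur in the listed order.

```r
# Field labels, in the order they are assumed to appear in each paragraph.
labels <- c("Gulf of Suez:", "Gulf of Aqaba:", "Red Sea main basin:",
            "General distribution:", "Remark:")

# Hypothetical helper: split one paragraph into a named character vector.
split_fields <- function(text, labels) {
  # Position of each label in the text; -1 if the label is absent.
  pos <- vapply(labels,
                function(l) as.integer(regexpr(l, text, fixed = TRUE)),
                integer(1))
  found <- pos > 0
  starts <- pos[found] + nchar(labels[found])  # first character after the label
  ends <- c(pos[found][-1] - 1L, nchar(text))  # up to the next label or the end
  out <- setNames(rep(NA_character_, length(labels) + 1L),
                  c("species.name", labels))
  # Everything before the first label is taken to be the species name.
  out["species.name"] <- trimws(substr(text, 1L, pos[found][1] - 1L))
  out[labels[found]] <- trimws(substring(text, starts, ends))
  out
}

# Sample record reassembled from the result table (no Remark field).
rec <- paste(
  "Carcharhinus altimus (Springer 1950) Gulf of Suez: -",
  "Gulf of Aqaba: Egypt (Baranes & Ben-Tuvia 1978a), Israel (Baranes & Golani 1993).",
  "Red Sea main basin: Saudi Arabia (Spaet & Berumen 2015).",
  "General distribution: Circumglobal in tropical and warm temperate seas."
)
fields <- split_fields(rec, labels)
fields[["species.name"]]
fields[["Remark:"]]
```

To apply it to every row, something like do.call(rbind, lapply(species.df$raw.text, split_fields, labels = labels)) should give a character matrix with one row per species.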
【Comments】:
Thanks for your help. The document is organised by family name (all in capitals; I added another image to the original post). Do you know how to handle this?
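One possible way to handle the family names, sketched under a strong assumption: after the blank-line split, each family name sits in a paragraph of its own and consists only of capital letters. The second image is not reproduced here, so the real layout may differ, and the `paragraphs` vector below is invented placeholder data. The idea is to flag the all-capital paragraphs and carry the most recent one forward as a `family` column.

```r
# Hypothetical paragraphs after the blank-line split (placeholder names).
paragraphs <- c("FAMILYA",
                "Genus one (Author 1900) Gulf of Suez: -",
                "Genus two (Author 1901) Gulf of Suez: -",
                "FAMILYB",
                "Genus three (Author 1902) Gulf of Suez: -")

# A paragraph counts as a family heading if it contains only capital letters.
is_family <- grepl("^[[:upper:]]+$", trimws(paragraphs))

# Running count of headings seen so far (0 = no heading yet).
group <- cumsum(is_family)
headings <- trimws(paragraphs[is_family])

# Carry the most recent heading forward to every paragraph below it.
family <- ifelse(group > 0, headings[pmax(group, 1L)], NA_character_)

# Keep only the species paragraphs, now labelled with their family.
species.df <- data.frame(family = family[!is_family],
                         raw.text = paragraphs[!is_family],
                         stringsAsFactors = FALSE)
species.df$family
```

The `raw.text` column can then go through the same separate() steps as before, since the extra `family` column is left untouched by them.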