Extract data from a pdf into a table [closed]

Posted: 2022-01-18 18:31:11

【Question】:

[image: pdf example]

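I want to extract the species information from a large pdf file (example in the image) into a list, with each species as a row and the metadata as columns. Is there a way to do this in Python or R?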

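【Answer 1】:

Another way is simply to use the pdftools library.

My solution has two parts:

    1. Put each paragraph (one species) into one row of a data.frame
    2. Separate the text information into metadata columns

Part 1: build a data frame with one species per row: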

library(pdftools)  # pdf_text()
library(stringr)   # str_split()
# path of the pdf:
file_name <- "species_info.pdf"
# read the text in the pdf (a character vector with one element per page):
species.raw.text <- pdf_text(pdf = file_name, opw = "", upw = "")
# split the text into parts, each corresponding to 1 species;
# unlist() flattens the per-page list into a single character vector
species.raw.text <- unlist(str_split(species.raw.text, "\n\n"))
# convert the vector into a data.frame, i.e. each row = 1 species,
# with a single column named raw.text
species.df <- data.frame(raw.text = species.raw.text, stringsAsFactors = FALSE)
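One thing the split above does not cover is a species paragraph that continues on the next page, because pdf_text() returns one string per page. A minimal sketch of one way to handle this, assuming records are separated by blank lines and that a page break never coincides with a blank line. The `pages` vector below is hypothetical stand-in data for the output of pdf_text(), so the snippet runs without a PDF:

```r
# Hypothetical page texts standing in for the output of pdf_text();
# the second species record is split across the page break.
pages <- c(
  "Species one (Author 1900)\nGulf of Suez: -\n\nSpecies two (Author 1901)\nGulf of Suez: -",
  "Gulf of Aqaba: -\n\nSpecies three (Author 1902)\nGulf of Suez: -"
)

# Join the pages with a single newline so that a record continuing
# on the next page stays in one paragraph.
full.text <- paste(pages, collapse = "\n")

# Split on blank lines (two or more consecutive newlines).
paragraphs <- strsplit(full.text, "\n{2,}")[[1]]

# Drop surrounding whitespace and any empty paragraphs.
paragraphs <- trimws(paragraphs)
paragraphs <- paragraphs[nzchar(paragraphs)]

# One row per species, same layout as the data frame built above.
species.df <- data.frame(raw.text = paragraphs, stringsAsFactors = FALSE)
nrow(species.df)
```

With a real document, page headers, footers, or trailing blank lines at the end of a page may need to be stripped first, otherwise they end up inside or between records.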

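Part 2: extract the information from the raw text into columns:

For this I used the dplyr library and the separate() function (which is provided by tidyr). I assumed that every species has the same kinds of information, i.e.

    Species name
    Gulf of Suez:
    Gulf of Aqaba:
    Red Sea main basin:
    General distribution:
    Remark:

I suggest this code to get what you want: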

library(dplyr)  # %>%
library(tidyr)  # separate()
# remove the `\n`
species.df$raw.text <- gsub("\n", " ", species.df$raw.text)
# get the meta.data
species.df <- species.df %>% 
  separate(
    col = raw.text, sep = "Gulf of Suez:", 
    into = c("species.name", "rest")) %>%
  separate(
    col = rest, sep = "Gulf of Aqaba:", 
    into = c("Gulf.of.Suez", "rest")) %>%
  separate(
    col = rest, sep = "Red Sea main basin:", 
    into = c("Gulf.of.Aqaba", "rest")) %>%
  separate(
    col = rest, sep = "General distribution:", 
    into = c("Red.Sea.main.basin", "rest")) %>%
  separate(
    col = rest, sep = "Remark:", fill = "right",
    into = c("General.distribution", "Remark"))

| species.name | Gulf.of.Suez | Gulf.of.Aqaba | Red.Sea.main.basin | General.distribution | Remark |
|---|---|---|---|---|---|
| Carcharhinus albimarginatus (Rüppell 1837) | - | Israel (Baranes 2013). | Egypt (Rüppell 1837, as Carcharias albimarginatus), Sudan (Ninni 1931), Saudi Arabia (Spaet & Berumen 2015). | Red Sea, Indo-Pacific: East Africa east to Panama. | NA |
| Carcharhinus altimus (Springer 1950) | - | Egypt (Baranes & Ben-Tuvia 1978a), Israel (Baranes & Golani 1993). | Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas. | NA |
| Carcharhinus amboinensis (Müller & Henle 1839) | - | - | Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas, but not eastern Pacific. | NA |
| Carcharhinus brevipinna (Müller & Henle 1839) | Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna). | - | Egypt (Gohar & Mazhar 1964, as Aprionodon brevipinna and Carcharhinus maculipinnis), Saudi Arabia (Spaet & Berumen 2015). | Circumglobal in tropical and warm temperate seas, but not in the eastern Pacific. | Not a Lessepsian migrant as previously reported by Ben-Tuvia (1966) (see Golani et al. 2002). |
| Carcharhinus falciformis (Müller & Henle 1839) | - | - | Egypt (Gohar & Mazhar 1964, as Carcharhinus menisorrah), Saudi Arabia (Klausewitz 1959a, as Carcharhinus menisorrah; Spaet & Berumen 2015). | Circumglobal in tropical seas. | NA |
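The chained separate() calls rely on every label being present in every record. As a possible alternative, here is a minimal base R sketch that looks up each label's position and cuts the text between consecutive labels, leaving NA for labels that are missing. The helper name `split_record` is hypothetical and not part of the original answer; the sample record is taken from the table above:

```r
# Labels assumed to introduce each field of a species record.
labels <- c("Gulf of Suez:", "Gulf of Aqaba:", "Red Sea main basin:",
            "General distribution:", "Remark:")

# Hypothetical helper: split one record into named fields.
split_record <- function(txt, labels) {
  # Start position of each label in the record (-1 when it is absent).
  pos <- vapply(labels, function(l) regexpr(l, txt, fixed = TRUE)[1], integer(1))
  found <- which(pos > 0)
  out <- setNames(rep(NA_character_, length(labels) + 1),
                  c("species.name", labels))
  if (length(found) == 0) {
    out["species.name"] <- trimws(txt)
    return(out)
  }
  # Text before the first label is the species name.
  out["species.name"] <- trimws(substr(txt, 1, min(pos[found]) - 1))
  # Each field runs from the end of its label to the start of the next label.
  ord <- found[order(pos[found])]
  starts <- pos[ord] + nchar(labels[ord])
  ends <- c(pos[ord][-1] - 1, nchar(txt))
  out[labels[ord]] <- trimws(substring(txt, starts, ends))
  out
}

rec <- paste(
  "Carcharhinus altimus (Springer 1950)",
  "Gulf of Suez: -",
  "Gulf of Aqaba: Egypt (Baranes & Ben-Tuvia 1978a), Israel (Baranes & Golani 1993).",
  "Red Sea main basin: Saudi Arabia (Spaet & Berumen 2015).",
  "General distribution: Circumglobal in tropical and warm temperate seas."
)
fields <- split_record(rec, labels)
fields[["species.name"]]
```

Applied to a whole column, the per-record vectors can be stacked with do.call(rbind, lapply(species.df$raw.text, split_record, labels = labels)), which yields a character matrix with one row per species.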

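【Comments】:

Thanks for your help. The document is organized by family name (all in upper case; I added another image to the original post). Do you know how to handle that?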
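One possible way to handle such headings, sketched under the assumption that each family name ends up as its own paragraph consisting only of capital letters once the text has been split. The actual layout is only visible in the image, so this is a guess. The `paragraphs` vector and its family and species names are hypothetical placeholders:

```r
# Hypothetical paragraphs: family headings in capitals, each followed
# by the species records that belong to it.
paragraphs <- c(
  "FAMILYONE",
  "Genus alpha (Author 1900) Gulf of Suez: -",
  "Genus beta (Author 1901) Gulf of Suez: -",
  "FAMILYTWO",
  "Genus gamma (Author 1902) Gulf of Suez: -"
)

# A paragraph made up only of upper-case letters and whitespace
# is treated as a family heading.
is.family <- grepl("^[[:upper:][:space:]]+$", paragraphs)

# Running count of headings seen so far: 0 before the first heading,
# 1 after the first, and so on.
family.id <- cumsum(is.family)

# Look up the most recent heading for every paragraph
# (NA for records that appear before any heading).
family <- c(NA_character_, paragraphs[is.family])[family.id + 1]

# Keep only the species records, with their family as an extra column.
species.df <- data.frame(
  family = family[!is.family],
  raw.text = paragraphs[!is.family],
  stringsAsFactors = FALSE
)
species.df$family
```

The raw.text column can then be split into metadata columns in the same way as before, with the family column carried along unchanged.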
