Extract Data from PDF and Add to Worksheet

Posted 2023-02-24

【Title】Extract Data from PDF and Add to Worksheet 【Posted】2016-07-16 04:17:48 【Question】:

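I am trying to extract data from a PDF document into a worksheet. The PDF displays its text, which can be manually copied and pasted into an Excel document.

I am currently doing this through SendKeys, but it is not working. I get an error when I try to paste the data from the PDF document. Why is my paste not working? If I paste after the macro has stopped running, it pastes normally.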

Dim myPath As String, myExt As String
Dim ws As Worksheet
Dim openPDF As Object
'Dim pasteData As MSForms.DataObject
Dim fCell As Range

'Set pasteData = New MSForms.DataObject
Set ws = Sheets("DATA")
If ws.Cells(ws.Rows.Count, "A").End(xlUp).Row > 1 Then Range("A3:A" & ws.Cells(ws.Rows.Count, "A").End(xlUp).Row).ClearContents

myExt = "\*.pdf"
'When Scan Receipts Button Pressed Scan the selected folder/s for receipts
For Each fCell In Range(ws.Cells(1, 1), ws.Cells(1, ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column))
    myPath = Dir(fCell.Value & myExt)
    Do While myPath <> ""
        myPath = fCell.Value & "\" & myPath
        Set openPDF = CreateObject("Shell.Application")
        openPDF.Open (myPath)
        Application.Wait Now + TimeValue("00:00:2")
        SendKeys "^a"
        Application.Wait Now + TimeValue("00:00:2")
        SendKeys "^c"
        'Application.Wait Now + TimeValue("00:00:2")
        ws.Select
        ActiveSheet.Paste
        'pasteData.GetFromClipboard

        'ws.Cells(3, 1) = pasteData.GetText
        Exit Sub

        myPath = Dir
    Loop

Next fCell

【Comments】:

As someone who ran into this problem and searched further, I found a valid reference link as an alternative approach.

【Answer 1】:

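You can open the PDF file and extract its contents using the Adobe library (I believe you can download it from Adobe as part of the SDK, but it also ships with certain versions of Acrobat).

Make sure you also add the library to your references (on my machine it is the Adobe Acrobat 10.0 Type Library, but I am not sure whether that is the latest version).

Even with the Adobe library it is not trivial (you will need to add your own error trapping, etc.):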

Function getTextFromPDF(ByVal strFilename As String) As String
   Dim objAVDoc As New AcroAVDoc
   Dim objPDDoc As New AcroPDDoc
   Dim objPage As AcroPDPage
   Dim objSelection As AcroPDTextSelect
   Dim objHighlight As AcroHiliteList
   Dim pageNum As Long
   Dim tCount As Long
   Dim strText As String

   strText = ""
   If objAVDoc.Open(strFilename, "") Then
      Set objPDDoc = objAVDoc.GetPDDoc
      For pageNum = 0 To objPDDoc.GetNumPages() - 1
         Set objPage = objPDDoc.AcquirePage(pageNum)
         Set objHighlight = New AcroHiliteList
         objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page
         Set objSelection = objPage.CreatePageHilite(objHighlight)

         If Not objSelection Is Nothing Then
            For tCount = 0 To objSelection.GetNumText - 1
               strText = strText & objSelection.GetText(tCount)
            Next tCount
         End If
      Next pageNum
      objAVDoc.Close 1
   End If

   getTextFromPDF = strText

End Function

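This does essentially the same thing you are trying to do, only using Adobe's own library. It goes through the PDF one page at a time, highlights all the text on the page, and then drops it (one text element at a time) into a string.

Keep in mind that what you get from this may be full of all kinds of non-printing characters (line feeds, newlines, etc.), which can even turn up in the middle of what looks like a continuous block of text, so you may need additional code to clean it up before you can use it.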
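One possible shape for that cleanup step is sketched below. The function name CleanExtractedText is hypothetical, and so is the cleanup policy: control characters become spaces and runs of spaces are collapsed. Real documents may need different rules, for example keeping line breaks.

```vba
Option Explicit

' Hypothetical helper: replaces control characters (code points below 32,
' plus 127) with spaces, then collapses runs of spaces into one.
Function CleanExtractedText(ByVal rawText As String) As String
    Dim i As Long
    Dim code As Long
    Dim ch As String
    Dim result As String

    For i = 1 To Len(rawText)
        ch = Mid$(rawText, i, 1)
        ' AscW returns a signed Integer; mask it to get 0..65535.
        code = AscW(ch) And &HFFFF&
        If code < 32 Or code = 127 Then
            result = result & " "
        Else
            result = result & ch
        End If
    Next i

    ' Collapse the double spaces left behind by the replacement above.
    Do While InStr(result, "  ") > 0
        result = Replace(result, "  ", " ")
    Loop

    CleanExtractedText = Trim$(result)
End Function
```

It could be applied to the extracted string before anything is written to the sheet, e.g. CleanExtractedText(getTextFromPDF(somePath)), where somePath stands for whatever file is being read.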

Hope that helps!

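【Comments】:

The Adobe library is definitely the way to go. One thing to note is that it can be more powerful than what you have shown, as it can get text from individual blocks such as text boxes or other objects, which may (or may not) be cleaner or more helpful to the OP than just grabbing all the text on the page.

Brilliant answer! In my Excel setup I had to add a reference library named "Adobe" to get it to work. By the way, you have a typo on the line If (objAvDoc.Open(strFilename, "") . It should be If (objAvDoc.Open(strFilename, "")). If you use Option Explicit (as you should), you also need to add this line to your variable block: Dim tCount As Long.

Thanks @leowyn for this resource! It looks like the SDK is only available with a license from Adobe. Is there a way to get it without paying $$$$ for the Pro version?

Small typo: If (objAvDoc.Open(strFilename, "") Then >> If objAvDoc.Open(strFilename, "") Then - but otherwise it works like a charm, no add-ons needed.

No additional installation is needed; the SDK now seems to come with Reader, and apparently you only need to add "Acrobat" under Tools | References in the VBA project.

You need to have Adobe Acrobat Pro installed; Adobe Reader will not work.

【Answer 2】: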

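I know this is an old question, but I just had to do this for a project at work, and I am surprised that nobody has suggested this solution yet: just open the .pdf with Microsoft Word.

The code is much easier to work with when you are extracting data from a .docx, because it opens in Microsoft Word. Excel and Word work well together because they are both Microsoft programs. In my case, the file in question had to be a .pdf file. Here is the solution I came up with:

1. Choose Microsoft Word as the default program for opening .pdf files.
2. The first time you open a .pdf file with Word, a dialog box pops up stating that Word needs to convert the .pdf into a .docx file. Tick the "Do not show this message again" checkbox in the bottom-left corner, then click OK.
3. Create a macro that extracts data from the .docx file. I used MikeD's Code as a resource for this.
4. Modify the MoveDown, MoveRight, and Find.Execute methods to fit the needs of your task.

Yes, you could just convert the .pdf file to a .docx file, but I think this is a much simpler solution.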
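A rough sketch of how this could look when driven from Excel follows. It is not the answerer's macro: it assumes a Word version that can open PDFs (2013 or later), uses late binding so no reference is needed, and simply writes one Word paragraph per worksheet row. The names ImportPdfViaWord and ParagraphsFromWordText are hypothetical, and whether DisplayAlerts suppresses the conversion prompt on every setup is an assumption.

```vba
Option Explicit

' Hypothetical helper: Word separates paragraphs with vbCr,
' so splitting on it yields one array element per paragraph.
Function ParagraphsFromWordText(ByVal wordText As String) As Variant
    ParagraphsFromWordText = Split(wordText, vbCr)
End Function

' Opens the PDF in a hidden Word instance and writes each
' paragraph into column A of the target sheet, starting at row 1.
Sub ImportPdfViaWord(ByVal pdfPath As String, ByVal target As Worksheet)
    Dim wordApp As Object
    Dim doc As Object
    Dim parts As Variant
    Dim i As Long

    Set wordApp = CreateObject("Word.Application")
    wordApp.Visible = False
    wordApp.DisplayAlerts = 0   ' wdAlertsNone; assumed to hide the conversion prompt

    Set doc = wordApp.Documents.Open(FileName:=pdfPath, _
                                     ConfirmConversions:=False, _
                                     ReadOnly:=True)
    parts = ParagraphsFromWordText(doc.Content.Text)

    doc.Close False             ' close without saving
    wordApp.Quit

    For i = LBound(parts) To UBound(parts)
        target.Cells(i + 1, 1).Value = parts(i)
    Next i
End Sub
```

Reading doc.Content.Text in one go trades away the cursor-based navigation (MoveDown, MoveRight, Find.Execute) mentioned in the steps, so it only suits cases where the whole text is wanted.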

【Comments】:

I get an error saying the PDF exceeds the maximum page size supported by Word. Any idea?

【Answer 3】:

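Over time, I have found that extracting text from PDFs in a structured format is a tough job. However, if you are looking for a simple solution, you may want to consider the XPDF tool pdftotext.

Pseudocode for extracting the text consists of:

1. Use the VBA Shell statement to run pdftotext and convert the PDF to a temporary text file.
2. Read the temporary file line by line and write each line to Excel.

Simplified example below: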

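    Sub ReadIntoExcel(PDFName As String)
        'Convert PDF to text. The built-in Shell statement returns immediately,
        'so WScript.Shell.Run is used with bWaitOnReturn = True to wait until
        'pdftotext has finished. Paths are quoted in case they contain spaces.
        Dim TempFile As String
        TempFile = Environ$("TEMP") & "\tempfile.txt"
        CreateObject("WScript.Shell").Run _
            """C:\Utils\pdftotext.exe"" -layout """ & PDFName & """ """ & TempFile & """", 0, True

        'Read in the text file and write to Excel
        Dim TextLine As String
        Dim RowNumber As Long
        Dim F1 As Integer
        RowNumber = 1
        F1 = FreeFile()
        Open TempFile For Input As #F1
            While Not EOF(F1)
                Line Input #F1, TextLine
                ThisWorkbook.Worksheets(1).Cells(RowNumber, 1).Value = TextLine
                RowNumber = RowNumber + 1
            Wend
        Close #F1
    End Sub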

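【Comments】:

【Answer 4】:

Copying and pasting by simulating user interaction can be unreliable (for example, a popup may appear and steal focus). You may be interested in trying the commercial ByteScout PDF Extractor SDK, which is designed specifically for extracting data from PDFs and works from VBA. It can also extract data from invoices and tables as CSV using VB code.

Here is VBA code for Excel that extracts text from a given location and saves it into cells on Sheet1: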

Private Sub CommandButton1_Click()

' Create TextExtractor object
' Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
Dim extractor As New Bytescout_PDFExtractor.TextExtractor

extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile ("c:\sample1.pdf")

' Get page count
pageCount = extractor.GetPageCount()

Dim wb As Workbook
Dim ws As Worksheet
Dim TxtRng  As Range

Set wb = ActiveWorkbook
Set ws = wb.Sheets("Sheet1")

For i = 0 To pageCount - 1
            RectLeft = 10
            RectTop = 10
            RectWidth = 100
            RectHeight = 100

            ' restrict extraction to the rectangle defined above
            extractor.SetExtractionArea RectLeft, RectTop, RectWidth, RectHeight
            ' extract text from given area
            extractedText = extractor.GetTextFromPage(i)

            ' insert rows
            ' Rows(1).Insert shift:=xlShiftDown
            ' write cell value
             Set TxtRng = ws.Range("A" & CStr(i + 2))
             TxtRng.Value = extractedText

Next

Set extractor = Nothing


End Sub

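Disclosure: I am affiliated with ByteScout.

【Comments】:

【Answer 5】:

Using the Bytescout PDF Extractor SDK is a good option. It is cheap and provides plenty of PDF-related functionality. One of the answers above points to a dead Bytescout page on GitHub. I am providing a relevant working example for extracting a table from a PDF. You can use it to export in any format.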

Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor")

extractor.RegistrationName = "demo"
extractor.RegistrationKey = "demo"

' Load sample PDF document
extractor.LoadDocumentFromFile "../../sample3.pdf"

For ipage = 0 To extractor.GetPageCount() - 1 

    ' starting extraction from page #"
    extractor.PrepareStructure ipage

    rowCount = extractor.GetRowCount(ipage)

    For row = 0 To rowCount - 1 
        columnCount = extractor.GetColumnCount(ipage, row)

        For col = 0 To columnCount-1
            WScript.Echo "Cell at page #" +CStr(ipage) + ", row=" & CStr(row) & ", column=" & _
                CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col)
        Next
    Next
Next

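More examples here: https://github.com/bytescout/pdf-extractor-sdk-samples

【Comments】:

【Answer 6】:

Since I prefer not to rely on external libraries and/or other programs, I extended your solution so that it works. The actual change here is using the GetFromClipboard function instead of the Paste function, which is mainly meant for pasting a range of cells. The downside, of course, is that the user must not change focus or intervene during the whole process.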

Dim pathPDF As String, textPDF As String
Dim openPDF As Object
Dim objPDF As MsForms.DataObject

pathPDF = "C:\some\path\data.pdf"
Set openPDF = CreateObject("Shell.Application")
openPDF.Open (pathPDF)
'TIME TO WAIT BEFORE/AFTER COPY AND PASTE SENDKEYS
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^a"
Application.Wait Now + TimeValue("00:00:2")
SendKeys "^c"
Application.Wait Now + TimeValue("00:00:1")

AppActivate ActiveWorkbook.Windows(1).Caption
objPDF.GetFromClipboard
textPDF = objPDF.GetText(1)
MsgBox textPDF

If you are interested, check out my project on github.

【Comments】:

【Answer 7】:

To improve on Slinky Sloth's solution, I had to add this before getting from the clipboard:

Set objPDF = New MSForms.DataObject

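Sadly, it did not work on a 10-page pdf.

【Comments】:

【Answer 8】:

This does not seem to work with the Adobe Type Library. As soon as it gets to Open, I get a 429 error. Acrobat works fine, though...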
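Run-time error 429 is VBA's "ActiveX component can't create object" error, which usually means the requested COM class is not registered on the machine. As a hedged sketch, a check like the following could be run before calling any Acrobat code. The function name is hypothetical, and "AcroExch.App" is assumed to be the ProgID registered by the installed Acrobat edition.

```vba
Option Explicit

' Hypothetical helper: returns True if the Acrobat automation object
' can be created, False if creation fails (for example with error 429).
Function IsAcrobatAutomationAvailable() As Boolean
    Dim acroApp As Object

    On Error Resume Next
    Set acroApp = CreateObject("AcroExch.App")
    IsAcrobatAutomationAvailable = (Err.Number = 0) And Not (acroApp Is Nothing)
    Err.Clear
    On Error GoTo 0

    Set acroApp = Nothing
End Function
```

If this returns False, the type library may be present while the automation server itself is not, which would be consistent with the comments on Answer 1 saying that Adobe Reader alone is not enough.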

【Comments】: