I can not extract data from an embedded PDF (Ruby)

Posted: 2013-12-11 11:44:56

Question:

I am trying to extract text from a PDF that is embedded in a web page. I tried the pdf-reader gem, but it fails with a parse error:

`find_first_xref_offset': PDF does not contain EOF marker (PDF::Reader::MalformedPDFError)
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:99:in `load_offsets'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:60:in `initialize'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `new'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `initialize'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `new'
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `initialize'
from role.rb:5:in `new'
from role.rb:5:in `<main>'

this is the file

Does anyone know how I can solve this? Is there a better gem for this purpose?

Thanks

Answer 1:

I found this while searching Google for your problem. It might give you an approach you can use to solve it.

#################################################################
# Extract text from a PDF file
# This scraper takes about 2 minutes to run and no output
# appears until the end.
#################################################################
# This scraper uses the pdf-reader gem.
# Documentation is at https://github.com/yob/pdf-reader#readme
# If you have problems you can ask for help at http://groups.google.com/group/pdf-reader
require 'pdf-reader'   
require 'open-uri'

##########  This section contains the callback code that processes the PDF file contents  ######
class PageTextReceiver
  attr_accessor :content, :page_counter
  def initialize
    @content = []
    @page_counter = 0
  end
  # Called when page parsing starts
  def begin_page(arg = nil)
    @page_counter += 1
    @content << ""
  end
  # record text that is drawn on the page
  def show_text(string, *params)
    @content.last << string
  end
  # there's a few text callbacks, so make sure we process them all
  alias :super_show_text :show_text
  alias :move_to_next_line_and_show_text :show_text
  alias :set_spacing_next_line_show_text :show_text
  # this final text callback takes slightly different arguments
  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String) }
  end
end
################  End of TextReceiver #############################

# If you don't have two minutes to wait you might prefer this
# smaller pdf
# pdf = open('http://www.hmrc.gov.uk/factsheets/import-export.pdf')
# pdf = open('http://www.madingley.org/uploaded/Hansard_08.07.2010.pdf') 
pdf = open('http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf')

#######  Instantiate the receiver and the reader
receiver = PageTextReceiver.new
pdf_reader = PDF::Reader.new 
#######  Now you just need to make the call to parse...
pdf_reader.parse(pdf, receiver)
#######  ...and do whatever you want with the text.  
#######  This just outputs it.
receiver.content.each { |r| puts r.strip }

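Comments:

I still have the same problem. I tried accessing the file directly via its URL, and I also tried downloading the PDF and reading it locally. This is the file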
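
The error message says the parser could not find the `%%EOF` marker, which suggests the bytes handed to pdf-reader may not be a complete PDF (for example, the HTML page that embeds the viewer, or a truncated download). A minimal sketch for checking this before parsing, using only the Ruby standard library, is below. The helper name `pdf_sanity_report` and the 1024-byte tail window are assumptions made for illustration. They are not part of pdf-reader's API.

```ruby
# Hypothetical helper (not part of pdf-reader): inspects raw bytes and
# reports whether they look like a complete PDF file.
def pdf_sanity_report(data)
  bytes = data.b                        # binary copy, so byte offsets are reliable
  head  = bytes[0, 1024].to_s           # first 1024 bytes (or fewer)
  tail  = bytes[-1024, 1024] || bytes   # last 1024 bytes, or everything if shorter

  {
    size: bytes.bytesize,
    # A PDF normally begins with a "%PDF-<version>" header line.
    starts_with_header: bytes.start_with?("%PDF-"),
    # Assumption: the end-of-file marker is expected near the end of the file.
    has_eof_marker: tail.include?("%%EOF"),
    # An HTML response usually means the URL points at a page, not the PDF.
    looks_like_html: !!(head =~ /<(!doctype|html)/i)
  }
end

# Usage (the path is a placeholder for wherever the download was saved):
#   report = pdf_sanity_report(File.binread("downloaded.pdf"))
#   p report
```

If `starts_with_header` is false or `looks_like_html` is true, the URL is probably serving a web page rather than the PDF itself, and the real PDF address would have to be located in that page's markup first. If the header is present but `has_eof_marker` is false, the download was likely cut short.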
