How to get the position / coordinates of text inside a PDF?
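[Posted]: 2020-01-28 08:45:03
[Question]:
I am currently developing a Ruby on Rails application that inserts an image at a specific position in a PDF. That position has to be determined first, so I am trying to locate the text "Customer signature" and its corresponding position in the PDF. Finding the text with the pdf-reader gem is no problem, but how do I get the position of this text so that I can draw the signature image there?
If this is not possible with the pdf-reader gem, I would also appreciate an alternative solution based on a command-line program.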
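[Comments]:
Have you seen this question (and, more importantly, its answers): ***.com/questions/22898145/…?
Thanks, that looks very good, especially if Python is already in use. In the meantime I have found a solution in Ruby that uses the pdf-reader gem: blog.peschla.net/2014/04/…
[Answer 1]:
I found the answer to my question on this site: http://blog.peschla.net/2014/04/parsing-pdf-text-with-coordinates-in-ruby/
It also works with the current pdf-reader gem.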
#! /usr/bin/ruby
require 'pdf-reader'

class CustomPageLayout < PDF::Reader::PageLayout
  attr_reader :runs

  # we need to filter duplicate characters which seem to be caused by shadowing
  def group_chars_into_runs(chars)
    # filter out duplicate chars before going on with regular logic,
    # seems to happen with shadowed text
    chars.uniq! { |val| { x: val.x, y: val.y, text: val.text } }
    super
  end
end

class PageTextReceiverKeepSpaces < PDF::Reader::PageTextReceiver
  # We must expose the characters and mediabox attributes to instantiate PageLayout
  attr_reader :characters, :mediabox

  private

  def internal_show_text(string)
    if @state.current_font.nil?
      raise PDF::Reader::MalformedPDFError, "current font is invalid"
    end
    glyphs = @state.current_font.unpack(string)
    glyphs.each_with_index do |glyph_code, index|
      # paint the current glyph
      newx, newy = @state.trm_transform(0,0)
      utf8_chars = @state.current_font.to_utf8(glyph_code)
      # apply the glyph displacement for the current glyph so the next
      # glyph will appear in the correct position
      glyph_width = @state.current_font.glyph_width(glyph_code) / 1000.0
      th = 1
      scaled_glyph_width = glyph_width * @state.font_size * th
      # modification to the original pdf-reader code which otherwise accidentally removes spaces in some cases
      # unless utf8_chars == SPACE
      @characters << PDF::Reader::TextRun.new(newx, newy, scaled_glyph_width, @state.font_size, utf8_chars)
      # end
      @state.process_glyph_displacement(glyph_width, 0, utf8_chars == SPACE)
    end
  end
end

class PDFTextProcessor
  MAX_KERNING_DISTANCE = 10 # experimental value

  # pages may specify which pages to actually parse (zero based)
  # [0, 3] will process only the first and fourth page if present
  def self.process(pdf_io, pages = nil)
    pdf_io.rewind
    reader = PDF::Reader.new(pdf_io)
    fail 'Could not find any pages in the given document' if reader.pages.empty?
    processed_pages = []
    text_receiver = PageTextReceiverKeepSpaces.new
    requested_pages = pages ? reader.pages.values_at(*pages) : reader.pages
    requested_pages.each do |page|
      unless page.nil?
        page.walk(text_receiver)
        runs = CustomPageLayout.new(text_receiver.characters, text_receiver.mediabox).runs
        # sort text runs from top left to bottom right
        # read as: if both runs are on the same line first take the leftmost, else the uppermost - (0,0) is bottom left
        runs.sort! { |r1, r2| r2.y == r1.y ? r1.x <=> r2.x : r2.y <=> r1.y }
        # group runs by lines and merge those that are close to each other
        lines_hash = {}
        runs.each do |run|
          lines_hash[run.y] ||= []
          # runs that are very close to each other are considered to belong to the same text "block"
          if lines_hash[run.y].empty? || (lines_hash[run.y].last.last.endx + MAX_KERNING_DISTANCE < run.x)
            lines_hash[run.y] << [run]
          else
            lines_hash[run.y].last << run
          end
        end
        lines = []
        lines_hash.each do |y, run_groups|
          lines << { y: y, text_groups: [] }
          run_groups.each do |run_group|
            group_text = run_group.map { |run| run.text }.join('').strip
            lines.last[:text_groups] << ({
              x: run_group.first.x,
              width: run_group.last.endx - run_group.first.x,
              text: group_text,
            }) unless group_text.empty?
          end
        end
        # consistent indexing with pages param and reader.pages selection
        processed_pages << { page: page.number, lines: lines }
      end
    end
    processed_pages
  end
end

if ARGV[0] && File.exist?(ARGV[0])
  file = File.open(ARGV[0], 'rb')
  pages = PDFTextProcessor.process(file)
  puts pages
  puts "Parsed #{pages.count} pages"
else
  puts "Cannot open file '#{ARGV[0]}' (or no file given)"
end
Sample output with text and coordinates:
{
  page: 1,
  lines: [
    {
      y: 771.4006,
      text_groups: [
        { x: 60.7191, width: 164.6489200000004, text: "Some text on the left" },
        { x: 414.8391, width: 119.76381600000008, text: "Some text on the right" }
      ]
    },
    {
      y: 750.7606,
      text_groups: [{ x: 60.7191, width: 88.51979999999986, text: "More text" }]
    }
  ]
}
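To get from this structure to the position of a specific label such as "Customer signature", one option is a small lookup over the nested hashes. The following is only a minimal sketch and not part of the pdf-reader gem: `find_text_positions` is a hypothetical helper name, and the sample data is made up to mirror the shape of the output shown above. It assumes the label falls inside a single text group. If the label is split across groups (for example because of wide spacing), it will not be found.

```ruby
# Hypothetical helper (not part of pdf-reader): walks the array of page hashes
# produced by a processor like the one in this answer and collects every text
# group whose text contains the given string.
def find_text_positions(pages, needle)
  matches = []
  pages.each do |page|
    page[:lines].each do |line|
      line[:text_groups].each do |group|
        next unless group[:text].include?(needle)
        # x and width come from the text group, y from the enclosing line
        matches << { page: page[:page], x: group[:x], y: line[:y], width: group[:width] }
      end
    end
  end
  matches
end

# Made-up sample data in the same shape as the processor output
pages = [
  {
    page: 1,
    lines: [
      { y: 771.4006, text_groups: [{ x: 60.7191, width: 164.64892, text: "Some text on the left" }] },
      { y: 120.5, text_groups: [{ x: 72.0, width: 95.0, text: "Customer signature" }] }
    ]
  }
]

positions = find_text_positions(pages, "Customer signature")
puts positions.inspect
```

As the sorting comment in the answer's code notes, (0,0) is the bottom-left corner of the page, so a tool that places the image using a top-left origin would need the `y` value converted using the page height.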