Error reading a content stream in PDF


I'm trying to capture the PostScript calls to show and write the currentfont and font size into the output as PDF text objects.

PDF file Input Postscript Program

But identify gives me an error:

$ identify pd0.pdf
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

pd0.pdf[0] PBM 612x792 612x792+0+0 16-bit Bilevel Gray 61KB 0.000u 0:00.000
pd0.pdf[1] PBM 612x792 612x792+0+0 16-bit Bilevel Gray 61KB 0.000u 0:00.000
pd0.pdf[2] PBM 612x792 612x792+0+0 16-bit Bilevel Gray 61KB 0.000u 0:00.000

And the output from Ghostscript doesn't give me the detail I need to understand the problem:

$ gsnd -dPDFDEBUG pd0.pdf
GPL Ghostscript 9.18 (2015-10-05)
Copyright (C) 2015 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
<<
/Root 1 0 R
/Size 12 >>
%Resolving: [1 0]
<<
/Type /Catalog /Pages 2 0 R
>>
endobj
%Resolving: [2 0]
<<
/Kids [
3 0 R
6 0 R
9 0 R
]
/Type /Pages /Count 3 >>
endobj
%Resolving: [3 0]
<<
/Parent 2 0 R
/Contents [
5 0 R
]
/MediaBox [
0.0 0.0 612.0 792.0 ]
/Resources <<
/Font <<
/F1 4 0 R
>>
/ProcSet [
/PDF /Text ]
>>
/Type /Page >>
endobj
%Resolving: [6 0]
<<
/Parent 2 0 R
/Contents [
8 0 R
]
/MediaBox [
0.0 0.0 612.0 792.0 ]
/Resources <<
/Font <<
/F2 7 0 R
>>
/ProcSet [
/PDF /Text ]
>>
/Type /Page >>
endobj
%Resolving: [9 0]
<<
/Parent 2 0 R
/Contents [
11 0 R
]
/MediaBox [
0.0 0.0 612.0 792.0 ]
/Resources <<
/Font <<
/F3 10 0 R
>>
/ProcSet [
/PDF /Text ]
>>
/Type /Page >>
endobj
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [1 0]
%Resolving: [2 0]
Processing pages 1 through 3.
Page 1
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [3 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [5 0]
<<
/Length 15660 >>
stream
%FilePosition: 471
endobj
BT
F1
10.0 Tf
%Resolving: [4 0]
<<
/Type /Font /SubType /Type1 /BaseFont /Palatino-Roman >>
endobj
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.
Page 2
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [3 0]
%Resolving: [6 0]
%Resolving: [6 0]
%Resolving: [6 0]
%Resolving: [6 0]
%Resolving: [6 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [8 0]
<<
/Length 31667 >>
stream
%FilePosition: 16474
endobj
BT
F2
10.0 Tf
%Resolving: [7 0]
<<
/Type /Font /SubType /Type1 /BaseFont /Palatino-Roman >>
endobj
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.
Page 3
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [3 0]
%Resolving: [6 0]
%Resolving: [9 0]
%Resolving: [9 0]
%Resolving: [9 0]
%Resolving: [9 0]
%Resolving: [9 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [1 0]
%Resolving: [2 0]
%Resolving: [11 0]
<<
/Length 8335 >>
stream
%FilePosition: 48487
endobj
BT
F3
10.0 Tf
%Resolving: [10 0]
<<
/Type /Font /SubType /Type1 /BaseFont /Palatino-Roman >>
endobj
   **** Error reading a content stream. The page may be incomplete.
   **** File did not complete the page properly and may be damaged.

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

GS>

Can anyone help me understand what is wrong with the PDF file I'm producing?

Answer

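There are numerous errors in the PDF. Depending on the PDF viewer in question, a smaller or larger subset of them has to be fixed before the PDF displays as intended.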

page content streams

The page content streams look like this:

BT F1 10.0 Tf 30.0 750.0 Td (<< ) Tj ET BT F1 10.0 Tf 50.0 738.0 Td (/) Tj ET [...]

The error here is in the font selection instruction:

F1 10.0 Tf

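The font name operand F1 is not given as a PDF name object (recognizable by its leading slash) but as a bare token of the kind normally reserved for operators. The instruction should read /F1 10.0 Tf.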

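(Also, these content streams are needlessly bloated: most individual text objects draw only one to three glyphs, and each one carries its own, always identical, font selection instruction. That is not an error in itself, but it is entirely unnecessary.)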
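To illustrate both points, here is a minimal Python sketch of how a generator might emit a text object with the font operand written as a name and only one Tf per BT ... ET block. The helper names (num, pdf_string, text_object) are hypothetical and not taken from the asker's PostScript program; the sketch assumes absolute page positions as input and converts them to the relative offsets that Td expects.

```python
def num(value: float) -> str:
    """Format a number for a content stream without exponent notation."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def pdf_string(text: str) -> str:
    """Return text as a PDF literal string, escaping backslash and parentheses."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def text_object(font: str, size: float, runs) -> str:
    """Build one BT ... ET text object with a single font selection.

    font is the resource name without the slash (e.g. "F1").
    runs is a list of (x, y, text) tuples with absolute positions.
    """
    parts = ["BT", f"/{font} {num(size)} Tf"]  # leading slash makes it a name object
    prev_x, prev_y = 0.0, 0.0
    for x, y, text in runs:
        # Td moves relative to the start of the current line, so emit deltas.
        parts.append(f"{num(x - prev_x)} {num(y - prev_y)} Td")
        parts.append(f"{pdf_string(text)} Tj")
        prev_x, prev_y = x, y
    parts.append("ET")
    return " ".join(parts)


if __name__ == "__main__":
    print(text_object("F1", 10.0, [(30.0, 750.0, "<< "), (50.0, 738.0, "/")]))
```

Grouping consecutive strings that share a font into one text object like this avoids repeating the BT, Tf and ET instructions for every few glyphs.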

Furthermore, as @usr2564301 indicated, the stream lengths appear to be off by one.
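One way to avoid a length mismatch is to derive /Length from the very bytes that are written between the stream and endstream keywords. The following Python sketch shows that idea together with a small checker. Both function names are hypothetical, and the checker assumes an uncompressed stream with a direct /Length value and single-newline line endings; it does not handle an indirect /Length reference.

```python
import re


def stream_object(obj_num: int, data: bytes) -> bytes:
    """Serialize a stream object whose /Length is computed from the data itself."""
    header = b"%d 0 obj\n<< /Length %d >>\nstream\n" % (obj_num, len(data))
    # The end-of-line before "endstream" is not counted in /Length.
    return header + data + b"\nendstream\nendobj\n"


def declared_vs_actual(obj: bytes):
    """Return (declared /Length, bytes actually present) for one stream object."""
    declared = int(re.search(rb"/Length\s+(\d+)", obj).group(1))
    start = obj.index(b"stream\n") + len(b"stream\n")
    end = obj.rindex(b"\nendstream")
    return declared, end - start


if __name__ == "__main__":
    obj = stream_object(5, b"BT /F1 10 Tf 30 750 Td (<< ) Tj ET")
    print(declared_vs_actual(obj))
```

If the two numbers differ, the writer is probably counting a line ending that does not belong to the stream data, or omitting one that does.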

font resources

Each font resource looks like this:

<<
  /Type /Font 
  /SubType /Type1 
  /BaseFont /Palatino-Roman 
>>

First of all, as @KenS already pointed out, the correct spelling is Subtype, not SubType.

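There is another issue: a font resource dictionary this short was permitted only for the standard 14 fonts up to PDF 1.7, and is no longer permitted at all in PDF 2.0. Since Palatino-Roman is clearly not one of the standard 14 fonts, the resource is incomplete in any case.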

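According to Table 109 (Entries in a Type 1 font dictionary) of ISO 32000-2,

  • Type, Subtype, and BaseFont are universally required,
  • FirstChar, LastChar, Widths, and FontDescriptor are required, except that they are optional for the standard 14 fonts in PDF 1.0-1.7,
  • Name is required in PDF 1.0, optional in PDF 1.1 through 1.7, and deprecated in PDF 2.0, and
  • Encoding and ToUnicode are universally optional.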

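Depending on the PDF viewer you try, the requirements may appear more lax, but if you do not meet the requirements of the specification, any PDF processor may reject your PDF without further ado.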
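The required-entry rules listed above can be turned into a simple check. The Python sketch below models a font dictionary as a plain dict whose keys are the PDF names without the slash. That representation and the function name are assumptions made for illustration, not part of any PDF library. Because PDF names are case-sensitive, a misspelled SubType key shows up as a missing Subtype.

```python
STANDARD_14 = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})

ALWAYS_REQUIRED = ("Type", "Subtype", "BaseFont")
REQUIRED_UNLESS_STANDARD_14 = ("FirstChar", "LastChar", "Widths", "FontDescriptor")


def missing_type1_entries(font: dict, pdf2: bool = False) -> list:
    """List the required Type 1 font dictionary keys that are absent.

    With pdf2=False the standard 14 fonts are exempt from the metrics
    entries (PDF 1.0-1.7 rule); with pdf2=True no font is exempt.
    """
    required = list(ALWAYS_REQUIRED)
    exempt = not pdf2 and font.get("BaseFont") in STANDARD_14
    if not exempt:
        required.extend(REQUIRED_UNLESS_STANDARD_14)
    return [key for key in required if key not in font]


if __name__ == "__main__":
    font = {"Type": "Font", "SubType": "Type1", "BaseFont": "Palatino-Roman"}
    print(missing_type1_entries(font))
```

Run against the dictionary from the question, this reports Subtype plus all four metrics entries as missing.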

cross references

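@usr2564301 also mentioned that many cross-reference table entries (and also the reference to the start of the cross-reference table itself) are off by 1.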

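They indeed point not at the object number or the xref keyword but at the whitespace just before it. Since that merely means skipping whitespace before the number or keyword, many PDF processors will not notice.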
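A common way to get exact offsets is to record the current output length immediately before writing each object header and before writing the xref keyword. The Python sketch below assembles a tiny file that way and then verifies that every recorded offset lands on the first digit of the object number. The function names are hypothetical, object 1 is assumed to be the catalog, and all generation numbers are assumed to be 0.

```python
import re


def build_pdf(bodies):
    """Assemble objects 1..n and a classic xref table; return (pdf, offsets)."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for obj_num, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte position of the first digit of "N 0 obj"
        out += b"%d 0 obj\n" % obj_num + body + b"\nendobj\n"
    xref_pos = len(out)  # byte position of the "x" in "xref"
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # every entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets


def offsets_are_exact(pdf: bytes, offsets) -> bool:
    """True if every offset and startxref point at the token, not at whitespace."""
    for obj_num, offset in enumerate(offsets, start=1):
        if not pdf[offset:].startswith(b"%d 0 obj" % obj_num):
            return False
    start = int(re.search(rb"startxref\n(\d+)\n", pdf).group(1))
    return pdf[start:].startswith(b"xref")


if __name__ == "__main__":
    pdf, offsets = build_pdf([
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ])
    print(offsets, offsets_are_exact(pdf, offsets))
```

Shifting any recorded offset back by one byte makes the check fail, because the offset then lands on the line ending before the object number.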
