How to extract the title and author from a PDF with Java
Use Apache PDFBox:
// fis is an InputStream opened on the PDF file; dateFormat is a helper that is not shown in this answer
PDDocument document = PDDocument.load(fis);
PDDocumentInformation info = document.getDocumentInformation();
System.out.println("Pages: " + document.getNumberOfPages());
System.out.println("Title: " + info.getTitle());
System.out.println("Subject: " + info.getSubject());
System.out.println("Author: " + info.getAuthor());
System.out.println("Keywords: " + info.getKeywords());
System.out.println("Creator application: " + info.getCreator());
System.out.println("PDF producer: " + info.getProducer());
System.out.println("Trapped: " + info.getTrapped());
System.out.println("Created: " + dateFormat(info.getCreationDate()));
System.out.println("Modified: " + dateFormat(info.getModificationDate()));
document.close();
Answer A: A PDF is not a text file, so it cannot be extracted from directly unless you first convert it to text.
Answer B: Use the xunjie PDF editor to do the extraction.
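1 Open the PDF in the PDF tool and choose "Document" > "Extract Pages".
2 Specify the range of pages to extract.
3 In the "Extract Pages" dialog, do one or more of the following, then click "OK":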
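The PDFBox snippet in the first answer calls a `dateFormat` helper that the source never shows. PDFBox returns the creation and modification dates as `java.util.Calendar` values, which may be null when the PDF does not set them. The following is a minimal sketch of what such a helper might look like; the class name `PdfDateFormat`, the output pattern, and the "(not set)" placeholder are assumptions for illustration, not the original author's code.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

final class PdfDateFormat {
    /** Formats a Calendar in its own time zone; returns a placeholder for a missing date. */
    static String dateFormat(Calendar calendar) {
        if (calendar == null) {
            return "(not set)"; // hypothetical placeholder for PDFs without this field
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        // Keep the time zone stored with the date instead of the JVM default.
        format.setTimeZone(calendar.getTimeZone());
        return format.format(calendar.getTime());
    }

    public static void main(String[] args) {
        Calendar sample = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        sample.clear();
        sample.set(2020, Calendar.AUGUST, 18, 18, 13, 38);
        System.out.println(dateFormat(sample));
        System.out.println(dateFormat(null));
    }
}
```

Formatting in the calendar's own time zone keeps the printed value equal to what the PDF stores, whatever the default zone of the machine running the code.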
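How to reduce the size of PNG images in a PDF (compress PNG in PDF)
Original title: how to reduce the size of png image in pdf (compress png in pdf)
Posted: 2020-08-18 18:13:38
Question:
I want to reduce the size of a PDF file by replacing its high-resolution images with low-resolution ones. To do this, I have to:
- Extract the images (streams) from the PDF
- Compress the images
- Replace the images (streams) in the PDF with the compressed images
When I extract the PNG images and replace them, the transparent background turns black. To find out why, I extracted the images from the PDF. The way the PDF uses streams to store a PNG is rather odd: when I try to extract one PNG image from the PDF, I get two different images, an 8-bit image and a 24-bit color image.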
...
1 0 obj
<</Type/XObject/Subtype/Image/Width 1920/Height 1035/Length 24720/ColorSpace/DeviceGray/BitsPerComponent 8/Filter/FlateDecode>>stream
...
endstream
endobj
2 0 obj
<</Type/XObject/Subtype/Image/Width 1920/Height 1035/SMask 1 0 R/Length 47751/ColorSpace[/CalRGB<</Gamma[2.2 2.2 2.2]/Matrix[0.41239 0.21264 0.01933 0.35758 0.71517 0.11919 0.18045 0.07218 0.9504]/WhitePoint[0.95043 1 1.09]>>]/Intent/Perceptual/BitsPerComponent 8/Filter/FlateDecode>>stream
...
endstream
...
Original image (32-bit color image with a transparent background):
8-bit image:
24-bit color image:
<dependency>
<groupId>com.itextpdf</groupId>
<artifactId>itextpdf</artifactId>
<version>5.5.12</version>
</dependency>
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>2.0.16</version>
</dependency>
ImageExtractor extracts the images from a PDF file.
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.imageio.ImageIO;
import java.io.File;
import java.io.IOException;
public class ImageExtractor {
    private static final Logger log = LoggerFactory.getLogger(ImageExtractor.class);
    public void extract(File pdf, File imageDir) throws IOException {
        if (!imageDir.exists()) {
            imageDir.mkdirs();
        }
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(pdf)) {
            PDPageTree list = document.getPages();
            System.out.println("PDPageTree#count: " + list.getCount());
            int pageIndex = 1;
            for (PDPage page : list) {
                PDResources pdResources = page.getResources();
                System.out.println(pdResources.toString());
                for (COSName c : pdResources.getXObjectNames()) {
                    System.out.println("PDResources[" + pageIndex + "]#COSName: " + c.getName());
                    PDXObject o = pdResources.getXObject(c);
                    System.out.println("PDResources[" + pageIndex + "]#PDXObject: " + o.toString());
                    // https://github.com/mkl-public/testarea-itext5/blob/master/src/test/java/mkl/testarea/itext5/extract/ImageExtraction.java
                    if (o instanceof PDImageXObject) {
                        PDImageXObject img = (PDImageXObject) o;
                        File file = new File(imageDir, pageIndex + "-" + System.nanoTime() + "." + img.getSuffix());
                        ImageIO.write(img.getImage(), img.getSuffix(), file);
                    }
                }
                pageIndex++;
            }
        }
        log.info("Images have been extracted successfully! Check your images folder.");
    }
}
ReplaceHightResolutionImage is the code I use to reduce the size of the PDF.
package io.gitlab.donespeak.tutorial.pdf.reducesize.itext;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfStream;
import com.itextpdf.text.pdf.parser.PdfImageObject;
import io.gitlab.donespeak.tutorial.pdf.reducesize.imagecompress.ImageCompressor;
import io.gitlab.donespeak.tutorial.pdf.reducesize.imagecompress.SimpleCompress;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
public class ReplaceHightResolutionImage {
    private ImageCompressor compressor;
    private double quality;
    private double scale;

    public ReplaceHightResolutionImage(double quality, double scale) {
        this(quality, scale, new SimpleCompress());
    }

    public ReplaceHightResolutionImage(double quality, double scale, ImageCompressor compressor) {
        this.compressor = compressor;
        this.quality = quality;
        this.scale = scale;
    }

    public void replace(File pdf, File output) throws IOException, DocumentException {
        PdfReader reader = new PdfReader(new FileInputStream(pdf));
        int n = reader.getXrefSize();
        // Walk every object in the cross-reference table and rewrite the image streams.
        for (int i = 0; i < n; i++) {
            PdfObject object = reader.getPdfObject(i);
            PRStream stream = findImageStream(object);
            if (stream == null) {
                continue;
            }
            PdfImageObject pdfImageObject = new PdfImageObject(stream);
            BufferedImage bi = pdfImageObject.getBufferedImage();
            if (bi == null) {
                continue;
            }
            System.out.println("PdfReader#Xref: " + i + "," + pdfImageObject.getFileType());
            BufferedImage resultImage = compressor.compress(bi, pdfImageObject.getFileType(), quality, scale);
            replaceImage(stream, resultImage);
        }
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(output));
        // furtherCompress(reader, stamper);
        stamper.close();
        reader.close();
    }

    private void furtherCompress(PdfReader reader, PdfStamper stamper) throws DocumentException {
        reader.removeFields();
        reader.removeUnusedObjects();
        stamper.setFullCompression();
        stamper.getWriter().setCompressionLevel(PdfStream.DEFAULT_COMPRESSION);
    }

    private PRStream findImageStream(PdfObject object) {
        if (object == null || !object.isStream()) {
            return null;
        }
        PRStream stream = (PRStream) object;
        System.out.println(stream.getAsName(PdfName.SUBTYPE));
        if (!PdfName.IMAGE.equals(stream.getAsName(PdfName.SUBTYPE))) {
            // not an image XObject
            return null;
        }
        // Only DCTDecode (JPEG) and FlateDecode (the PNG-like case) streams are handled.
        PdfName pdfName = stream.getAsName(PdfName.FILTER);
        if (!PdfName.DCTDECODE.equals(pdfName) && !PdfName.FLATEDECODE.equals(pdfName)) {
            return null;
        }
        // if (PdfName.DCTDECODE.equals(filter))
        //     return PdfImageObject.ImageBytesType.JPG.getFileExtension();
        // else if (PdfName.JPXDECODE.equals(filter))
        //     return PdfImageObject.ImageBytesType.JP2.getFileExtension();
        // else if (PdfName.FLATEDECODE.equals(filter))
        //     return PdfImageObject.ImageBytesType.PNG.getFileExtension();
        // else if (PdfName.LZWDECODE.equals(filter))
        //     return PdfImageObject.ImageBytesType.CCITT.getFileExtension();
        return stream;
    }

    private void replaceImage(PRStream stream, BufferedImage resultImage) throws IOException {
        ByteArrayOutputStream imgBytes = new ByteArrayOutputStream();
        ImageIO.write(resultImage, "JPG", imgBytes);
        // clear() removes every entry of the original dictionary, including /SMask,
        // so the link from the color image to its transparency mask is lost here.
        stream.clear();
        stream.setData(imgBytes.toByteArray(), false, PRStream.NO_COMPRESSION);
        stream.put(PdfName.TYPE, PdfName.XOBJECT);
        stream.put(PdfName.SUBTYPE, PdfName.IMAGE);
        stream.put(PdfName.FILTER, PdfName.DCTDECODE);
        stream.put(PdfName.WIDTH, new PdfNumber(resultImage.getWidth()));
        stream.put(PdfName.HEIGHT, new PdfNumber(resultImage.getHeight()));
        stream.put(PdfName.BITSPERCOMPONENT, new PdfNumber(8));
        stream.put(PdfName.COLORSPACE, PdfName.DEVICERGB);
    }
}
package io.gitlab.donespeak.tutorial.pdf.reducesize.imagecompress;
import net.coobird.thumbnailator.Thumbnails;
import java.awt.image.BufferedImage;
import java.io.IOException;
public class ThumbnailatorCompressor implements ImageCompressor {
    @Override
    public BufferedImage compress(BufferedImage image, String imageFormat, double quality, double scale) throws IOException {
        System.out.println("ThumbnailatorCompressor#type: " + image.getType());
        // int imageType = "png".equalsIgnoreCase(imageFormat)? BufferedImage.TYPE_INT_ARGB: image.getType();
        // Scale the image and keep its original pixel type.
        return Thumbnails.of(image)
                .imageType(image.getType())
                .scale(scale)
                .outputQuality(quality)
                // .outputFormat(imageFormat)
                .useOriginalFormat()
                .asBufferedImage();
    }
}
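The question refers to an `ImageCompressor` interface and a `SimpleCompress` implementation, but neither is shown. As a rough illustration of what the `scale` and `quality` parameters do, here is a sketch that uses only the JDK: it scales a `BufferedImage` with `Graphics2D` and encodes it as JPEG at an explicit quality. The class name `JdkImageShrink` and its two methods are hypothetical and are not the author's `SimpleCompress`.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;

final class JdkImageShrink {
    /** Returns a copy of src resized by the given factor (0.5 halves width and height). */
    static BufferedImage scale(BufferedImage src, double scale) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * scale));
        int h = Math.max(1, (int) Math.round(src.getHeight() * scale));
        // Keep an alpha channel only if the source has one.
        int type = src.getColorModel().hasAlpha() ? BufferedImage.TYPE_INT_ARGB : BufferedImage.TYPE_INT_RGB;
        BufferedImage out = new BufferedImage(w, h, type);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    /** Encodes an opaque RGB image as JPEG; quality ranges from 0.0 (smallest) to 1.0 (best). */
    static byte[] toJpeg(BufferedImage rgb, float quality) {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpg").next();
        try {
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionQuality(quality);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
                writer.setOutput(out);
                writer.write(null, new IIOImage(rgb, null, null), param);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(100, 50, BufferedImage.TYPE_INT_RGB);
        BufferedImage small = scale(src, 0.5);
        System.out.println(small.getWidth() + "x" + small.getHeight());
        System.out.println(toJpeg(small, 0.8f).length + " bytes");
    }
}
```

`toJpeg` expects an image without an alpha channel. JPEG has no transparency, so an image with alpha has to be flattened or split into color and mask before it is encoded this way.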
horse.pdf
horse.png
public class ReplaceHightResolutionImageTest {
    @Test
    public void reduceWithThumbnailatorCompressor() throws IOException, DocumentException {
        double quality = 1d;
        double scale = 0.6d;
        File pdf = new File("pdf/asset/horse.pdf");
        File output = new File("pdf/target/output", "replaced-" + quality + "-" + scale);
        ReplaceHightResolutionImage replacer = new ReplaceHightResolutionImage(quality, scale, new SimpleCompress());
        replacer.replace(pdf, output);
    }
}
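Comments:
ISO 32000-1 (the PDF specification) supports only two kinds of images: JPEG and "raw bits". Any source image format that is not JPEG is therefore converted to "raw bits"; this includes BMP, TIFF, GIF and so on. A PNG consists of 4 channels: red, green, blue and alpha. RGB is essentially the same as a bitmap image; alpha is the transparency layer. So the way iText emulates a PNG in a PDF is to add not one but two images on top of each other: one image stream with the color and another image stream with the transparency mask. Those are the two images you extracted.
So calling stream.clear() in my ReplaceHightResolutionImage destroys the structure of the PNG streams and causes the black background. What should I do if I want to compress a PNG image in a PDF?
@DoneSpeak There are no PNG images in a PDF. There is JPEG, or there are raw bits. No PNG! What you should do is compress the RGB image and the alpha (8-bit) image as two separate JPEGs and put them in place of the original, uncompressed raw-bit images.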
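To make the "two images" idea concrete, here is a small JDK-only sketch that splits one ARGB `BufferedImage` into an opaque RGB color image and an 8-bit grayscale alpha mask, the two pieces that the comments say should be handled separately. The class `AlphaSplit` and its method names are made up for this illustration; writing the two results back into the PDF (and re-linking the mask through `/SMask`) is not covered.

```java
import java.awt.image.BufferedImage;

final class AlphaSplit {
    /** Copies the color channels into an opaque RGB image; the alpha byte is discarded. */
    static BufferedImage colorPart(BufferedImage argb) {
        BufferedImage rgb = new BufferedImage(argb.getWidth(), argb.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < argb.getHeight(); y++) {
            for (int x = 0; x < argb.getWidth(); x++) {
                rgb.setRGB(x, y, argb.getRGB(x, y));
            }
        }
        return rgb;
    }

    /** Copies the alpha channel into an 8-bit gray image (0 = transparent, 255 = opaque). */
    static BufferedImage alphaPart(BufferedImage argb) {
        BufferedImage mask = new BufferedImage(argb.getWidth(), argb.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < argb.getHeight(); y++) {
            for (int x = 0; x < argb.getWidth(); x++) {
                // The top 8 bits of an ARGB pixel are its alpha value.
                mask.getRaster().setSample(x, y, 0, argb.getRGB(x, y) >>> 24);
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(2, 1, BufferedImage.TYPE_INT_ARGB);
        src.setRGB(0, 0, 0x80FF0000); // half-transparent red
        src.setRGB(1, 0, 0x00000000); // fully transparent
        System.out.println(Integer.toHexString(colorPart(src).getRGB(0, 0)));
        System.out.println(alphaPart(src).getRaster().getSample(0, 0, 0));
    }
}
```

The mask is written through the raster rather than `setRGB` so that the stored gray sample is exactly the alpha value, with no color conversion in between.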
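@datenwolf Thanks for your answer. As you can see in my description, I tried compressing each stream and replacing the original stream with it. But afterwards the background of images with a transparent background turned black. That is what bothers me.
"I think there should be a way to extract the whole PNG image and replace the whole PNG image, instead of dealing with the two parts of the PNG stream separately" - Sometimes one would indeed like image extraction to add the transparency to the extracted PNG where possible. But that does not help in your use case: JPEG does not support transparency (at least not generally), so this complete-replacement approach would lose the transparency as well.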
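If losing the transparency is acceptable, one way to avoid a black background is to composite the image onto a chosen background color before encoding it as JPEG. The sketch below does this with the JDK only; `FlattenAlpha` and the white background are assumptions for illustration and are not something the thread's participants proposed.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

final class FlattenAlpha {
    /** Paints src over a solid background and returns an opaque RGB image. */
    static BufferedImage flatten(BufferedImage src, Color background) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setColor(background);
            g.fillRect(0, 0, out.getWidth(), out.getHeight());
            // Transparent pixels let the background show through.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(2, 1, BufferedImage.TYPE_INT_ARGB);
        src.setRGB(0, 0, 0x00000000); // fully transparent
        src.setRGB(1, 0, 0xFF0000FF); // opaque blue
        BufferedImage flat = flatten(src, Color.WHITE);
        System.out.println(Integer.toHexString(flat.getRGB(0, 0)));
        System.out.println(Integer.toHexString(flat.getRGB(1, 0)));
    }
}
```

This trades transparency for a predictable background; keeping the transparency still requires a separate mask image in the PDF.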
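Answer 1:
This answer works but is not good enough. It compresses both JPG and PNG well. The only drawback is that if an image is reused on many pages, every reference to it is treated as a separate stream and a new stream is generated in place of that reference, which can make the file larger.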
1 0 obj
<</Type/XObject/Subtype/Image/Width 1002/Height 564/Filter/DCTDecode/ColorSpace/DeviceRGB/BitsPerComponent 8/Length 89149>>stream
...
endstream
endobj
2 0 obj
<</Length 106/Filter/FlateDecode>>stream
...
endstream
endobj
4 0 obj
<</Type/Page/MediaBox[0 0 595 842]/Resources<</XObject<</img0 1 0 R>>>>/Contents 2 0 R/Parent 3 0 R>>
endobj
5 0 obj
<</Length 106/Filter/FlateDecode>>stream
...
endstream
endobj
6 0 obj
<</Type/Page/MediaBox[0 0 595 842]/Resources<</XObject<</img0 1 0 R>>>>/Contents 5 0 R/Parent 3 0 R>>
endobj
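The drawback described in this answer (an image shared by several pages, such as `img0 1 0 R` in the listing above, being rewritten once per reference) could in principle be avoided by remembering which underlying objects have already been processed. The sketch below shows that idea with a generic identity-keyed cache; `OncePerObject` is a hypothetical helper. With PDFBox, one might use the stream returned by `img.getCOSObject()` as the key, on the assumption that a shared image resolves to the same object on every page.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

final class OncePerObject<K, V> {
    // Keys are compared by reference (==), not by equals().
    private final Map<K, V> done = new IdentityHashMap<>();
    private final Function<K, V> work;

    OncePerObject(Function<K, V> work) {
        this.work = work;
    }

    /** Runs the work the first time a key object is seen and reuses the result afterwards. */
    V get(K key) {
        V result = done.get(key);
        if (result == null) {
            result = work.apply(key);
            done.put(key, result);
        }
        return result;
    }

    int size() {
        return done.size();
    }

    public static void main(String[] args) {
        int[] calls = {0};
        OncePerObject<Object, String> cache = new OncePerObject<>(k -> {
            calls[0]++;
            return "compressed";
        });
        Object sharedImage = new Object();
        cache.get(sharedImage); // first page that references the image
        cache.get(sharedImage); // second page, same underlying object
        System.out.println("work ran " + calls[0] + " time(s), cache size " + cache.size());
    }
}
```

In the `compress` method of the answer, the cached value would be the replacement image, so every page that referenced the old image would end up pointing at the same new one.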
package io.gitlab.donespeak.tutorial.pdf.reducesize;
import io.gitlab.donespeak.tutorial.pdf.reducesize.imagecompress.ThumbnailatorCompressor;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
public class RemoveAllImageFromPdf {

    public static void extractImages(File input, File imageDir) throws IOException {
        if (imageDir.exists()) {
            // File.delete() only removes the directory if it is empty
            imageDir.delete();
        }
        imageDir.mkdirs();
        try (PDDocument document = PDDocument.load(input)) {
            int pageIndex = 1;
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (PDPage page : catalog.getPages()) {
                PDResources pdResources = page.getResources();
                System.out.println(pdResources.toString());
                for (COSName c : pdResources.getXObjectNames()) {
                    System.out.println("PDResources[" + pageIndex + "]#COSName: " + c.getName());
                    PDXObject o = pdResources.getXObject(c);
                    System.out.println("PDResources[" + pageIndex + "]#PDXObject: " + o.toString());
                    // https://github.com/mkl-public/testarea-itext5/blob/master/src/test/java/mkl/testarea/itext5/extract/ImageExtraction.java
                    if (o instanceof PDImageXObject) {
                        PDImageXObject img = (PDImageXObject) o;
                        System.out.println(img.getSuffix() + "-" + img.getBitsPerComponent() + "-" + img.getColorSpace());
                        File file = new File(imageDir, pageIndex + "-" + c.getName() + "-" + img.getColorSpace() + "-" + System.nanoTime() + "." + img.getSuffix());
                        ImageIO.write(img.getImage(), img.getSuffix(), file);
                    }
                }
                pageIndex++;
            }
            // document.save(output);
        }
    }

    /**
     * Compresses every image XObject found in the page resources and saves the result.
     *
     * @param input the source PDF
     * @param output the file the compressed PDF is written to
     * @throws IOException if the PDF cannot be read or written
     */
    public static void compress(File input, File output) throws IOException {
        if (!output.getParentFile().exists()) {
            output.getParentFile().mkdirs();
        }
        ThumbnailatorCompressor compressor = new ThumbnailatorCompressor();
        try (PDDocument document = PDDocument.load(input)) {
            int pageIndex = 1;
            PDDocumentCatalog catalog = document.getDocumentCatalog();
            for (PDPage page : catalog.getPages()) {
                PDResources pdResources = page.getResources();
                for (COSName c : pdResources.getXObjectNames()) {
                    System.out.println("PDResources[" + pageIndex + "]#COSName: " + c.getName());
                    PDXObject o = pdResources.getXObject(c);
                    System.out.println("PDResources[" + pageIndex + "]#PDXObject: " + o.toString());
                    // https://github.com/mkl-public/testarea-itext5/blob/master/src/test/java/mkl/testarea/itext5/extract/ImageExtraction.java
                    if (o instanceof PDImageXObject) {
                        PDImageXObject img = (PDImageXObject) o;
                        BufferedImage bufferedImage = compressor.compress(img.getImage(), img.getSuffix(), 0.8, 0.5);
                        System.out.println("img(w, h): (" + img.getWidth() + "," + img.getHeight() + ")");
                        System.out.println("bufferedImage(w, h): (" + bufferedImage.getWidth() + "," + bufferedImage.getHeight() + ")");
                        // Re-embed PNG-like images losslessly and everything else as JPEG.
                        PDImageXObject imgNew;
                        if ("png".equalsIgnoreCase(img.getSuffix())) {
                            imgNew = LosslessFactory.createFromImage(document, bufferedImage);
                        } else {
                            imgNew = JPEGFactory.createFromImage(document, bufferedImage);
                        }
                        // Replace the resource entry of this page with the new image.
                        pdResources.put(c, imgNew);
                    }
                }
                pageIndex++;
            }
            document.save(output);
        }
    }
}
Working on the document's objects directly, as below, might solve the problem above. But I don't know how to replace a stream this way.
new com.itextpdf.text.pdf.PdfReader(new FileInputStream(pdf)).getPdfObject(i);
// or
org.apache.pdfbox.pdmodel.PDDocument.load(pdf).getDocument().getObjects()