How to extract hyperlink information with PDFBox
I am trying to extract hyperlink information from a PDF using PDFBox, but I am not sure how to get it:
for( Object p : pages ) {
    PDPage page = (PDPage)p;
    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;
        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());
        }
    }
}
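I want to extract both the target URL of each hyperlink and the hyperlink's text. How can I do that?
Thanks.
Answer
Use the code from the PrintURLs sample code in the source download: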
int pageNum = 0;
for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                //flip y: PDF space starts at the bottom, text space at the top
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }
            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            //the region is named after the annotation index
            stripper.addRegion( "" + j, awtRect );
        }
    }
    stripper.extractRegions( page );
    //then read back the text of each region together with the link's URI
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}
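It has two parts: one gets the URL, which is simple; the other gets the URL text, which is done by extracting the text that lies inside the annotation's rectangle.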
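The rectangle repositioning used for the text extraction can be isolated into a small helper. The following is only a sketch in plain Java, without PDFBox on the classpath: the class name LinkRegion and the method toTextSpace are hypothetical names introduced here, and the sketch covers only the unrotated case (rotation == 0) that the sample code handles explicitly.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class LinkRegion {

    // Converts a link rectangle from PDF space (origin at the bottom-left)
    // to the top-left-origin space expected by the region-based text
    // extraction in the sample above. Only valid for unrotated pages.
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float upperRightY,
                                         float width, float height,
                                         float pageHeight) {
        // The top edge of the rectangle, measured from the top of the page.
        float yFromTop = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, yFromTop, width, height);
    }

    public static void main(String[] args) {
        // Example values (assumed): a US Letter page, 792 points high.
        Rectangle2D.Float r = toTextSpace(72f, 700f, 100f, 12f, 792f);
        System.out.println(r);
    }
}
```

With PDFBox available, the arguments would presumably come from rect.getLowerLeftX(), rect.getUpperRightY(), rect.getWidth(), rect.getHeight() and page.getMediaBox().getHeight(), as in the sample code.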
Another answer
We had to get both the hyperlink information and the internal links (for example, jumps to another page). I use the code below:
int pageNum = 0;
for (PDPage page : originalPDF.getPages()) {
    pageNum++;
    List<PDAnnotation> annotations = page.getAnnotations();
    for (PDAnnotation annot : annotations) {
        if (annot instanceof PDAnnotationLink) {
            PDAnnotationLink link = (PDAnnotationLink) annot;
            // the action carries either an external URL or an internal jump
            PDAction action = link.getAction();
            // special case: some links have no action and store the destination directly
            PDDestination pDestination = link.getDestination();
            if (action != null) {
                if (action instanceof PDActionURI) {
                    // external link: get the URI
                    PDActionURI uri = (PDActionURI) action;
                    System.out.println("uri link:" + uri.getURI());
                } else if (action instanceof PDActionGoTo) {
                    // internal link: resolve the destination to a page
                    PDDestination destination = ((PDActionGoTo) action).getDestination();
                    PDPageDestination pageDestination;
                    if (destination instanceof PDPageDestination) {
                        pageDestination = (PDPageDestination) destination;
                    } else if (destination instanceof PDNamedDestination) {
                        pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) destination);
                    } else {
                        // error handling: skip this link (a break here would also drop the page's remaining links)
                        continue;
                    }
                    if (pageDestination != null) {
                        System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
                    }
                }
            } else if (pDestination != null) {
                PDPageDestination pageDestination;
                if (pDestination instanceof PDPageDestination) {
                    pageDestination = (PDPageDestination) pDestination;
                } else if (pDestination instanceof PDNamedDestination) {
                    pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) pDestination);
                } else {
                    // error handling: skip this link
                    continue;
                }
                if (pageDestination != null) {
                    System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
                }
            }
        }
    }
}
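One detail in the code above may be worth guarding against: as far as I can tell from the PDFBox 2.x Javadoc, retrievePageNumber() returns a 0-based index, or -1 when the destination type is unknown, so adding 1 unconditionally would print page 0 in that case. A minimal sketch of a guard in plain Java follows; PageNumbers and toDisplayPage are hypothetical names introduced for illustration, not PDFBox API.

```java
import java.util.OptionalInt;

// Hypothetical helper, not part of PDFBox.
class PageNumbers {

    // Turns a 0-based page index into a 1-based page number for display.
    // A negative index (assumed to mean "page not found") yields an empty result.
    static OptionalInt toDisplayPage(int zeroBasedIndex) {
        if (zeroBasedIndex < 0) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(zeroBasedIndex + 1);
    }

    public static void main(String[] args) {
        System.out.println(toDisplayPage(4));
        System.out.println(toDisplayPage(-1));
    }
}
```

In the loop above, the result of pageDestination.retrievePageNumber() could be passed through such a helper before printing, so unresolved destinations are reported separately instead of appearing as page 0.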