How to extract hyperlink information with PDFBox
Posted: 2016-11-29 23:10:00
Question:
I am trying to extract hyperlink information from a PDF using PDFBox, but I am not sure how to get it.
    // "pages" is the document's page list, obtained elsewhere
    for( Object p : pages )
    {
        PDPage page = (PDPage)p;
        List<?> annotations = page.getAnnotations();
        for( Object a : annotations )
        {
            PDAnnotation annotation = (PDAnnotation)a;
            if( annotation instanceof PDAnnotationLink )
            {
                PDAnnotationLink link = (PDAnnotationLink)annotation;
                System.out.println(link.toString());
                System.out.println(link.getDestination());
            }
        }
    }
I want to extract both the destination URL of each hyperlink and the hyperlink's text. How can this be done?
Thanks
Answer 1:
Use this code, taken from the PrintURLs sample in the PDFBox source code download:
    int pageNum = 0;
    for( PDPage page : doc.getPages() )
    {
        pageNum++;
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        List<PDAnnotation> annotations = page.getAnnotations();
        // first set up the text extraction regions, one per link annotation
        for( int j=0; j<annotations.size(); j++ )
        {
            PDAnnotation annot = annotations.get(j);
            if( annot instanceof PDAnnotationLink )
            {
                PDAnnotationLink link = (PDAnnotationLink)annot;
                PDRectangle rect = link.getRectangle();
                // need to reposition the link rectangle to match text space
                float x = rect.getLowerLeftX();
                float y = rect.getUpperRightY();
                float width = rect.getWidth();
                float height = rect.getHeight();
                int rotation = page.getRotation();
                if( rotation == 0 )
                {
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                }
                else if( rotation == 90 )
                {
                    // do nothing
                }
                Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
                stripper.addRegion( "" + j, awtRect );
            }
        }
        stripper.extractRegions( page );
        // then pair each link's URI with the text found in its region
        for( int j=0; j<annotations.size(); j++ )
        {
            PDAnnotation annot = annotations.get(j);
            if( annot instanceof PDAnnotationLink )
            {
                PDAnnotationLink link = (PDAnnotationLink)annot;
                PDAction action = link.getAction();
                String urlText = stripper.getTextForRegion( "" + j );
                if( action instanceof PDActionURI )
                {
                    PDActionURI uri = (PDActionURI)action;
                    System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
                }
            }
        }
    }
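It works in two parts: one simply gets the URL from the link's action, and the other gets the link text by extracting the text that lies inside the annotation's rectangle.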
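The rectangle repositioning step in that code can be looked at in isolation. The following is a minimal sketch, not part of the PrintURLs sample, that reproduces the same arithmetic using only the JDK. The class name `LinkRegionDemo`, the method `toTextSpace`, and the sample coordinates are hypothetical and chosen for illustration. It assumes an unrotated page (rotation 0), which is the only case the code above adjusts.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: mirrors the y-flip done in the answer's code for rotation 0.
class LinkRegionDemo {

    // Converts a rectangle given in PDF page coordinates (origin at the bottom-left,
    // y growing upward) into a region with a top-left origin (y growing downward),
    // which is the form passed to addRegion in the code above.
    static Rectangle2D.Float toTextSpace(float lowerLeftX, float lowerLeftY,
                                         float upperRightX, float upperRightY,
                                         float pageHeight) {
        float width = upperRightX - lowerLeftX;
        float height = upperRightY - lowerLeftY;
        // The top edge of the rectangle becomes the region's y, measured from the page top.
        float y = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, y, width, height);
    }

    public static void main(String[] args) {
        // Assumed sample values: a 792-point-high page and a link spanning y = 700..712.
        Rectangle2D.Float region = toTextSpace(72f, 700f, 200f, 712f, 792f);
        System.out.println(region.x + " " + region.y + " " + region.width + " " + region.height);
    }
}
```

With these sample values, a link whose top edge sits 712 points above the bottom of a 792-point page ends up 80 points below the top, so the region starts at y = 80 with the width and height unchanged.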
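Comments:
This code extracts the external links in the PDF nicely, but it does not seem to extract links to internal pages. For example, page 3 of my PDF contains a link to page 10, and I need that information too. Any idea how to do that?
@ShiranSEkanayake Please look at the other answer. The bottom part (with PDPageDestination) should do what you want. I haven't tested it, but it looks fine to me.
Answer 2:
We had to get both hyperlink information and internal links (for example, jumps to another page). I used the code below: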
    int pageNum = 0;
    for (PDPage page : originalPDF.getPages()) {
        pageNum++;
        List<PDAnnotation> annotations = page.getAnnotations();
        for (PDAnnotation annot : annotations) {
            if (annot instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annot;
                // the action covers both URI links and GoTo (internal) links
                PDAction action = link.getAction();
                // some internal links have no action and only a direct destination
                PDDestination pDestination = link.getDestination();
                if (action != null) {
                    if (action instanceof PDActionURI) {
                        // external link: print the URI
                        PDActionURI uri = (PDActionURI) action;
                        System.out.println("uri link:" + uri.getURI());
                    } else if (action instanceof PDActionGoTo) {
                        // internal link: resolve the destination to a page
                        PDDestination destination = ((PDActionGoTo) action).getDestination();
                        PDPageDestination pageDestination;
                        if (destination instanceof PDPageDestination) {
                            pageDestination = (PDPageDestination) destination;
                        } else if (destination instanceof PDNamedDestination) {
                            pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) destination);
                        } else {
                            // error handling; note that break also skips the remaining annotations on this page
                            break;
                        }
                        if (pageDestination != null) {
                            // retrievePageNumber() is 0-based, so add 1 for display
                            System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
                        }
                    }
                } else if (pDestination != null) {
                    // no action: resolve the link's own destination the same way
                    PDPageDestination pageDestination;
                    if (pDestination instanceof PDPageDestination) {
                        pageDestination = (PDPageDestination) pDestination;
                    } else if (pDestination instanceof PDNamedDestination) {
                        pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) pDestination);
                    } else {
                        // error handling; note that break also skips the remaining annotations on this page
                        break;
                    }
                    if (pageDestination != null) {
                        System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
                    }
                }
            }
        }
    }