如何提取PDFBox的超链接信息

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了如何提取PDFBox的超链接信息相关的知识,希望对你有一定的参考价值。

我试图使用PDFBox从PDF中提取超链接信息,但我不确定如何获取

for( Object p : pages ) {
    PDPage page = (PDPage)p;

    List<?> annotations = page.getAnnotations();
    for( Object a : annotations ) {
        PDAnnotation annotation = (PDAnnotation)a;

        if( annotation instanceof PDAnnotationLink ) {
            PDAnnotationLink link = (PDAnnotationLink)annotation;
            System.out.println(link.toString());
            System.out.println(link.getDestination());

        }
    }

}

我想提取超链接目标的URL和超链接的文本。怎么能这样做?

谢谢

答案

使用源代码下载中PrintURLs sample code的代码:

for( PDPage page : doc.getPages() )
{
    pageNum++;
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    List<PDAnnotation> annotations = page.getAnnotations();
    //first setup text extraction regions
    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDRectangle rect = link.getRectangle();
            //need to reposition link rectangle to match text space
            float x = rect.getLowerLeftX();
            float y = rect.getUpperRightY();
            float width = rect.getWidth();
            float height = rect.getHeight();
            int rotation = page.getRotation();
            if( rotation == 0 )
            {
                PDRectangle pageSize = page.getMediaBox();
                y = pageSize.getHeight() - y;
            }
            else if( rotation == 90 )
            {
                //do nothing
            }

            Rectangle2D.Float awtRect = new Rectangle2D.Float( x,y,width,height );
            stripper.addRegion( "" + j, awtRect );
        }
    }

    stripper.extractRegions( page );

    for( int j=0; j<annotations.size(); j++ )
    {
        PDAnnotation annot = annotations.get(j);
        if( annot instanceof PDAnnotationLink )
        {
            PDAnnotationLink link = (PDAnnotationLink)annot;
            PDAction action = link.getAction();
            String urlText = stripper.getTextForRegion( "" + j );
            if( action instanceof PDActionURI )
            {
                PDActionURI uri = (PDActionURI)action;
                System.out.println( "Page " + pageNum +":'" + urlText.trim() + "'=" + uri.getURI() );
            }
        }
    }
}

它分为两个部分,一个是获取简单的URL,另一个是获取URL文本,这是通过在注释的矩形处进行文本提取来完成的。

另一答案

我们必须获得超链接信息和内部链接(例如移动页面....)。我使用下面的代码:

int pageNum = 0;
            for (PDPage page : originalPDF.getPages()) {
                pageNum++;
                List<PDAnnotation> annotations = page.getAnnotations();
                for (PDAnnotation annot : annotations) {
                    if (annot instanceof PDAnnotationLink) {
                        // get dimension of annottations
                        PDAnnotationLink link = (PDAnnotationLink) annot;
                        // get link action include link url and internal link
                        PDAction action = link.getAction();
                        // get link internal some case specal
                        PDDestination pDestination = link.getDestination();

                        if (action != null) {
                            if (action instanceof PDActionURI || action instanceof PDActionGoTo) {
                                if (action instanceof PDActionURI) {
                                    // get uri link
                                    PDActionURI uri = (PDActionURI) action;
                                    System.out.println("uri link:" + uri.getURI());
                                } else {
                                    if (action instanceof PDActionGoTo) {
                                        // get internal link
                                        PDDestination destination = ((PDActionGoTo) action).getDestination();
                                        PDPageDestination pageDestination;
                                        if (destination instanceof PDPageDestination) {
                                            pageDestination = (PDPageDestination) destination;
                                        } else {
                                            if (destination instanceof PDNamedDestination) {
                                                pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) destination);
                                            } else {
                                                // error handling
                                                break;
                                            }
                                        }

                                        if (pageDestination != null) {
                                            System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
                                        }
                                    }
                                }
                            }
                        } else {
                            if (pDestination != null) {
                                PDPageDestination pageDestination;
                                if (pDestination instanceof PDPageDestination) {
                                    pageDestination = (PDPageDestination) pDestination;
                                } else {
                                    if (pDestination instanceof PDNamedDestination) {
                                        pageDestination = originalPDF.getDocumentCatalog().findNamedDestinationPage((PDNamedDestination) pDestination);
                                    } else {
                                        // error handling
                                        break;
                                    }
                                }

                                if (pageDestination != null) {
                                    System.out.println("page move: " + (pageDestination.retrievePageNumber() + 1));
                                }
                            } else {
                                //    
                            }
                        }
                    }
                }

            }

以上是关于如何提取PDFBox的超链接信息的主要内容,如果未能解决你的问题,请参考以下文章

PDFBox 创建带有外部 mp3 或 wav 文件的链接/引用的 Sound 对象

在一个excel文件里头有1600个PDF的超链接,如何批量下载?

使用 pdfbox 从 pdf 中提取图像

PDFBox 2.0:在 TextStripper 中获取颜色信息

用C#怎么提取a标签的超链接?

java代码实现对pdf文件文本和图片内容的提取