How can I use iText to convert HTML with images and hyperlinks to PDF?

Posted: 2016-06-06 06:19:21

Question:

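I'm trying to convert HTML to PDF using iTextSharp in an ASP.NET web application that uses both MVC and web forms. The <img> and <a> elements have absolute and relative URLs, and some of the <img> elements are base64. Typical answers here at SO and in Google search results use generic HTML-to-PDF code with XMLWorkerHelper that looks something like this: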

using (var stringReader = new StringReader(xHtml))
{
    using (Document document = new Document())
    {
        PdfWriter writer = PdfWriter.GetInstance(document, stream);
        document.Open();
        XMLWorkerHelper.GetInstance().ParseXHtml(
            writer, document, stringReader
        );
    }
}

So with sample HTML like this:

<div>
    <h3>HTML Works, but Broken in Converted PDF</h3>
    <div>Relative local &lt;img&gt;: <img src='./../content/images/kuujinbo_320-30.gif' /></div>
    <div>
        Base64 &lt;img&gt;:
        <img src='' />
    </div>
    <div><a href='/somePage.html'>Relative local hyperlink, broken in PDF</a></div>
</div>

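In the resulting PDF: (1) all images are missing, and (2) every hyperlink with a relative URL is broken and uses a file URI scheme (file:///XXX...) instead of pointing to the correct web site.

Some answers here at SO and others from Google search recommend replacing relative URLs with absolute URLs, which is perfectly acceptable for one-off cases. However, globally replacing all <img src> and <a href> attributes with a hard-coded string is unacceptable for this question, so please do not post an answer like that, because it will be downvoted accordingly.

I'm looking for a solution that works for many different web applications residing in test, development, and production environments.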

Comments:

Ah, I didn't immediately see that this is actually your own question. OK, I'll look more closely next time.

Answer 1:

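Out of the box, XMLWorker only understands absolute URIs, so the issues described are expected behavior. The parser cannot automatically deduce URI schemes or paths without some additional information.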
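To make the "additional information" concrete, here is a minimal sketch of resolving a relative href or src value once a base URI is known. It uses only System.Uri and is independent of iTextSharp; the class name UriResolutionSketch and the example.com base are illustrative assumptions, not part of the original answer.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the original answer.
public static class UriResolutionSketch
{
    // Resolves an href/src value against an absolute base URI.
    // Values that are already absolute are returned unchanged;
    // relative values (including "../" segments) are resolved against baseUri.
    public static string Resolve(string baseUri, string hrefOrSrc)
    {
        var root = new Uri(baseUri, UriKind.Absolute);
        return new Uri(root, hrefOrSrc).AbsoluteUri;
    }
}
```

For example, Resolve("https://en.wikipedia.org/", "wiki/IText") yields "https://en.wikipedia.org/wiki/IText". Without the first argument there is nothing to resolve the relative value against, which is the situation the default parser is in.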

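Implementing ILinkProvider fixes the broken hyperlink problem, and implementing IImageProvider fixes the broken image problem. Since both implementations must perform URI resolution, that's the first step. The following helper class does that, and also tries to make web (ASP.NET) context calls (examples follow) as simple as possible: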

using System;
using System.IO;
using System.Web;

// resolve URIs for LinkProvider & ImageProvider
public class UriHelper
{
    /* IsLocal; when running in web context:
     * [1] give LinkProvider http[s] scheme; see CreateBase(string baseUri)
     * [2] give ImageProvider relative path starting with '/' - see:
     *     Combine(string relativeUri)
     */
    public bool IsLocal { get; set; }
    public HttpContext HttpContext { get; private set; }
    public Uri BaseUri { get; private set; }

    public UriHelper(string baseUri) : this(baseUri, true) { }
    public UriHelper(string baseUri, bool isLocal)
    {
        IsLocal = isLocal;
        HttpContext = HttpContext.Current;
        BaseUri = CreateBase(baseUri);
    }

    /* get URI for IImageProvider to instantiate iTextSharp.text.Image for 
     * each <img> element in the HTML.
     */
    public string Combine(string relativeUri)
    {
        /* when running in a web context, the HTML is coming from a MVC view 
         * or web form, so convert the incoming URI to a **local** path
         */
        if (HttpContext != null && !BaseUri.IsAbsoluteUri && IsLocal)
        {
            return HttpContext.Server.MapPath(
                // Combine() checks directory traversal exploits
                VirtualPathUtility.Combine(BaseUri.ToString(), relativeUri)
            );
        }
        return BaseUri.Scheme == Uri.UriSchemeFile 
            ? Path.Combine(BaseUri.LocalPath, relativeUri)
            // for this example we're assuming URI.Scheme is http[s]
            : new Uri(BaseUri, relativeUri).AbsoluteUri;
    }

    private Uri CreateBase(string baseUri)
    {
        if (HttpContext != null)
        {   // running on a web server; need to update original value  
            var req = HttpContext.Request;
            baseUri = IsLocal
                // IImageProvider; absolute virtual path (starts with '/')
                // used to convert to local file system path. see:
                // Combine(string relativeUri)
                ? req.ApplicationPath
                // ILinkProvider; absolute http[s] URI scheme
                : req.Url.GetLeftPart(UriPartial.Authority)
                    + req.ApplicationPath;
        }

        Uri uri;
        if (Uri.TryCreate(baseUri, UriKind.RelativeOrAbsolute, out uri)) return uri;

        throw new InvalidOperationException("cannot create a valid BaseUri");
    }
}

Implementing ILinkProvider is now trivial, since UriHelper provides the base URI. We just need the correct URI scheme (file or http[s]):

using System;
using iTextSharp.tool.xml.pipeline.html;

// make hyperlinks with relative URLs absolute
public class LinkProvider : ILinkProvider
{
    // rfc1738 - file URI scheme section 3.10
    public const char SEPARATOR = '/';
    public string BaseUrl { get; private set; }

    public LinkProvider(UriHelper uriHelper)
    {
        var uri = uriHelper.BaseUri;
        /* simplified implementation that only takes into account:
         * Uri.UriSchemeFile || Uri.UriSchemeHttp || Uri.UriSchemeHttps
         */
        BaseUrl = uri.Scheme == Uri.UriSchemeFile
            // need trailing separator or file paths break
            ? uri.AbsoluteUri.TrimEnd(SEPARATOR) + SEPARATOR
            // assumes Uri.UriSchemeHttp || Uri.UriSchemeHttps
            : uri.AbsoluteUri;
    }

    public string GetLinkRoot()
    {
        return BaseUrl;
    }
}
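The "need trailing separator" comment above can be illustrated with plain System.Uri resolution. This is only a sketch of standard relative-reference resolution, not of how XMLWorker itself joins the link root with an href, which the answer does not show; the class name BaseUrlSketch and the example.com URLs are assumptions made for illustration.

```csharp
using System;

// Hypothetical helper for illustration only; not part of the original answer.
public static class BaseUrlSketch
{
    // Same normalization the LinkProvider applies to file URIs:
    // strip any trailing '/' characters, then append exactly one.
    public static string EnsureTrailingSeparator(string baseUrl)
    {
        return baseUrl.TrimEnd('/') + "/";
    }

    // Standard relative-reference resolution as implemented by System.Uri.
    public static string Resolve(string baseUrl, string relative)
    {
        return new Uri(new Uri(baseUrl), relative).AbsoluteUri;
    }
}
```

With a base of "http://example.com/app", resolving "page.html" gives "http://example.com/page.html", because the last path segment is treated as a file name and replaced. After EnsureTrailingSeparator the same call gives "http://example.com/app/page.html".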

IImageProvider only requires implementing a single method, Retrieve(string src), but Store(string src, Image img) is easy to add as well. Note the inline comments there and for GetImageRootPath():

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.tool.xml.pipeline.html;

// handle <img> elements in HTML  
public class ImageProvider : IImageProvider
{
    private UriHelper _uriHelper;
    // see Store(string src, Image img)
    private Dictionary<string, Image> _imageCache = 
        new Dictionary<string, Image>();

    public virtual float ScalePercent { get; set; }
    public virtual Regex Base64 { get; set; }

    public ImageProvider(UriHelper uriHelper) : this(uriHelper, 67f) { }
    //              hard-coded based on general past experience ^^^
    // but call the overload to supply your own
    public ImageProvider(UriHelper uriHelper, float scalePercent)
    {
        _uriHelper = uriHelper;
        ScalePercent = scalePercent;
        Base64 = new Regex( // rfc2045, section 6.8 (alphabet/padding)
            @"^data:image/[^;]+;base64,(?<data>[a-z0-9+/]+={0,2})$",
            RegexOptions.Compiled | RegexOptions.IgnoreCase
        );
    }

    public virtual Image ScaleImage(Image img)
    {
        img.ScalePercent(ScalePercent);
        return img;
    }

    public virtual Image Retrieve(string src)
    {
        if (_imageCache.ContainsKey(src)) return _imageCache[src];

        try
        {
            // absolute http[s] URL; let iTextSharp fetch it
            if (Regex.IsMatch(src, "^https?://", RegexOptions.IgnoreCase))
            {
                return ScaleImage(Image.GetInstance(src));
            }

            // base64 data URI; decode the payload in memory
            Match match;
            if ((match = Base64.Match(src)).Length > 0)
            {
                return ScaleImage(Image.GetInstance(
                    Convert.FromBase64String(match.Groups["data"].Value)
                ));
            }

            // anything else is treated as relative to the base URI
            var imgPath = _uriHelper.Combine(src);
            return ScaleImage(Image.GetInstance(imgPath));
        }
        // not implemented to keep the SO answer (relatively) short
        catch (BadElementException) { return null; }
        catch (IOException) { return null; }
        catch (Exception) { return null; }
    }

    /*
     * always called after Retrieve(string src):
     * [1] cache any duplicate <img> in the HTML source so the image bytes
     *     are only written to the PDF **once**, which reduces the 
     *     resulting file size.
     * [2] the cache can also **potentially** save network IO if you're
     *     running the parser in a loop, since Image.GetInstance() creates
     *     a WebRequest when an image resides on a remote server. couldn't
     *     find a CachePolicy in the source code
     */
    public virtual void Store(string src, Image img)
    {
        if (!_imageCache.ContainsKey(src)) _imageCache.Add(src, img);
    }

    /* XMLWorker documentation for ImageProvider recommends implementing
     * GetImageRootPath():
     * 
     * http://demo.itextsupport.com/xmlworker/itextdoc/flatsite.html#itextdoc-menu-10
     * 
     * but a quick run through the debugger never hits the breakpoint, so 
     * not sure if I'm missing something, or something has changed internally 
     * with XMLWorker....
     */
    public virtual string GetImageRootPath() { return null; }
    public virtual void Reset() { }
}
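The base64 branch of Retrieve() can be tried in isolation, without iTextSharp. The sketch below reuses the same data-URI pattern and returns the decoded bytes; the class name DataUriSketch and the method TryDecode are hypothetical names introduced for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original answer.
public static class DataUriSketch
{
    // Same pattern as ImageProvider.Base64: the payload is captured in "data".
    private static readonly Regex Base64 = new Regex(
        @"^data:image/[^;]+;base64,(?<data>[a-z0-9+/]+={0,2})$",
        RegexOptions.Compiled | RegexOptions.IgnoreCase
    );

    // Returns the decoded image bytes, or null when src is not a
    // base64 image data URI (e.g. a relative path or an http[s] URL).
    public static byte[] TryDecode(string src)
    {
        var match = Base64.Match(src);
        return match.Success
            ? Convert.FromBase64String(match.Groups["data"].Value)
            : null;
    }
}
```

TryDecode("data:image/png;base64,iVBORw0KGgo=") returns the eight bytes of the PNG signature, while TryDecode("Images/alt-gravatar.png") returns null. Note that the pattern only checks the alphabet and padding, so a payload of invalid length can still make Convert.FromBase64String throw a FormatException; in ImageProvider that case falls into the catch blocks.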

Based on the XML Worker documentation, it's pretty straightforward to hook the implementations of ILinkProvider and IImageProvider above into a simple parser class:

using System.IO;
using System.Text.RegularExpressions;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;
using iTextSharp.tool.xml.html;
using iTextSharp.tool.xml.parser;
using iTextSharp.tool.xml.pipeline.css;
using iTextSharp.tool.xml.pipeline.end;
using iTextSharp.tool.xml.pipeline.html;

/* a simple parser that uses XMLWorker and XMLParser to handle converting 
 * (most) images and hyperlinks internally
 */
public class SimpleParser
{
    public virtual ILinkProvider LinkProvider { get; set; }
    public virtual IImageProvider ImageProvider { get; set; }

    public virtual HtmlPipelineContext HtmlPipelineContext { get; set; }
    public virtual ITagProcessorFactory TagProcessorFactory { get; set; }
    public virtual ICSSResolver CssResolver { get; set; }

    /* overloads simplified to keep SO answer (relatively) short. if needed
     * set LinkProvider/ImageProvider after instantiating SimpleParser()
     * to override the defaults (e.g. ImageProvider.ScalePercent)
     */
    public SimpleParser() : this(null) { }
    public SimpleParser(string baseUri)
    {
        LinkProvider = new LinkProvider(new UriHelper(baseUri, false));
        ImageProvider = new ImageProvider(new UriHelper(baseUri, true));

        HtmlPipelineContext = new HtmlPipelineContext(null);

        // another story altogether, and not implemented for simplicity 
        TagProcessorFactory = Tags.GetHtmlTagProcessorFactory();
        CssResolver = XMLWorkerHelper.GetInstance().GetDefaultCssResolver(true);
    }

    /*
     * when sending XHR via any of the popular javascript frameworks,
     * <img> tags are **NOT** always closed, which results in the 
     * infamous iTextSharp.tool.xml.exceptions.RuntimeWorkerException:
     * 'Invalid nested tag a found, expected closing tag img.' a simple
     * workaround.
     */
    public virtual string SimpleAjaxImgFix(string xHtml)
    {
        return Regex.Replace(
            xHtml,
            "(?<image><img[^>]+)(?<=[^/])>",
            new MatchEvaluator(match => match.Groups["image"].Value + " />"),
            RegexOptions.IgnoreCase | RegexOptions.Multiline
        );
    }

    public virtual void Parse(Stream stream, string xHtml)
    {
        xHtml = SimpleAjaxImgFix(xHtml);

        using (var stringReader = new StringReader(xHtml))
        {
            using (Document document = new Document())
            {
                PdfWriter writer = PdfWriter.GetInstance(document, stream);
                document.Open();

                // use the properties so callers can override the defaults
                HtmlPipelineContext
                    .SetTagFactory(TagProcessorFactory)
                    .SetLinkProvider(LinkProvider)
                    .SetImageProvider(ImageProvider)
                ;
                var pdfWriterPipeline = new PdfWriterPipeline(document, writer);
                var htmlPipeline = new HtmlPipeline(HtmlPipelineContext, pdfWriterPipeline);
                var cssResolverPipeline = new CssResolverPipeline(CssResolver, htmlPipeline);

                XMLWorker worker = new XMLWorker(cssResolverPipeline, true);
                XMLParser parser = new XMLParser(worker);
                parser.Parse(stringReader);
            }
        }
    }
}

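As commented inline, SimpleAjaxImgFix(string xHtml) specifically handles XHR that may send unclosed <img> tags, which is valid HTML but invalid XML that will break XMLWorker. A simple explanation and implementation of how to receive a PDF or other binary data with XHR and iTextSharp can be found here.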
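The effect of that regular expression is easiest to see on a small input. The sketch below applies the same pattern outside of SimpleParser; the class name ImgFixSketch and the method name CloseImgTags are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the original answer.
public static class ImgFixSketch
{
    // Same pattern as SimpleAjaxImgFix: capture "<img ..." up to the closing
    // '>', but only when the character right before '>' is not '/'.
    public static string CloseImgTags(string html)
    {
        return Regex.Replace(
            html,
            "(?<image><img[^>]+)(?<=[^/])>",
            match => match.Groups["image"].Value + " />",
            RegexOptions.IgnoreCase
        );
    }
}
```

CloseImgTags("<div><img src='a.gif'><br></div>") returns "<div><img src='a.gif' /><br></div>": the <img> is closed, the <br> is left alone, and an <img> that already ends in "/>" is not touched. A '>' character inside an attribute value would defeat the pattern.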

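A Regex was used in SimpleAjaxImgFix(string xHtml) so that anyone using (copy/pasting?) the code doesn't need to add another NuGet package, but an HTML parser like HtmlAgilityPack should be used instead, since it turns this:

<div><img src='a.gif'><br><hr></div>

into this:

<div><img src='a.gif' /><br /><hr /></div>

with only a few lines of code: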

var hDocument = new HtmlDocument()
{
    OptionWriteEmptyNodes = true,
    OptionAutoCloseOnEnd = true
};
hDocument.LoadHtml("<div><img src='a.gif'><br><hr></div>");
var closedTags = hDocument.DocumentNode.WriteTo();

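Also of note: use SimpleParser.Parse() above as a general blueprint to additionally implement a custom ICSSResolver or ITagProcessorFactory, which is explained in the documentation.

Now the problems described in the question should be taken care of. Called from an MVC Action Method: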

[HttpPost]  // some browsers have URL length limits
[ValidateInput(false)] // or throws HttpRequestValidationException
public ActionResult Index(string xHtml)
{
    Response.ContentType = "application/pdf";
    Response.AppendHeader(
        "Content-Disposition", "attachment; filename=test.pdf"
    );
    var simpleParser = new SimpleParser();
    simpleParser.Parse(Response.OutputStream, xHtml);

    return new EmptyResult();
}

Or from a Web Form that gets its HTML from a server control:

Response.ContentType = "application/pdf";
Response.AppendHeader("Content-Disposition", "attachment; filename=test.pdf");
using (var stringWriter = new StringWriter())
{
    using (var htmlWriter = new HtmlTextWriter(stringWriter))
    {
        ConvertControlToPdf.RenderControl(htmlWriter);
    }
    var simpleParser = new SimpleParser();
    simpleParser.Parse(Response.OutputStream, stringWriter.ToString());
}
Response.End();

Or a simple HTML file with hyperlinks and images on the file system:

<h1>HTML Page 00 on Local File System</h1>
<div>
    <div>
        Relative &lt;img&gt;: <img src='Images/alt-gravatar.png' />
    </div>
    <div>
        Hyperlink to file system HTML page: 
        <a href='file-system-html-01.html'>Page 01</a>
    </div>
</div>

Or HTML from a remote web site:

<div>
    <div>
        <img  
             src="portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png">
    </div>
    <div lang="en">
        <a href="https://en.wikipedia.org/">English</a>
    </div>
    <div lang="en">
        <a href="wiki/IText">iText</a>
    </div>
</div>

The two HTML snippets above, run from a console app:

// basePath: the directory that holds the two HTML files
var filePaths = Path.Combine(basePath, "file-system-html-00.html");
var htmlFile = File.ReadAllText(filePaths);
var remoteUrl = Path.Combine(basePath, "wikipedia.html");
var htmlRemote = File.ReadAllText(remoteUrl);
var outputFile = Path.Combine(basePath, "filePaths.pdf");
var outputRemote = Path.Combine(basePath, "remoteUrl.pdf");

using (var stream = new FileStream(outputFile, FileMode.Create))
{
    var simpleParser = new SimpleParser(basePath);
    simpleParser.Parse(stream, htmlFile);
}
using (var stream = new FileStream(outputRemote, FileMode.Create))
{
    var simpleParser = new SimpleParser("https://wikipedia.org");
    simpleParser.Parse(stream, htmlRemote);
}

Quite a long answer, but take a look at the questions here at SO tagged html, pdf, and itextsharp: as of this writing (2016-02-23) there are 776 results, against 4,063 total tagged itextsharp. That's 19%.

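Comments:

Nice, such a long answer in such a short time... ;)

@mkl - LOL.... I've been going back and forth for the last week or so on posting the question and then answering it myself. <sarcasm>Being the eternal optimist</sarcasm> I'm hoping this helps a tiny bit, even though I've grown tired of seeing version 1.XX.XXX.XXXX spin-offs of this or very similar concepts. ;)

Very nice @kuujinbo, this will be a good "close as duplicate" reference!

@ChrisHaas - Thanks! :) Hopefully it helps clean things up around here. (itextsharp tag)

This is great info and helped me understand the iText documentation. It seems logical that something like your SimpleParser would become part of iTextSharp. Any chance you'd consider submitting this as a pull request?

Answer 2: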

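Very useful post.

I was having problems rendering the images in my report HTML to PDF. With your post I was able to do it.

I'm using ASP.NET MVC 5.

I only had to change this method of the ImageProvider class from:

public virtual string GetImageRootPath() { return null; }

to:

public virtual string GetImageRootPath() { return HostingEnvironment.MapPath("~/Content/Images/"); }

Thanks!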
