Converting rtf to html with format in Java

【Posted】: 2014-04-18 16:30:13 【Question】:

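I can use JEditorPane to parse RTF text and convert it to HTML. However, the HTML output is missing some of the formatting, namely the strike-through markup in this example. As you can see in the output, the underlined text is correctly wrapped in <u>, but nothing wraps the strike-through text. Any ideas?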

import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.swing.JEditorPane;
import javax.swing.text.StyledEditorKit;

public void testRtfToHtml()
{
    JEditorPane pane = new JEditorPane();
    pane.setContentType("text/rtf");

    StyledEditorKit kitRtf = (StyledEditorKit) pane.getEditorKitForContentType("text/rtf");

    try
    {
        // RTF sample: one struck-through run followed by one underlined run.
        kitRtf.read(
            new StringReader(
                "{\\rtf1\\ansi \\deflang1033\\deff0{\\fonttbl {\\f0\\froman \\fcharset0 \\fprq2 Times New Roman;}}{\\colortbl;\\red0\\green0\\blue0;} {\\stylesheet{\\fs20 \\snext0 Normal;}} {\\plain \\fs26 \\strike\\fs26 This is supposed to be strike-through.}{\\plain \\fs26 \\fs26  } {\\plain \\fs26 \\ul\\fs26 Underline text here }{\\plain \\fs26 \\fs26 .{\\u698\\'20}}}"),
            pane.getDocument(), 0);
        kitRtf = null;

        // Write the same document back out through the HTML editor kit.
        StyledEditorKit kitHtml =
            (StyledEditorKit) pane.getEditorKitForContentType("text/html");

        Writer writer = new StringWriter();
        kitHtml.write(writer, pane.getDocument(), 0, pane.getDocument().getLength());
        System.out.println(writer.toString());
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}

Output:

<html>
  <head>
    <style>
      <!--
        p.Normal {
          RightIndent:0.0;
          FirstLineIndent:0.0;
          LeftIndent:0.0;
        }
      -->
    </style>
  </head>
  <body>
    <p class=default>
              <span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
This is supposed to be strike-through.
      </span>
      <span style="color: #000000; font-size: 13pt; font-family: Times New Roman">

      </span>
       <span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
<u>Underline text here</u>
      </span>
       <span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
.?
      </span>

    </p>
  </body>
</html>
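
One way to narrow this down is to check whether the strike-through information reaches the Swing document at all, before the HTML writer is involved. The following is only a diagnostic sketch: the class name `RtfAttributeDump`, the method `dump`, and the short RTF sample are made up for illustration, and it reads the RTF with `RTFEditorKit` into a `DefaultStyledDocument` instead of going through `JEditorPane`. It lists every character run together with the attributes attached to it; no particular output is claimed here, since that depends on how the RTF reader stores each attribute.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Enumeration;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Element;
import javax.swing.text.rtf.RTFEditorKit;

final class RtfAttributeDump {
    // Hypothetical sample: one struck-through run and one underlined run.
    static final String SAMPLE =
        "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\\f0 {\\strike struck}{\\ul underlined}}";

    // Returns one line per character run: the run's text, then its attribute names and values.
    static String dump(String rtf) {
        try {
            DefaultStyledDocument doc = new DefaultStyledDocument();
            new RTFEditorKit().read(new StringReader(rtf), doc, 0);

            StringBuilder out = new StringBuilder();
            int pos = 0;
            while (pos < doc.getLength()) {
                Element run = doc.getCharacterElement(pos);
                int end = Math.min(run.getEndOffset(), doc.getLength());
                out.append('"').append(doc.getText(pos, end - pos)).append("\" ->");

                AttributeSet attrs = run.getAttributes();
                for (Enumeration<?> names = attrs.getAttributeNames(); names.hasMoreElements();) {
                    Object name = names.nextElement();
                    out.append(' ').append(name).append('=').append(attrs.getAttribute(name));
                }
                out.append('\n');
                pos = end;
            }
            return out.toString();
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not read the RTF sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(dump(SAMPLE));
    }
}
```

If the struck-through run carries an attribute under a different name than the underlined run does, that would suggest the formatting is lost when the document is written as HTML, not when the RTF is parsed.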

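【Comments】:

【Answer 1】:

You can try this converter library, which performs the conversion using OpenOffice or LibreOffice, as described in this blog post.

【Comments】:

【Answer 2】:

This is the function I use to convert the RTF from a .msg body to HTML. See my Outlook message parser repository, yamp, on GitHub.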

public static String rtfToHtml(String rtfText) {
    if (rtfText != null) {
        // Unwrap the encapsulated-HTML groups and drop the RTF-only runs.
        rtfText = rtfText.replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*(.*)\\}", "$1")
            .replaceAll("\\\\htmlrtf[1]?(.*)\\\\htmlrtf0", "")
            .replaceAll("\\\\htmlrtf[01]?", "")
            .replaceAll("\\\\htmlbase", "")
            // Layout control words become plain whitespace.
            .replaceAll("\\\\par", "\n")
            .replaceAll("\\\\tab", "\t")
            .replaceAll("\\\\line", "\n")
            .replaceAll("\\\\page", "\n\n")
            .replaceAll("\\\\sect", "\n\n")
            // Special-character control words become HTML entities.
            // The code points are hexadecimal, so the entities use the &#x form.
            .replaceAll("\\\\emdash", "&#x2014;")
            .replaceAll("\\\\endash", "&#x2013;")
            .replaceAll("\\\\emspace", "&#x2003;")
            .replaceAll("\\\\enspace", "&#x2002;")
            .replaceAll("\\\\qmspace", "&#x2005;")
            .replaceAll("\\\\bullet", "&#x2022;")
            .replaceAll("\\\\lquote", "&#x2018;")
            .replaceAll("\\\\rquote", "&#x2019;")
            .replaceAll("\\\\ldblquote", "&#x201C;")
            .replaceAll("\\\\rdblquote", "&#x201D;")
            .replaceAll("\\\\row", "\n")
            .replaceAll("\\\\cell", "|")
            .replaceAll("\\\\nestcell", "|")
            // Strip unescaped group braces, then unescape literal braces.
            .replaceAll("([^\\\\])\\}", "$1")
            .replaceAll("([^\\\\])\\{", "$1")
            .replaceAll("[\\\\](\\})", "$1")
            .replaceAll("[\\\\](\\{)", "$1")
            // Unicode escapes carry decimal values; quoted bytes carry hexadecimal values.
            .replaceAll("\\\\u([0-9]{2,5})", "&#$1;")
            .replaceAll("\\\\'([0-9A-Fa-f]{2})", "&#x$1;")
            // Reduce "cid:name@host" references to the bare name.
            .replaceAll("\"cid:(.*)@.*\"", "\"$1\"");

        // The HTML proper starts at the first <html tag.
        int index = rtfText.indexOf("<html");
        if (index != -1) {
            return rtfText.substring(index);
        }
    }

    return null;
}
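
The two escape rules near the end of the function can be tried in isolation. The sketch below is not part of yamp; the class `RtfEscapes`, the method `toEntities`, and the sample string are invented for illustration. It assumes that the RTF Unicode escape carries a decimal value and that a quoted byte can be mapped directly to a hexadecimal entity, which only holds where the document's code page agrees with Unicode for that byte (for example Latin-1 letters).

```java
import java.util.regex.Pattern;

final class RtfEscapes {
    // RTF Unicode escape: backslash, 'u', then a decimal value of two to five digits.
    private static final Pattern UNICODE = Pattern.compile("\\\\u([0-9]{2,5})");
    // RTF quoted byte: backslash, apostrophe, then two hexadecimal digits.
    private static final Pattern HEX_BYTE = Pattern.compile("\\\\'([0-9A-Fa-f]{2})");

    // Rewrites both escape forms as HTML numeric character references.
    static String toEntities(String rtf) {
        String out = UNICODE.matcher(rtf).replaceAll("&#$1;");
        return HEX_BYTE.matcher(out).replaceAll("&#x$1;");
    }

    public static void main(String[] args) {
        // prints: caf&#xe9; &#8212; done
        System.out.println(toEntities("caf\\'e9 \\u8212 done"));
    }
}
```

Real RTF usually follows a Unicode escape with a fallback character for older readers; this sketch, like the function above, leaves that character in place.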

【Comments】:

【Answer 3】:

Because of a few bugs, I modified your function like this:

public static String rtfToHtml(String rtfText) {
    StringBuilder sb = new StringBuilder();

    if (rtfText != null) {
        // Process the input line by line so that no pattern can match across lines.
        String[] lignes = rtfText.split("[\\r\\n]+");
        for (String ligne : lignes) {
            String tempLine = ligne
                // Unwrap each encapsulated-HTML group; [^}]* stops at the group's own closing brace.
                .replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*([^}]*)\\}", "$1")
                .replaceAll("\\\\htmlrtf0([^\\\\]*)\\\\htmlrtf", "$1")
                .replaceAll("\\\\htmlrtf \\{(.*)\\}\\\\htmlrtf0", "$1")
                .replaceAll("\\\\htmlrtf (.*)\\\\htmlrtf0", "")
                .replaceAll("\\\\htmlrtf[0]?", "")
                // Drop field instructions and keep only the text of the field result.
                .replaceAll("\\\\field\\{\\\\\\*\\\\fldinst\\{[^}]*\\}\\}", "")
                .replaceAll("\\{\\\\fldrslt\\\\cf1\\\\ul([^}]*)\\}", "$1")
                .replaceAll("\\\\htmlbase", "")
                .replaceAll("\\\\par", "\n")
                .replaceAll("\\\\tab", "\t")
                .replaceAll("\\\\line", "\n")
                .replaceAll("\\\\page", "\n\n")
                .replaceAll("\\\\sect", "\n\n")
                // The code points are hexadecimal, so the entities use the &#x form.
                .replaceAll("\\\\emdash", "&#x2014;")
                .replaceAll("\\\\endash", "&#x2013;")
                .replaceAll("\\\\emspace", "&#x2003;")
                .replaceAll("\\\\enspace", "&#x2002;")
                .replaceAll("\\\\qmspace", "&#x2005;")
                .replaceAll("\\\\bullet", "&#x2022;")
                .replaceAll("\\\\lquote", "&#x2018;")
                .replaceAll("\\\\rquote", "&#x2019;")
                .replaceAll("\\\\ldblquote", "&#x201C;")
                .replaceAll("\\\\rdblquote", "&#x201D;")
                .replaceAll("\\\\row", "\n")
                .replaceAll("\\\\cell", "|")
                .replaceAll("\\\\nestcell", "|")
                // Strip unescaped group braces, then unescape literal braces.
                .replaceAll("([^\\\\])\\}", "$1")
                .replaceAll("([^\\\\])\\{", "$1")
                .replaceAll("[\\\\](\\})", "$1")
                .replaceAll("[\\\\](\\{)", "$1")
                .replaceAll("\\\\u([0-9]{2,5})", "&#$1;")
                .replaceAll("\\\\'([0-9A-Fa-f]{2})", "&#x$1;")
                .replaceAll("\"cid:(.*)@.*\"", "\"$1\"")
                // Collapse runs of spaces.
                .replaceAll(" {2,}", " ")
            ;

            // Skip lines that are empty once whitespace is removed.
            if (!tempLine.replaceAll("\\s+", "").isEmpty()) {
                sb.append(tempLine).append("\r\n");
            }
        }

        rtfText = sb.toString();

        int index = rtfText.indexOf("<html");
        if (index != -1) {
            return rtfText.substring(index);
        }
    }

    return null;
}
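
One visible difference between the two versions is how the htmltag group is captured: the earlier function uses a greedy `(.*)`, while this one uses `[^}]*`. The sketch below shows what that changes when a line holds more than one group. The class `HtmlTagUnwrap`, its method names, and the sample line are assumptions made for illustration; they do not come from either answer.

```java
final class HtmlTagUnwrap {
    // Hypothetical sample: two encapsulated-HTML groups on a single line.
    static final String LINE = "{\\*\\htmltag64 <p>}Hello{\\*\\htmltag72 </p>}";

    // Greedy capture: (.*) runs to the last closing brace on the line,
    // so the second group is swallowed into the first match.
    static String greedy(String line) {
        return line.replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*(.*)\\}", "$1");
    }

    // Negated class: [^}]* stops at the first closing brace,
    // so each group is unwrapped on its own.
    static String perGroup(String line) {
        return line.replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*([^}]*)\\}", "$1");
    }

    public static void main(String[] args) {
        System.out.println("[" + greedy(LINE) + "]");   // second group is left in place
        System.out.println("[" + perGroup(LINE) + "]"); // both groups are unwrapped
    }
}
```

With the greedy pattern the second `htmltag` group survives in the output and has to be cleaned up by later rules, whereas the negated class yields ` <p>Hello </p>` directly.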

【Comments】:
