Converting rtf to html with format in Java
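[Posted]: 2014-04-18 16:30:13
[Question]: I can use a JEditorPane to parse RTF text and convert it to HTML. However, the HTML output is missing some of the formatting, namely the strike-through markup in this example. As you can see in the output, the underlined text is correctly wrapped in <u>, but nothing wraps the strike-through text. Any ideas?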
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.io.Writer;
    import javax.swing.JEditorPane;
    import javax.swing.text.StyledEditorKit;

    public void testRtfToHtml() {
        JEditorPane pane = new JEditorPane();
        pane.setContentType("text/rtf");
        StyledEditorKit kitRtf = (StyledEditorKit) pane.getEditorKitForContentType("text/rtf");
        try {
            // Parse the RTF into the pane's document
            kitRtf.read(
                new StringReader(
                    "{\\rtf1\\ansi \\deflang1033\\deff0{\\fonttbl {\\f0\\froman \\fcharset0 \\fprq2 Times New Roman;}}{\\colortbl;\\red0\\green0\\blue0;} {\\stylesheet{\\fs20 \\snext0 Normal;}} {\\plain \\fs26 \\strike\\fs26 This is supposed to be strike-through.}{\\plain \\fs26 \\fs26 } {\\plain \\fs26 \\ul\\fs26 Underline text here} {\\plain \\fs26 \\fs26 .{\\u698\\'20}}}"),
                pane.getDocument(), 0);
            kitRtf = null;
            // Write the same document back out through the HTML editor kit
            StyledEditorKit kitHtml =
                (StyledEditorKit) pane.getEditorKitForContentType("text/html");
            Writer writer = new StringWriter();
            kitHtml.write(writer, pane.getDocument(), 0, pane.getDocument().getLength());
            System.out.println(writer.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
Output:
<html>
<head>
<style>
<!--
p.Normal {
RightIndent:0.0;
FirstLineIndent:0.0;
LeftIndent:0.0;
}
-->
</style>
</head>
<body>
<p class=default>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
This is supposed to be strike-through.
</span>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
</span>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
<u>Underline text here</u>
</span>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
.?
</span>
</p>
</body>
</html>
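A possible explanation, offered as an assumption rather than a confirmed diagnosis: when the document is not an HTMLDocument, HTMLEditorKit appears to write it through MinimalHTMLWriter, which emits tags for bold, italic, and underline but not for strike-through, even when the attribute is present on the text. The sketch below sidesteps the writer and walks the character runs directly. The class name StrikeAwareHtml and its methods are hypothetical, it handles only underline and strike-through, and it uses a hand-built document in place of the parsed RTF.

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Element;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

// Hypothetical helper, not part of Swing
final class StrikeAwareHtml {

    /** Wraps each character run in u / strike tags according to its attributes. */
    static String toHtml(StyledDocument doc) throws BadLocationException {
        StringBuilder html = new StringBuilder();
        int pos = 0;
        while (pos < doc.getLength()) {
            Element run = doc.getCharacterElement(pos);
            // The last run may include the document's implicit trailing newline
            int end = Math.min(run.getEndOffset(), doc.getLength());
            AttributeSet attrs = run.getAttributes();
            String text = escape(doc.getText(pos, end - pos));
            if (StyleConstants.isUnderline(attrs)) {
                text = "<u>" + text + "</u>";
            }
            if (StyleConstants.isStrikeThrough(attrs)) {
                text = "<strike>" + text + "</strike>";
            }
            html.append(text);
            pos = end;
        }
        return html.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a small hand-made document (not parsed from RTF) and converts it. */
    static String sample() {
        try {
            StyledDocument doc = new DefaultStyledDocument();
            SimpleAttributeSet strike = new SimpleAttributeSet();
            StyleConstants.setStrikeThrough(strike, true);
            SimpleAttributeSet underline = new SimpleAttributeSet();
            StyleConstants.setUnderline(underline, true);
            doc.insertString(0, "gone", strike);
            doc.insertString(doc.getLength(), " plain ", new SimpleAttributeSet());
            doc.insertString(doc.getLength(), "under", underline);
            return toHtml(doc);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sample());
    }
}
```

In the question's code, `pane.getDocument()` could be cast to StyledDocument and passed to `toHtml`, provided the RTF kit really does record `\strike` as `StyleConstants.StrikeThrough`; that part is untested here.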
[Answer 1]: You could try this converter library, which performs the conversion using OpenOffice or LibreOffice, as described in this blog post.
[Answer 2]: This is the function I use to convert the RTF from a .msg body to HTML. See my Outlook message parser repository, yamp, on GitHub.
    public static String rtfToHtml(String rtfText) {
        if (rtfText != null) {
            // Unwrap the HTML fragments that Outlook encapsulates in htmltag groups
            rtfText = rtfText.replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*(.*)\\}", "$1")
                // Drop the RTF-only content between htmlrtf ... htmlrtf0
                .replaceAll("\\\\htmlrtf[1]?(.*)\\\\htmlrtf0", "")
                .replaceAll("\\\\htmlrtf[01]?", "")
                .replaceAll("\\\\htmlbase", "")
                // Control words that map to whitespace
                .replaceAll("\\\\par", "\n")
                .replaceAll("\\\\tab", "\t")
                .replaceAll("\\\\line", "\n")
                .replaceAll("\\\\page", "\n\n")
                .replaceAll("\\\\sect", "\n\n")
                // Special characters, written as hexadecimal entities
                .replaceAll("\\\\emdash", "&#x2014;")
                .replaceAll("\\\\endash", "&#x2013;")
                .replaceAll("\\\\emspace", "&#x2003;")
                .replaceAll("\\\\enspace", "&#x2002;")
                .replaceAll("\\\\qmspace", "&#x2005;")
                .replaceAll("\\\\bullet", "&#x2022;")
                .replaceAll("\\\\lquote", "&#x2018;")
                .replaceAll("\\\\rquote", "&#x2019;")
                .replaceAll("\\\\ldblquote", "&#x201C;")
                .replaceAll("\\\\rdblquote", "&#x201D;")
                .replaceAll("\\\\row", "\n")
                .replaceAll("\\\\cell", "|")
                .replaceAll("\\\\nestcell", "|")
                // Remove unescaped group braces, then unescape literal braces
                .replaceAll("([^\\\\])\\{", "$1")
                .replaceAll("([^\\\\])\\}", "$1")
                .replaceAll("[\\\\](\\{)", "$1")
                .replaceAll("[\\\\](\\})", "$1")
                // RTF Unicode and hex character escapes become numeric entities
                .replaceAll("\\\\u([0-9]{2,5})", "&#$1;")
                .replaceAll("\\\\'([0-9A-Fa-f]{2})", "&#x$1;")
                // Strip the cid: prefix and host part from embedded-content references
                .replaceAll("\"cid:(.*)@.*\"", "\"$1\"");
            int index = rtfText.indexOf("<html");
            if (index != -1) {
                return rtfText.substring(index);
            }
        }
        return null;
    }
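The two substitutions near the end of the function turn RTF character escapes into HTML numeric entities. A minimal sketch that isolates just those two calls, with a hypothetical class name RtfEscapeDemo and a made-up input string:

```java
// Hypothetical demo class; the two regular expressions are the ones used in rtfToHtml
final class RtfEscapeDemo {

    /** Applies only the character-escape substitutions. */
    static String escapesToEntities(String rtf) {
        return rtf
            // Unicode escape: backslash, 'u', then 2 to 5 decimal digits -> decimal entity
            .replaceAll("\\\\u([0-9]{2,5})", "&#$1;")
            // Hex escape: backslash, apostrophe, then two hex digits -> hexadecimal entity
            .replaceAll("\\\\'([0-9A-Fa-f]{2})", "&#x$1;");
    }

    public static void main(String[] args) {
        // Made-up sample: a hex escape (e9) followed by a Unicode escape (8212)
        System.out.println(escapesToEntities("caf\\'e9 \\u8212"));
    }
}
```

Note that in RTF a Unicode escape is normally followed by a fallback character for readers that cannot display it, as in the question's sample where the escape is followed by `\'20`. These regular expressions convert both, so the fallback shows up as an extra entity in the result.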
[Answer 3]: Because of a few bugs, I modified your function as follows:
    public static String rtfToHtml(String rtfText) {
        StringBuilder sb = new StringBuilder();
        if (rtfText != null) {
            // Process the RTF one line at a time so greedy patterns cannot span lines
            String[] lignes = rtfText.split("[\\r\\n]+");
            for (String ligne : lignes) {
                String tempLine = ligne
                    // Unwrap htmltag groups, stopping at the first closing brace
                    .replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*([^}]*)\\}", "$1")
                    .replaceAll("\\\\htmlrtf0([^\\\\]*)\\\\htmlrtf", "$1")
                    .replaceAll("\\\\htmlrtf \\{(.*)\\}\\\\htmlrtf0", "$1")
                    .replaceAll("\\\\htmlrtf (.*)\\\\htmlrtf0", "")
                    .replaceAll("\\\\htmlrtf[0]?", "")
                    // Drop field instructions and keep only the field result text
                    .replaceAll("\\\\field\\{\\\\\\*\\\\fldinst\\{[^}]*\\}\\}", "")
                    .replaceAll("\\{\\\\fldrslt\\\\cf1\\\\ul([^}]*)\\}", "$1")
                    .replaceAll("\\\\htmlbase", "")
                    .replaceAll("\\\\par", "\n")
                    .replaceAll("\\\\tab", "\t")
                    .replaceAll("\\\\line", "\n")
                    .replaceAll("\\\\page", "\n\n")
                    .replaceAll("\\\\sect", "\n\n")
                    // Special characters, written as hexadecimal entities
                    .replaceAll("\\\\emdash", "&#x2014;")
                    .replaceAll("\\\\endash", "&#x2013;")
                    .replaceAll("\\\\emspace", "&#x2003;")
                    .replaceAll("\\\\enspace", "&#x2002;")
                    .replaceAll("\\\\qmspace", "&#x2005;")
                    .replaceAll("\\\\bullet", "&#x2022;")
                    .replaceAll("\\\\lquote", "&#x2018;")
                    .replaceAll("\\\\rquote", "&#x2019;")
                    .replaceAll("\\\\ldblquote", "&#x201C;")
                    .replaceAll("\\\\rdblquote", "&#x201D;")
                    .replaceAll("\\\\row", "\n")
                    .replaceAll("\\\\cell", "|")
                    .replaceAll("\\\\nestcell", "|")
                    // Remove unescaped group braces, then unescape literal braces
                    .replaceAll("([^\\\\])\\{", "$1")
                    .replaceAll("([^\\\\])\\}", "$1")
                    .replaceAll("[\\\\](\\{)", "$1")
                    .replaceAll("[\\\\](\\})", "$1")
                    .replaceAll("\\\\u([0-9]{2,5})", "&#$1;")
                    .replaceAll("\\\\'([0-9A-Fa-f]{2})", "&#x$1;")
                    .replaceAll("\"cid:(.*)@.*\"", "\"$1\"")
                    // Collapse runs of spaces
                    .replaceAll(" {2,}", " ");
                // Skip lines that are empty after conversion
                if (!tempLine.replaceAll("\\s+", "").isEmpty()) {
                    sb.append(tempLine).append("\r\n");
                }
            }
            rtfText = sb.toString();
            int index = rtfText.indexOf("<html");
            if (index != -1) {
                return rtfText.substring(index);
            }
        }
        return null;
    }
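To see what the first substitution does on its own, here is a small sketch that applies only the htmltag pattern to a single made-up line. The class name HtmlTagDemo and the sample input are hypothetical; the regular expression assumes the brace-delimited form `{\*\htmltagN ...}` that encapsulated HTML uses.

```java
// Hypothetical demo class isolating the first replaceAll of the function above
final class HtmlTagDemo {

    /** Replaces each htmltag group with the text it contains. */
    static String unwrap(String line) {
        // Matches an opening brace, backslash-star, backslash, optional 'm',
        // "htmltag", optional digits, then captures everything up to the closing brace
        return line.replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*([^}]*)\\}", "$1");
    }

    public static void main(String[] args) {
        // Made-up sample line: one htmltag group followed by plain text
        System.out.println(unwrap("{\\*\\htmltag64 <p>}Hello"));
    }
}
```

The captured text keeps the space that separates the control word from the HTML fragment, which is one reason the full function later collapses runs of spaces and drops lines that end up blank.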