Read equations and formulas from Word (Docx) into HTML and save them to a database using Java

Posted 2021-04-09


I have a Word/docx file that contains equations, as in the image below.

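I want to read the data from the Word/docx file and save it to my database, so that when needed I can fetch the data from the database and display it on my HTML page. I used Apache POI to read the data from the docx file, but it cannot pick up the formulas. Please help me!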

Answer

Word *.docx files are ZIP archives containing XML files, which follow the Office Open XML format. The formulas contained in Word *.docx files are Office MathML (OMML).
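To see the ZIP structure for yourself, the archive can be listed with the JDK alone. The following is a minimal sketch, not part of the original answer; the class name DocxEntries is made up for illustration, and Formula.docx is the file name used in the example further down. In a typical Word file the main body, including the OMML formulas, sits in the entry word/document.xml.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class, used only for this illustration.
final class DocxEntries {

    // Returns the names of all entries found in a ZIP stream, in archive order.
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumes a file named Formula.docx in the working directory.
        try (InputStream in = Files.newInputStream(Paths.get("Formula.docx"))) {
            entryNames(in).forEach(System.out::println);
        }
    }
}
```

This only inspects the container. For actually reading the formulas, Apache POI as used below is the more convenient route.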

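Unfortunately, this XML format is not very well known outside of Microsoft Office, so it cannot be used directly in HTML. Fortunately, it is XML, and as such it can be transformed using XSLT (Transforming XML Data with XSLT). So we can transform OMML into MathML, for example, which is usable in a much wider range of use cases.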
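For readers who have not used XSLT from Java before, here is a minimal sketch of the mechanism using only javax.xml.transform from the JDK. The stylesheet in it is a toy written for this illustration (it merely turns an old element into a new element); it is not OMML2MML.XSL, and the class name XsltSketch is likewise made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, used only for this illustration.
final class XsltSketch {

    // Toy stylesheet (an assumption for the demo, NOT OMML2MML.XSL):
    // it rewrites a root <old> element into <new>, keeping its text.
    static final String TOY_XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"/old\"><new><xsl:value-of select=\".\"/></new></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result as a string.
    static String transform(String xml, String xsl) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform("<old>x + y</old>", TOY_XSL));
    }
}
```

The OMML case works the same way: the stylesheet is Microsoft's file instead of the toy string, and the input is the DOM node of a formula instead of a string.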

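The transformation via XSLT is mainly driven by an XSL definition of the transformation. Unfortunately, creating one is not easy either. Fortunately, Microsoft has already done this: if you have a current Microsoft Office installed, you can find the file OMML2MML.XSL in the Microsoft Office program directory under %ProgramFiles%. If you cannot find it, search the web to obtain it.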
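Since the exact subdirectory differs between Office versions, one option is to search for the file by name. The sketch below is an illustration only: the class StylesheetFinder is hypothetical, and it assumes the ProgramFiles environment variable is set, which is normally true only on Windows.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Optional;

// Hypothetical helper class, used only for this illustration.
final class StylesheetFinder {

    // Walks the tree below root and returns the first file whose name
    // matches fileName (case-insensitively), or empty if there is none.
    static Optional<Path> find(Path root, String fileName) throws IOException {
        final Path[] found = new Path[1];
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                Path name = file.getFileName();
                if (name != null && name.toString().equalsIgnoreCase(fileName)) {
                    found[0] = file;
                    return FileVisitResult.TERMINATE;
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Skip files and directories that cannot be read.
                return FileVisitResult.CONTINUE;
            }
        });
        return Optional.ofNullable(found[0]);
    }

    public static void main(String[] args) throws IOException {
        String programFiles = System.getenv("ProgramFiles"); // usually null outside Windows
        if (programFiles == null) {
            System.out.println("ProgramFiles is not set");
            return;
        }
        System.out.println(find(Paths.get(programFiles), "OMML2MML.XSL"));
    }
}
```

Once found, copy the file next to your program so that new File("OMML2MML.XSL") in the example below resolves.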

Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML and then save it for later use.

My example stores the formulas it finds as MathML in an ArrayList of strings. You should be able to store these strings in your database as well.
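The answer does not show the database step, so the following is only a sketch of how such a list could be written with plain JDBC. The table formula and its columns doc_name, seq_no and mathml are assumptions made up for this illustration, as is the class name FormulaStore; adapt them to your schema and supply a Connection from your own JDBC driver.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper class, used only for this illustration.
final class FormulaStore {

    // Assumed table (not from the original answer):
    //   CREATE TABLE formula (doc_name VARCHAR(255), seq_no INT, mathml TEXT)
    static final String INSERT_SQL =
        "INSERT INTO formula (doc_name, seq_no, mathml) VALUES (?, ?, ?)";

    // Queues one INSERT per MathML string, runs them as a single batch,
    // and returns the number of rows that were queued.
    static int saveAll(Connection connection, String docName, List<String> mathMLList)
            throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(INSERT_SQL)) {
            int seqNo = 1;
            for (String mathML : mathMLList) {
                statement.setString(1, docName);
                statement.setInt(2, seqNo++);
                statement.setString(3, mathML);
                statement.addBatch();
            }
            statement.executeBatch();
            return mathMLList.size();
        }
    }
}
```

A call such as FormulaStore.saveAll(connection, "Formula.docx", mathMLList) would then follow the extraction loop. Using a PreparedStatement with parameters keeps the MathML markup from being interpreted as SQL.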

The example needs the full ooxml-schemas-1.3.jar, as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath, which is not shipped with the smaller poi-ooxml-schemas jar.

Word document:

Java code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;

import org.w3c.dom.Node;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;

import java.awt.Desktop;

import java.util.List;
import java.util.ArrayList;

/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadFormulas {

 static File stylesheet = new File("OMML2MML.XSL");
 static TransformerFactory tFactory = TransformerFactory.newInstance();
 static StreamSource stylesource = new StreamSource(stylesheet); 

 static String getMathML(CTOMath ctomath) throws Exception {
  Transformer transformer = tFactory.newTransformer(stylesource);

  Node node = ctomath.getDomNode();

  DOMSource source = new DOMSource(node);
  StringWriter stringwriter = new StringWriter();
  StreamResult result = new StreamResult(stringwriter);
  transformer.setOutputProperty("omit-xml-declaration", "yes");
  transformer.transform(source, result);

  String mathML = stringwriter.toString();
  stringwriter.close();

  //The stock OMML2MML.XSL transforms OMML into MathML as XML with namespace prefixes.
  //We don't need those, since we want to use the MathML in HTML, not in XML.
  //Ideally we would change OMML2MML.XSL so that it does not emit them.
  //But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
  mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
  mathML = mathML.replaceAll("xmlns:mml", "xmlns");
  mathML = mathML.replaceAll("mml:", "");

  return mathML;
 }

 public static void main(String[] args) throws Exception {

  XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

  //storing the found MathML in an ArrayList of strings
  List<String> mathMLList = new ArrayList<String>();

  //getting the formulas out of all body elements
  for (IBodyElement ibodyelement : document.getBodyElements()) {
   if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
    XWPFParagraph paragraph = (XWPFParagraph)ibodyelement;
    for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
     mathMLList.add(getMathML(ctomath));
    }
    for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
     for (CTOMath ctomath : ctomathpara.getOMathList()) {
      mathMLList.add(getMathML(ctomath));
     }
    }
   } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
    XWPFTable table = (XWPFTable)ibodyelement; 
    for (XWPFTableRow row : table.getRows()) {
     for (XWPFTableCell cell : row.getTableCells()) {
      for (XWPFParagraph paragraph : cell.getParagraphs()) {
       for (CTOMath ctomath : paragraph.getCTP().getOMathList()) {
        mathMLList.add(getMathML(ctomath));
       }
       for (CTOMathPara ctomathpara : paragraph.getCTP().getOMathParaList()) {
        for (CTOMath ctomath : ctomathpara.getOMathList()) {
         mathMLList.add(getMathML(ctomath));
        }
       }
      }
     }
    }
   }
  }

  document.close();

  //creating a sample HTML file 
  String encoding = "UTF-8";
  FileOutputStream fos = new FileOutputStream("result.html");
  OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
  writer.write("<!DOCTYPE html>\n");
  writer.write("<html lang=\"en\">");
  writer.write("<head>");
  writer.write("<meta charset=\"utf-8\"/>");

  //using MathJax for helping all browsers to interpret MathML
  writer.write("<script type=\"text/javascript\"");
  writer.write(" async src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=MML_CHTML\"");
  writer.write(">");
  writer.write("</script>");

  writer.write("</head>");
  writer.write("<body>");
  writer.write("<p>The following formulas were found in the Word document: </p>");

  int i = 1;
  for (String mathML : mathMLList) {
   writer.write("<p>Formula" + i++ + ":</p>");
   writer.write(mathML);
   writer.write("<p/>");
  }

  writer.write("</body>");
  writer.write("</html>");
  writer.close();

  Desktop.getDesktop().browse(new File("result.html").toURI());

 }
}
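The namespace clean-up inside getMathML can be tried in isolation, without POI or the stylesheet. The sketch below applies the same three replacements to a hand-written sample string; the sample is an assumption about what the stylesheet output roughly looks like, not captured output, and the class name MathMLCleanup is made up. It uses String.replace instead of replaceAll because the patterns are literal text rather than regular expressions.

```java
// Hypothetical demo class, used only for this illustration.
final class MathMLCleanup {

    // Drops the OMML namespace declaration and removes the "mml" prefix,
    // so the result is plain <math> markup that can be embedded in HTML.
    static String stripNamespacePrefixes(String mathML) {
        return mathML
            .replace("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "")
            .replace("xmlns:mml", "xmlns")
            .replace("mml:", "");
    }

    public static void main(String[] args) {
        // Hand-written sample input (assumed shape, not real stylesheet output).
        String sample =
            "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\""
          + " xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\">"
          + "<mml:mi>x</mml:mi></mml:math>";
        System.out.println(stripNamespacePrefixes(sample));
    }
}
```

Plain string replacement is fragile: it would also alter any formula text that happens to contain "mml:". Editing the stylesheet, as the code comment suggests, is the more robust fix.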

Result:

Another answer

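I ran into the same problem converting an MS Word .docx file to .html with MathML support, because when using the MS Office API the equations are automatically converted to PNG files. When I could not find a solution, I decided to write my own shell and Python scripts. They use LaTeX and MathML for the conversion, but require both MS Office and LibreOffice to be installed on your machine.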

Go through my repo: https://github.com/Adityaraj1711/word-to-html

Customize the scripts according to your use. They are well documented.

The generated HTML can be stripped down, so you can easily save it in the database as a text field.
