How can I build an HTML org.w3c.dom.Document?

Posted 2023-03-05

[Asked]: 2015-05-16 12:04:28

[Question]:

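The documentation of the Document interface describes the interface as:

The Document interface represents the entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I cannot find a way to build a Document that is an HTML Document!

I want an HTML Document because I am building a document that I then pass to a library that expects an HTML Document. That library uses Document#getElementsByTagName(String tagname) in a case-insensitive manner, which is fine for HTML but not for XML.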

I have looked around and found nothing. Related questions such as "How to convert an Html source of a webpage into org.w3c.dom.Document in java?" effectively have no answers.

[Comments]:

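You might have XMLSerializer available. xerces.apache.org/xerces-j/apiDocs/org/apache/xml/serialize/…

I think what I'm looking for is xerces.apache.org/xerces-j/apiDocs/org/apache/html/dom/…. Not sure yet, though.

@dimadima I thought so too at first, but I no longer do. I'll try to write an answer later explaining why, along with possible alternatives.

@dimadima I posted an answer with what I have found so far. If I find more, or corrections, I will edit my answer.

[Answer 1]: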

You seem to have two explicit requirements:

org.w3c.dom.Document

Document#getElementsByTagName(String tagname)

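If you are trying to process HTML with org.w3c.dom.Document, I assume you are using some flavor of XHTML, because an XML API such as DOM expects well-formed XML. HTML is not necessarily well-formed XML, but XHTML is. If you are working with plain HTML, you would have to preprocess it to make sure it is well-formed XML before running it through an XML parser. It may be easier to parse the HTML first with an HTML parser such as jsoup, and then build your org.w3c.dom.Document by walking the tree the HTML parser produces (org.jsoup.nodes.Document in the case of jsoup).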

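There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found is in Xerces-J (2.11.0), in the form of org.apache.html.dom.HTMLDocumentImpl. This looks promising at first, but on closer inspection there are some problems.

1. There is no clear, "clean" way to obtain an instance of an object that implements the org.w3c.dom.html.HTMLDocument interface.

With Xerces, we would normally obtain a Document object using a DocumentBuilder in the following way: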

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file

Or using the DOMImplementation variant:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");

In both cases, we obtain the Document object purely through the org.w3c.dom.* interfaces.
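For reference, a Document obtained this way matches tag names case-sensitively, which is exactly the behavior the question wants to avoid. The following is a minimal sketch using only the JDK; the class name CaseSensitivityDemo and the helper buildSample are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CaseSensitivityDemo {

    /** Builds a plain XML Document containing lowercase html, body and p elements. */
    static Document buildSample() throws ParserConfigurationException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element html = doc.createElement("html");
        doc.appendChild(html);
        Element body = doc.createElement("body");
        html.appendChild(body);
        body.appendChild(doc.createElement("p"));
        return doc;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = buildSample();
        // Same case as the stored name: the element is found (prints 1).
        System.out.println(doc.getElementsByTagName("p").getLength());
        // Different case: nothing matches in a plain XML Document (prints 0).
        System.out.println(doc.getElementsByTagName("P").getLength());
    }
}
```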

The closest equivalent I found for HTMLDocument is this:

HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

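This requires us to directly instantiate internal implementation classes, which makes our implementation dependent on Xerces.

(Note: I also saw that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't bother looking into it.)

2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.

Although you can search the HTMLDocument tree in a case-insensitive manner with getElementsByTagName(String tagname), all element names are stored internally in all uppercase. XHTML element and attribute names, however, are supposed to be in all lowercase. (This could be worked around by walking the entire document tree and using Document's renameNode() method to change every element's name to lowercase.)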
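The renameNode() workaround could look roughly like the sketch below. It runs against a plain JDK Document rather than an HTMLDocument (I have not tried it on HTMLDocumentImpl), and the names LowercaseTagNames, lowercaseElements and buildUppercaseSample are hypothetical. Because renameNode() is allowed to replace a node with a new one, the code continues from the node it returns.

```java
import java.util.Locale;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class LowercaseTagNames {

    /**
     * Renames node (if it is an element) and every element below it to lowercase.
     * Returns the node that now occupies node's position in the tree.
     */
    static Node lowercaseElements(Document doc, Node node) {
        Node current = node;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            String lower = name.toLowerCase(Locale.ROOT);
            if (!name.equals(lower)) {
                // renameNode may rename in place or return a replacement node.
                current = doc.renameNode(node, node.getNamespaceURI(), lower);
            }
        }
        Node child = current.getFirstChild();
        while (child != null) {
            Node renamed = lowercaseElements(doc, child);
            child = renamed.getNextSibling();
        }
        return current;
    }

    /** Builds a small document whose element names are all uppercase. */
    static Document buildUppercaseSample() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("HTML");
        doc.appendChild(html);
        Element body = doc.createElement("BODY");
        html.appendChild(body);
        body.appendChild(doc.createElement("P"));
        return doc;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = buildUppercaseSample();
        lowercaseElements(doc, doc.getDocumentElement());
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```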

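In addition, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There doesn't seem to be a straightforward way to set these on an HTMLDocument (unless you do some fiddling with the internal Xerces implementation).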
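For comparison, on a plain org.w3c.dom.Document (not an HTMLDocument) both can be set through the standard DOMImplementation interface. This is only a sketch of that alternative; the class name XhtmlSkeleton and the method newXhtmlDocument are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;

public class XhtmlSkeleton {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Creates a plain Document with an XHTML 1.0 Strict DOCTYPE and a namespaced html root. */
    static Document newXhtmlDocument(String title) throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DOMImplementation impl = factory.newDocumentBuilder().getDOMImplementation();

        DocumentType docType = impl.createDocumentType("html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        // createDocument attaches the DOCTYPE and creates the root element in one step.
        Document doc = impl.createDocument(XHTML_NS, "html", docType);

        Element head = doc.createElementNS(XHTML_NS, "head");
        doc.getDocumentElement().appendChild(head);
        Element titleElement = doc.createElementNS(XHTML_NS, "title");
        if (title != null) {
            titleElement.setTextContent(title);
        }
        head.appendChild(titleElement);
        doc.getDocumentElement().appendChild(doc.createElementNS(XHTML_NS, "body"));
        return doc;
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document doc = newXhtmlDocument("My Title");
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```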

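3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface appears to be incomplete.

I haven't scoured the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs and the comments in the source code of the internal Xerces implementation. In those comments I also found notes that several different parts of the interface are not implemented. (Side note: my impression is that the org.w3c.dom.html.HTMLDocument interface itself isn't really used by anyone and may itself be incomplete.)

For these reasons, I think it is better to avoid org.w3c.dom.html.HTMLDocument and do what we can with org.w3c.dom.Document. What can we do?

One approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach doesn't require much code, but it still makes our implementation dependent on Xerces, since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we simply convert all tag names to lowercase on element creation and on search. This allows Document#getElementsByTagName(String tagname) to be used in a case-insensitive manner.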

MyHTMLDocumentImpl:

import java.util.Locale;

import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;


    /**
     * Creates a Document with the basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html 
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *     <head>
     *         <title>My Title</title>
     *     </head>
     *     <body/>
     * </html>
     * }
     * </pre>
     * 
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document. 
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);

        return htmlDoc;
    }

    /**
     * This method allows us to create our
     * MyHTMLDocumentImpl from an existing Document.
     */
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    //Locale.ROOT keeps the lowercasing independent of the default locale (e.g. Turkish "I")
    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase(Locale.ROOT));
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase(Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase(Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase(Locale.ROOT));
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase(Locale.ROOT));
    }
}

Tester:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;


public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
    public static void main(String[] args)
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());

        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true);  //if you want it pretty and indented

        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc));
    }
}

Output:

My Title

Here is some text1.
Here is some text2.
Here is some text3.

Here is some text1.
Here is some text2.
Here is some text3.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>

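Another approach, similar to the one above, is to create a Document wrapper that wraps a Document object and itself implements the Document interface. This takes more code than the "extend DocumentImpl" approach, but it is "cleaner" in that we don't have to care about the specific Document implementation. The extra code is not difficult; providing wrapper implementations for all the Document methods is just a bit tedious. I haven't fully worked this out and there may be some problems, but if it works, this is the general idea: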

public class MyHTMLDocumentWrapper implements Document {

    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }

    //... (delegating implementations of the remaining Document methods)
}
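If writing out every delegating method is too tedious, the same wrapper idea can be sketched with a JDK dynamic proxy instead of a handwritten class. This swaps in a different technique from the outline above and is untested against the library in the question; the names LowercasingDocumentProxy and wrap are made up. It relies on the fact that in each of the five intercepted Document methods the tag or qualified name is the last argument.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;

public class LowercasingDocumentProxy {

    // Document methods whose last argument is a tag name or qualified name.
    private static final Set<String> NAME_METHODS = new HashSet<>(Arrays.asList(
            "createElement", "createElementNS",
            "getElementsByTagName", "getElementsByTagNameNS", "renameNode"));

    /** Wraps any Document so that the methods above see lowercase names. */
    static Document wrap(Document target) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object[] callArgs = args;
            if (args != null && NAME_METHODS.contains(method.getName())
                    && args[args.length - 1] instanceof String) {
                callArgs = args.clone();
                int last = callArgs.length - 1;
                callArgs[last] = ((String) callArgs[last]).toLowerCase(Locale.ROOT);
            }
            try {
                // Everything else is delegated to the wrapped Document unchanged.
                return method.invoke(target, callArgs);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the original exception, e.g. DOMException
            }
        };
        return (Document) Proxy.newProxyInstance(
                LowercasingDocumentProxy.class.getClassLoader(),
                new Class<?>[] { Document.class }, handler);
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Document plain = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Document doc = wrap(plain);
        doc.appendChild(doc.createElement("HTML"));
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getElementsByTagName("hTmL").getLength());
    }
}
```

Some caveats apply to any wrapper, proxy or handwritten: nodes still report the wrapped document from getOwnerDocument(), Element#getElementsByTagName on individual elements is not intercepted, and code that casts the Document to an implementation class will not accept the wrapper.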

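Whether it's org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, perhaps these suggestions will help you figure out how to proceed.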

Edit:

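In my parsing tests, where I tried to parse the XHTML file below, Xerces would hang in an entity-management class that was trying to open an HTTP connection. Why, I don't know, especially since I was testing on a local HTML file with no entities. (Perhaps it has something to do with the DOCTYPE or the namespace?) Here is the document: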

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>
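One plausible explanation, which this answer does not confirm, is that the parser tries to download the external DTD named in the DOCTYPE even when it is not validating. Under that assumption, a sketch like the following avoids the network access by answering every external entity request with empty content through a standard EntityResolver. The names OfflineXhtmlParse and parseWithoutFetchingDtd are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class OfflineXhtmlParse {

    /** Parses XHTML text without opening a connection for the DTD in its DOCTYPE. */
    static Document parseWithoutFetchingDtd(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Resolve every external entity (including the DTD) to an empty document.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
                + "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>My Title</title></head>"
                + "<body><p id=\"anId1\">Here is some text1.</p></body></html>";
        Document doc = parseWithoutFetchingDtd(xhtml);
        System.out.println(doc.getElementsByTagName("p").getLength());
    }
}
```

The trade-off is that named entities defined by the XHTML DTD (for example &nbsp;) are then undefined and cause a parse error, so this only suits documents like the one above that use none.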

[Comments]:

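Hi dbank, thanks for your answer! In practice, I first ran the raw HTML through jsoup to build an org.jsoup.nodes.Document. I then converted that to an org.w3c.dom.Document by walking the nodes of the jsoup Document and creating corresponding nodes for the Xerces 2 HTMLDocument implementation. Anyway, by that point the whole thing had become so unpleasant that I never even tested whether it worked as far as case-sensitive queries go :). Thanks for your answer! Really appreciate it.

@dimadima: I just made an edit. MyHTMLDocumentImpl.createFrom(Document doc) actually seems to work fine. However, the Xerces DOM parser seems to hang while parsing the sample XHTML file.

@dimadima: In any case, all of this is at your own risk. I hope it helps, though. :-)

I think I now know why the Xerces DOM parser seemed to hang while parsing the sample XHTML file. It actually does finish parsing eventually, after hanging for a long time. When I have time, I'll try to edit the answer with an explanation and a possible solution.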
