SAX parser for a very huge XML file

Posted 2023-02-24

【Title】: SAX parser for a very huge XML file 【Published】: 2011-08-06 18:07:07 【Question】:

I am processing a very large XML file (4 GB) and I keep getting an out-of-memory error, even though my Java heap is already set to its maximum. Here is the code:

    Handler h1 = new Handler("post");
    Handler h2 = new Handler("comment");
    posts = new Hashtable<Integer, Posts>();
    comments = new Hashtable<Integer, Comments>();
    edges = new Hashtable<String, Edges>();
    try {
        output = new BufferedWriter(new FileWriter("gephi.gdf"));
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        SAXParser parser1 = SAXParserFactory.newInstance().newSAXParser();

        parser.parse(new File("G:\\posts.xml"), h1);
        parser1.parse(new File("G:\\comments.xml"), h2);
    } catch (Exception ex) {
        ex.printStackTrace();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (qName.equalsIgnoreCase("row") && type.equals("post")) {
            post = new Posts();
            post.id = Integer.parseInt(atts.getValue("Id"));
            post.postTypeId = Integer.parseInt(atts.getValue("PostTypeId"));
            if (atts.getValue("AcceptedAnswerId") != null)
                post.acceptedAnswerId = Integer.parseInt(atts.getValue("AcceptedAnswerId"));
            else
                post.acceptedAnswerId = -1;
            post.score = Integer.parseInt(atts.getValue("Score"));
            if (atts.getValue("OwnerUserId") != null)
                post.ownerUserId = Integer.parseInt(atts.getValue("OwnerUserId"));
            else
                post.ownerUserId = -1;
            if (atts.getValue("ParentId") != null)
                post.parentId = Integer.parseInt(atts.getValue("ParentId"));
            else
                post.parentId = -1;
        }
        else if (qName.equalsIgnoreCase("row") && type.equals("comment")) {
            comment = new Comments();
            comment.id = Integer.parseInt(atts.getValue("Id"));
            comment.postId = Integer.parseInt(atts.getValue("PostId"));
            if (atts.getValue("Score") != null)
                comment.score = Integer.parseInt(atts.getValue("Score"));
            else
                comment.score = -1;
            if (atts.getValue("UserId") != null)
                comment.userId = Integer.parseInt(atts.getValue("UserId"));
            else
                comment.userId = -1;
        }
    }



    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (qName.equalsIgnoreCase("row") && type.equals("post")) {
            posts.put(post.id, post);
            //System.out.println("Size of hash table is " + posts.size());
        } else if (qName.equalsIgnoreCase("row") && type.equals("comment")) {
            comments.put(comment.id, comment);
        }
    }

Is there a way to optimize this code so that I don't run out of memory? Perhaps by using streams? If so, how would you do it?


【Answer 1】:

The SAX parser is efficient to a fault; the parser itself is not what is exhausting the heap.

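The posts, comments, and edges maps immediately jump out at me as potential problems, because every parsed row stays in memory until parsing finishes. I suspect you will need to periodically flush these maps out of memory to avoid an OutOfMemoryError.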
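As a rough sketch of what periodic flushing could look like: the handler below buffers a fixed number of rows, writes them out, and clears the buffer before continuing. The class name `BatchingHandler`, the batch size, and the comma-separated output format are assumptions made for illustration; they are not part of the question's code, and a real version would write whatever the `gephi.gdf` output needs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps at most batchSize rows in memory at any time. */
class BatchingHandler extends DefaultHandler {
    private final Writer out;
    private final int batchSize;
    private final List<String> batch = new ArrayList<>();
    private long flushedRows = 0;

    BatchingHandler(Writer out, int batchSize) {
        this.out = out;
        this.batchSize = batchSize;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"row".equalsIgnoreCase(qName)) {
            return;
        }
        // Keep only the fields that are needed; -1 marks a missing owner, as in the question.
        String owner = atts.getValue("OwnerUserId");
        batch.add(atts.getValue("Id") + "," + (owner != null ? owner : "-1"));
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    @Override
    public void endDocument() {
        flush(); // write out the final, possibly partial, batch
    }

    /** Writes the buffered rows outside the heap and clears the buffer. */
    private void flush() {
        try {
            for (String line : batch) {
                out.write(line);
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        flushedRows += batch.size();
        batch.clear();
    }

    int bufferedRows() { return batch.size(); }

    long flushedRows() { return flushedRows; }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample standing in for the 4 GB file.
        String sample = "<posts><row Id=\"1\" OwnerUserId=\"7\"/>"
                + "<row Id=\"2\"/><row Id=\"3\" OwnerUserId=\"9\"/></posts>";
        Writer out = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        BatchingHandler handler = new BatchingHandler(out, 2);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)), handler);
    }
}
```

With this shape, memory use depends on the batch size rather than on the file size. If later processing needs to look rows up by id (for example to build edges between posts and comments), the flushed data would have to live somewhere queryable, such as a database, rather than in a flat file.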

【Comments】:

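Yes... let's build huge data structures in memory, but blame SAX.

How do you flush these periodically?

@EquinoX To flush, you would need to pause every X elements, write the data somewhere outside the JVM (e.g. a database, a file on disk, etc.), and clear the maps for the next batch.

I disagree: there is no XPath support, and the verbose, tedious conditional checks needed for anything slightly complex lead to complicated, unmaintainable code.

【Answer 2】: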

Take a look at a project called SaxDoMix: http://www.devsphere.com/xml/saxdomix/

It lets you parse a large XML file and have certain elements returned as parsed DOM entities. It is much easier to use than a pure SAX parser.
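The sketch below does not use SaxDoMix (its API is not shown here). It only illustrates the same general idea, streaming through the file while materializing one element at a time as a small DOM fragment, using the JDK's own StAX reader and an identity `Transformer` instead. The class name `RowFragments` and the method `forEachRow` are hypothetical names chosen for this example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper: streams a document and hands each matching element over as a DOM Element. */
class RowFragments {

    static void forEachRow(Reader xml, String elementName, Consumer<Element> callback)
            throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        try {
            while (reader.hasNext()) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // Copy just this element's subtree into a fresh, small DOM document.
                    DOMResult result = new DOMResult();
                    identity.transform(new StAXSource(reader), result);
                    callback.accept(((Document) result.getNode()).getDocumentElement());
                    // The transform has already moved the cursor, so the event
                    // is re-checked on the next iteration instead of advancing here.
                } else {
                    reader.next();
                }
            }
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample standing in for the real file.
        String sample = "<posts><row Id=\"1\"><note>a</note></row><row Id=\"2\"/></posts>";
        forEachRow(new StringReader(sample), "row",
                row -> System.out.println("row " + row.getAttribute("Id")));
    }
}
```

Each fragment becomes garbage as soon as the callback returns, so only one row's DOM is in memory at a time, and the usual DOM or XPath APIs can be applied to that fragment.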
