Saxon transformation fails with java.lang.OutOfMemoryError: Java heap space

Posted: 2021-11-23 09:07:27

Question:

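I am trying to transform a large file of about 13 GB using the streaming feature of the Saxon EE library. I then want to write the transformed result to a stream and send that stream data to S3.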

If I use a ByteArrayOutputStream object to hold the StreamResult from trans.transform(streamSource, new StreamResult(output_stream)), I get an out-of-memory error.

    /Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/bin/java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 "-javaagent:/Applications/IntelliJ IDEA CE.app/Contents/lib/idea_rt.jar=55781:/Applications/IntelliJ IDEA CE.app/Contents/bin" -Dfile.encoding=UTF-8 -classpath /Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/deploy.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/jfxrt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/javaws.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/jfxswt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/management-agent.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/plugin.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/lib/ant-javafx.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/lib/javafx-mx.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/lib/jconsole.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/lib/packager.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/lib/sa-jdi.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/lib/tools.jar:/Users/gobinathgopalsamy/IdeaProjects/saxon-transform-poc/out/production/saxon-transform-poc:/Users/gobinathgopalsamy/Downloads/SaxonEE10-5J/saxon-ee-10.5.jar TransformWorker
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at net.sf.saxon.serialize.UTF8Writer.write(UTF8Writer.java:292)
    at net.sf.saxon.serialize.UTF8Writer.write(UTF8Writer.java:259)
    at net.sf.saxon.serialize.XMLEmitter.writeEscape(XMLEmitter.java:895)
    at net.sf.saxon.serialize.XMLEmitter.writeAttribute(XMLEmitter.java:589)
    at net.sf.saxon.serialize.XMLEmitter.attribute(XMLEmitter.java:503)
    at net.sf.saxon.serialize.XMLEmitter.startElement(XMLEmitter.java:423)
    at net.sf.saxon.event.NamespaceDifferencer.startElement(NamespaceDifferencer.java:71)
    at net.sf.saxon.event.ProxyReceiver.startElement(ProxyReceiver.java:139)
    at net.sf.saxon.event.SequenceNormalizer.startElement(SequenceNormalizer.java:84)
    at net.sf.saxon.event.ComplexContentOutputter.startElement(ComplexContentOutputter.java:530)
    at net.sf.saxon.event.ProxyOutputter.startElement(ProxyOutputter.java:108)
    at net.sf.saxon.event.ProxyOutputter.startElement(ProxyOutputter.java:108)
    at net.sf.saxon.event.ProxyOutputter.startElement(ProxyOutputter.java:108)
    at net.sf.saxon.event.ProxyOutputter.startElement(ProxyOutputter.java:108)
    at net.sf.saxon.tree.tiny.TinyElementImpl.copy(TinyElementImpl.java:389)
    at com.saxonica.ee.stream.feed.ComplexNodeEventFeed.append(ComplexNodeEventFeed.java:86)
    at com.saxonica.ee.stream.adjunct.BlockAdjunct$BlockFeed.append(BlockAdjunct.java:100)
    at com.saxonica.ee.stream.watch.ForEachAction$$Lambda$78/1204296383.accept(Unknown Source)
    at net.sf.saxon.om.SequenceIterator.forEachOrFail(SequenceIterator.java:136)
    at com.saxonica.ee.stream.watch.ForEachAction.append(ForEachAction.java:169)
    at com.saxonica.ee.stream.feed.NoOpenOrCloseFeed.append(NoOpenOrCloseFeed.java:38)
    at com.saxonica.ee.stream.feed.ItemFeed$$Lambda$77/405896924.accept(Unknown Source)
    at net.sf.saxon.om.SequenceIterator.forEachOrFail(SequenceIterator.java:136)
    at com.saxonica.ee.stream.feed.ItemFeed.processItems(ItemFeed.java:113)
    at com.saxonica.ee.stream.feed.AbsorptionFeed.endSelectedParentNode(AbsorptionFeed.java:86)
    at com.saxonica.ee.stream.watch.Trigger.endSelectedParentNode(Trigger.java:101)
    at com.saxonica.ee.stream.watch.WatchManager.endElement(WatchManager.java:527)
    at com.saxonica.ee.stream.ContentDetector.endElement(ContentDetector.java:47)

Process finished with exit code 1

Sample code:

import com.saxonica.config.StreamingTransformerFactory;
import net.sf.saxon.Configuration;
import net.sf.saxon.TransformerFactoryImpl;

import net.sf.saxon.s9api.*;

import javax.xml.transform.*;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.ByteArrayOutputStream;
import java.io.File;

public class TransformWorker {
    public static void main(String args[]) throws TransformerException, SaxonApiException {
        File file = new File("files/feed.xml"); // this is 13 GB file
        Source streamSource = new StreamSource(file);
        TransformerFactory factory = new StreamingTransformerFactory();
        Configuration config = ((TransformerFactoryImpl) factory).getConfiguration();
        config.isLicensedFeature(Configuration.LicenseFeature.ENTERPRISE_XSLT);
        factory.setAttribute("http://saxon.sf.net/feature/licenseFileLocation", "saxon-license.lic");
        File sheet = new File("files/feed.xsl");
        Templates templates = factory.newTemplates(new StreamSource(sheet));

        // The whole serialized result is collected in memory here
        ByteArrayOutputStream output_stream = new ByteArrayOutputStream();
        Transformer trans = templates.newTransformer();
        trans.setOutputProperty(Serializer.Property.ENCODING.toString(), "UTF-8");
        trans.setOutputProperty(Serializer.Property.METHOD.toString(), "xml");
        trans.transform(streamSource, new StreamResult(output_stream));
        // send the stream result to S3
    }
}

Please help me solve this problem.

Comments:

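So the input is 13 GB; what output size do you expect to create from it and put into a ByteArrayOutputStream? It looks as if your Java code is not running out of memory while streaming through the huge input XML, but rather while writing (part of) the output to that ByteArrayOutputStream. I am not familiar with the infrastructure you mention (S3, Amazon); is there no way to write directly to a FileOutputStream?

Can you run the transformation with Saxon EE from the command line?

Adding the Amazon/S3 tags might help to find people who know how to write large amounts of content.

What exactly does "send the stream result to S3" do? Isn't that an HTTP PUT with a request stream over which you can construct the StreamResult?

@MartinHonnen If I use a FileOutputStream, it writes the file to disk, and then I need to read the file again to send it to S3. Is there a better way to do this without writing to a file and then reading the content back?

I am not familiar with S3, so I don't know what your comment // send the stream result to S3 refers to, or whether or how it can be done without a ByteArrayOutputStream. There seems to be a REST PUT API for S3; I am not sure whether you can use a StreamResult over the request stream of the body of such a PUT request. I hope others can tell you, now that the Amazon-S3 tag has been added.

Answer 1: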

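Like Martin, I don't know much about the S3 API. The Saxon streaming transformation appears to be working. The problem is streaming the result to S3. As far as I can see, Amazon's PutObjectRequest allows you to supply either a File or an InputStream, and I guess that what you are doing is writing the data to an in-memory buffer so that you can attach an input stream to that buffer. But the buffer runs out of memory.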
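One point worth making explicit: a ByteArrayOutputStream is backed by a single byte[], and Java arrays are indexed by int, so such a buffer can hold at most about 2 GB however large the heap is. The output size is not stated in the question; the sketch below simply assumes it is of the same order as the 13 GB input. BufferLimit and fitsInByteArray are hypothetical names used only for illustration.

```java
public class BufferLimit {
    // True only if `bytes` does not exceed the largest possible Java array
    // length. (The practical JVM limit is slightly below Integer.MAX_VALUE.)
    public static boolean fitsInByteArray(long bytes) {
        return bytes >= 0 && bytes <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        // Assumption: the result is roughly as large as the 13 GB input.
        long assumedOutputBytes = 13L * 1024 * 1024 * 1024;
        System.out.println("largest byte[] length: " + Integer.MAX_VALUE);
        System.out.println("assumed output size:   " + assumedOutputBytes);
        System.out.println("fits in one byte[]:    " + fitsInByteArray(assumedOutputBytes));
    }
}
```

Under that assumption, raising -Xmx alone would not make the ByteArrayOutputStream approach work; the result has to leave memory while it is being produced.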

One solution is to use a file to buffer the data on disk rather than in memory.
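A minimal sketch of the disk-buffering idea follows. To stay self-contained it uses the JDK's built-in identity Transformer in place of the Saxon EE streaming transformer, and a hypothetical upload(InputStream) method that only counts bytes in place of the real S3 call; TempFileBuffer and transformViaTempFile are likewise illustrative names, not part of any library.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileBuffer {
    // Hypothetical stand-in for the S3 upload: drains the stream in small
    // chunks and returns the number of bytes it saw.
    public static long upload(InputStream in) throws Exception {
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }

    // Writes the transformation result to a temporary file, then streams
    // that file to the upload. Memory use stays bounded by the chunk size.
    public static long transformViaTempFile(Transformer trans, StreamSource source) throws Exception {
        Path tmp = Files.createTempFile("transform-result", ".xml");
        try {
            try (OutputStream out = Files.newOutputStream(tmp)) {
                trans.transform(source, new StreamResult(out));
            }
            // Files.size(tmp) is available here if the upload needs a content length.
            try (InputStream in = Files.newInputStream(tmp)) {
                return upload(in);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a JDK identity transformer stands in for templates.newTransformer().
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StreamSource source = new StreamSource(new StringReader("<feed><item id=\"1\"/></feed>"));
        System.out.println("uploaded bytes: " + transformViaTempFile(identity, source));
    }
}
```

The cost is one extra write and read of the result on local disk; in exchange the size of the result is known before the upload starts.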

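Another solution is to use two threads: have Saxon write to a PipedOutputStream in one thread, while your Amazon upload reads from the corresponding PipedInputStream in another thread. (Again, I'm afraid I have no personal experience of using these classes. There is a short tutorial describing them at https://flylib.com/books/en/1.134.1/communicating_between_threads_using_piped_streams.html)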
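The two-thread arrangement might look roughly like the sketch below. As before, the JDK identity Transformer stands in for the Saxon EE transformer and the hypothetical upload(InputStream) method stands in for the S3 call; PipedTransform and transformThroughPipe are illustrative names. Whether the real upload call can consume a stream of unknown length without buffering it internally is something to check in the S3 client documentation.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.StringReader;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipedTransform {
    // Hypothetical stand-in for the S3 upload: drains the stream and counts bytes.
    public static long upload(InputStream in) throws Exception {
        byte[] chunk = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
        }
        return total;
    }

    public static long transformThroughPipe(Transformer trans, StreamSource source) throws Exception {
        // Only this 64 KB pipe buffer sits between producer and consumer.
        PipedInputStream pipeIn = new PipedInputStream(64 * 1024);
        PipedOutputStream pipeOut = new PipedOutputStream(pipeIn);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Writer thread: runs the transformation into the pipe. Closing
            // the output end signals end-of-stream to the reader.
            Future<Object> writer = pool.submit(() -> {
                try (PipedOutputStream out = pipeOut) {
                    trans.transform(source, new StreamResult(out));
                }
                return null;
            });
            // Reader (current thread): consumes the pipe while it is being filled.
            long sent;
            try (InputStream in = pipeIn) {
                sent = upload(in);
            }
            writer.get(); // rethrows any failure from the transformation
            return sent;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a JDK identity transformer stands in for templates.newTransformer().
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        StreamSource source = new StreamSource(new StringReader("<feed><item id=\"1\"/></feed>"));
        System.out.println("uploaded bytes: " + transformThroughPipe(identity, source));
    }
}
```

The writer and the reader must run on different threads: a piped stream used from a single thread can block forever once the pipe buffer fills.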
