Java 8 流不可预测的性能下降，没有明显的原因

Posted 2023-04-18

技术标签:

【中文标题】Java 8 流不可预测的性能下降，没有明显的原因【英文标题】：Java 8 stream unpredictable performance drop with no obvious reason 【发布时间】：2015-08-21 15:42:45 【问题描述】：

我正在使用 Java 8 流来迭代带有子列表的列表。外部列表大小在 100 到 1000 之间变化（不同的测试运行），内部列表大小始终为 5。

有 2 次基准运行显示出意外的性能偏差。

package benchmark;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

@Threads(32)
@Warmup(iterations = 25)
@Measurement(iterations = 5)
@State(Scope.Benchmark)
@Fork(1)
@BenchmarkMode(Mode.Throughput)
public class StreamBenchmark 
    @Param("700", "600", "500", "400", "300", "200", "100")
    int outerListSizeParam;
    final static int INNER_LIST_SIZE = 5;
    List<List<Integer>> list;

    Random rand() 
        return ThreadLocalRandom.current();
    

    final BinaryOperator<Integer> reducer = (val1, val2) -> val1 + val2;

    final Supplier<List<Integer>> supplier = () -> IntStream
            .range(0, INNER_LIST_SIZE)
            .mapToObj(ptr -> rand().nextInt(100))
            .collect(Collectors.toList());

    @Setup
    public void init() throws IOException 
        list = IntStream
                .range(0, outerListSizeParam)
                .mapToObj(i -> supplier.get())
                .collect(Collectors.toList());
    

    @Benchmark
    public void loop(Blackhole bh) throws Exception 
        List<List<Integer>> res = new ArrayList<>();
        for (List<Integer> innerList : list) 
            if (innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000)) 
                res.add(innerList);
            
        

        bh.consume(res);
    

    @Benchmark
    public void stream(Blackhole bh) throws Exception 
        List<List<Integer>> res = list
                .stream()
                .filter(innerList -> innerList.stream().reduce(reducer).orElse(0) == rand().nextInt(2000))
                .collect(Collectors.toList());

        bh.consume(res);

运行 1

    Benchmark               (outerListSizeParam)   Mode  Cnt        Score        Error  Units
    StreamBenchmark.loop                     700  thrpt    5    22488.601 ?   1128.543  ops/s
    StreamBenchmark.loop                     600  thrpt    5    26010.430 ?   1161.854  ops/s
    StreamBenchmark.loop                     500  thrpt    5   361837.395 ?  12777.016  ops/s
    StreamBenchmark.loop                     400  thrpt    5   451774.457 ?  22517.801  ops/s
    StreamBenchmark.loop                     300  thrpt    5   744677.723 ?  23456.913  ops/s
    StreamBenchmark.loop                     200  thrpt    5  1102075.707 ?  38678.994  ops/s
    StreamBenchmark.loop                     100  thrpt    5  2334981.090 ? 100973.551  ops/s
    StreamBenchmark.stream                   700  thrpt    5    22320.346 ?    496.432  ops/s
    StreamBenchmark.stream                   600  thrpt    5    26091.609 ?   1044.868  ops/s
    StreamBenchmark.stream                   500  thrpt    5    31961.096 ?    497.854  ops/s
    StreamBenchmark.stream                   400  thrpt    5   377701.859 ?  11115.990  ops/s
    StreamBenchmark.stream                   300  thrpt    5    53887.652 ?   1228.245  ops/s
    StreamBenchmark.stream                   200  thrpt    5    78754.294 ?   2173.316  ops/s
    StreamBenchmark.stream                   100  thrpt    5  1564899.788 ?  47369.698  ops/s

运行 2

    Benchmark               (outerListSizeParam)   Mode  Cnt        Score       Error  Units
    StreamBenchmark.loop                    1000  thrpt   10    16179.702 ?   260.134  ops/s
    StreamBenchmark.loop                     700  thrpt   10    22924.319 ?   329.134  ops/s
    StreamBenchmark.loop                     600  thrpt   10    26871.267 ?   416.464  ops/s
    StreamBenchmark.loop                     500  thrpt   10   353043.221 ?  6628.980  ops/s
    StreamBenchmark.loop                     300  thrpt   10   772234.261 ? 10075.536  ops/s
    StreamBenchmark.loop                     100  thrpt   10  2357125.442 ? 30824.834  ops/s
    StreamBenchmark.stream                  1000  thrpt   10    15526.423 ?   147.454  ops/s
    StreamBenchmark.stream                   700  thrpt   10    22347.898 ?   117.360  ops/s
    StreamBenchmark.stream                   600  thrpt   10    26172.790 ?   229.745  ops/s
    StreamBenchmark.stream                   500  thrpt   10    31643.518 ?   428.680  ops/s
    StreamBenchmark.stream                   300  thrpt   10   536037.041 ?  6176.192  ops/s
    StreamBenchmark.stream                   100  thrpt   10   153619.054 ?  1450.839  ops/s

我有两个问题：

为什么两个测试运行的 loop+500 和 loop+600 之间存在一致且显着的性能差异？为什么在 Run1 stream+400 和 Run2 stream+300 中存在显着但不一致的性能偏差？

看起来 JIT 有时会做出次优的优化决策，从而导致性能大幅下降。

测试机有 128GB RAM 和 32 个 CPU 核心：

    java version "1.8.0_45"
    Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                32
    On-line CPU(s) list:   0-31
    Thread(s) per core:    2
    Core(s) per socket:    8
    Socket(s):             2
    NUMA node(s):          2
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 62
    Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
    Stepping:              4
    CPU MHz:               1201.078
    CPU max MHz:           3400.0000
    CPU min MHz:           1200.0000
    BogoMIPS:              5201.67
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              20480K
    NUMA node0 CPU(s):     0-7,16-23
    NUMA node1 CPU(s):     8-15,24-31

附：添加了没有流的基准。这些测试（循环 + 流 + 纯循环）让我认为使用流和 lambda 需要大量的微优化工作，并且无论如何不能保证一致的性能。

@Benchmark
public void pureLoop(Blackhole bh) throws Exception 
    List<List<Integer>> res = new ArrayList<>();
    for (List<Integer> innerList : list) 
        int sum = 0;
        for (Integer i : innerList) 
            sum += i;
        
        if (sum == rand().nextInt(2000))
            res.add(innerList);
    

    bh.consume(res);

运行 3（纯循环）

Benchmark                 (outerListSizeParam)   Mode  Cnt         Score        Error  Units
StreamBenchmark.loop                      1000  thrpt    5     15848.277 ?    445.624  ops/s
StreamBenchmark.loop                       700  thrpt    5     22330.289 ?    484.554  ops/s
StreamBenchmark.loop                       600  thrpt    5     26353.565 ?    631.421  ops/s
StreamBenchmark.loop                       500  thrpt    5    358144.956 ?   8273.981  ops/s
StreamBenchmark.loop                       400  thrpt    5    591471.382 ?  17725.212  ops/s
StreamBenchmark.loop                       300  thrpt    5    785458.022 ?  23775.650  ops/s
StreamBenchmark.loop                       200  thrpt    5   1192328.880 ?  40006.056  ops/s
StreamBenchmark.loop                       100  thrpt    5   2330555.766 ?  73143.081  ops/s
StreamBenchmark.pureLoop                  1000  thrpt    5   1024629.128 ?   4387.106  ops/s
StreamBenchmark.pureLoop                   700  thrpt    5   1495365.029 ?  31659.941  ops/s
StreamBenchmark.pureLoop                   600  thrpt    5   1787432.825 ?  16611.868  ops/s
StreamBenchmark.pureLoop                   500  thrpt    5   2087093.023 ?  20143.165  ops/s
StreamBenchmark.pureLoop                   400  thrpt    5   2662946.999 ?  33326.079  ops/s
StreamBenchmark.pureLoop                   300  thrpt    5   3657830.227 ?  55020.775  ops/s
StreamBenchmark.pureLoop                   200  thrpt    5   5365706.786 ?  64404.783  ops/s
StreamBenchmark.pureLoop                   100  thrpt    5  10477430.730 ? 187641.413  ops/s
StreamBenchmark.stream                    1000  thrpt    5     15576.304 ?    250.620  ops/s
StreamBenchmark.stream                     700  thrpt    5     22286.965 ?   1153.734  ops/s
StreamBenchmark.stream                     600  thrpt    5     26109.258 ?    296.382  ops/s
StreamBenchmark.stream                     500  thrpt    5     31343.986 ?   1270.210  ops/s
StreamBenchmark.stream                     400  thrpt    5     39696.775 ?   1812.355  ops/s
StreamBenchmark.stream                     300  thrpt    5    536932.353 ?  41249.909  ops/s
StreamBenchmark.stream                     200  thrpt    5     77797.301 ?    976.641  ops/s
StreamBenchmark.stream                     100  thrpt    5    155387.348 ?   3182.841  ops/s

解决方案：按照 apangin 的建议禁用 tiered compilation 使 JIT 结果稳定。

java -XX:-TieredCompilation -jar test-jmh.jar

Benchmark                 (outerListSizeParam)   Mode  Cnt         Score        Error  Units
StreamBenchmark.loop                      1000  thrpt    5    160410.288 ?   4426.320  ops/s
StreamBenchmark.loop                       700  thrpt    5    230524.018 ?   4426.740  ops/s
StreamBenchmark.loop                       600  thrpt    5    266266.663 ?   9078.827  ops/s
StreamBenchmark.loop                       500  thrpt    5    324182.307 ?   8452.368  ops/s
StreamBenchmark.loop                       400  thrpt    5    400793.677 ?  12526.475  ops/s
StreamBenchmark.loop                       300  thrpt    5    534618.231 ?  25616.352  ops/s
StreamBenchmark.loop                       200  thrpt    5    803314.614 ?  33108.005  ops/s
StreamBenchmark.loop                       100  thrpt    5   1827400.764 ?  13868.253  ops/s
StreamBenchmark.pureLoop                  1000  thrpt    5   1126873.129 ?  33307.600  ops/s
StreamBenchmark.pureLoop                   700  thrpt    5   1560200.150 ? 150146.319  ops/s
StreamBenchmark.pureLoop                   600  thrpt    5   1848113.823 ?  16195.103  ops/s
StreamBenchmark.pureLoop                   500  thrpt    5   2250201.116 ? 130995.240  ops/s
StreamBenchmark.pureLoop                   400  thrpt    5   2839212.063 ? 142008.523  ops/s
StreamBenchmark.pureLoop                   300  thrpt    5   3807436.825 ? 140612.798  ops/s
StreamBenchmark.pureLoop                   200  thrpt    5   5724311.256 ?  77031.417  ops/s
StreamBenchmark.pureLoop                   100  thrpt    5  11718427.224 ? 101424.952  ops/s
StreamBenchmark.stream                    1000  thrpt    5     16186.121 ?    249.806  ops/s
StreamBenchmark.stream                     700  thrpt    5     22071.884 ?    703.729  ops/s
StreamBenchmark.stream                     600  thrpt    5     25546.378 ?    472.804  ops/s
StreamBenchmark.stream                     500  thrpt    5     32271.659 ?    437.048  ops/s
StreamBenchmark.stream                     400  thrpt    5     39755.841 ?    506.207  ops/s
StreamBenchmark.stream                     300  thrpt    5     52309.706 ?   1271.206  ops/s
StreamBenchmark.stream                     200  thrpt    5     79277.532 ?   2040.740  ops/s
StreamBenchmark.stream                     100  thrpt    5    161244.347 ?   3882.619  ops/s

【问题讨论】：

当我在stream 用forEach(res::add) 替换Collectors.toList 时，他们开始使用相同的性能。如果您想研究 codegen，请查看 -XX:+PrintAssembly。当您在 NUMA 机器上运行时，您可以尝试仅在一个套接字上运行，看看趋势是否相同。确保禁用涡轮增压以进行基准测试。如果您的 CPU 以 2.6 GHz 或 3.4 GHz 运行，它确实会有所不同。还可以考虑添加一个真正的基于循环的版本，它也不在内部列表上使用减速器。 ;-) 不确定涡轮增压是否会导致 10 倍的性能差距，在每次估计之前，jvm 都会预热 25 次迭代。 @user3707125 - 您为 JVM 使用了哪些内存选项？尝试对-Xmx 和-Xms 使用不同的选项，看看使用Collectors.toList 对你的基准测试wnen 有什么影响。 【参考方案1】：

此效果是由Type Profile Pollution 引起的。让我解释一个简化的基准：

@State(Scope.Benchmark)
public class Streams 
    @Param("500", "520")
    int iterations;

    @Setup
    public void init() 
        for (int i = 0; i < iterations; i++) 
            Stream.empty().reduce((x, y) -> x);
        
    

    @Benchmark
    public long loop() 
        return Stream.empty().count();

虽然这里的iteration 参数变化非常小，并且不会影响主基准循环，但结果显示出非常令人惊讶的 2.5 倍性能下降：

Benchmark     (iterations)   Mode  Cnt      Score     Error   Units
Streams.loop           500  thrpt    5  29491,039 ± 240,953  ops/ms
Streams.loop           520  thrpt    5  11867,860 ± 344,779  ops/ms

现在让我们使用 -prof perfasm 选项运行 JMH 以查看最热门的代码区域：

快速案例（迭代次数 = 500）：

....[Hottest Methods (after inlining)]..................................
 48,66%  bench.generated.Streams_loop::loop_thrpt_jmhStub
 23,14%  <unknown>
  2,99%  java.util.stream.Sink$ChainedReference::<init>
  1,98%  org.openjdk.jmh.infra.Blackhole::consume
  1,68%  java.util.Objects::requireNonNull
  0,65%  java.util.stream.AbstractPipeline::evaluate

慢情况（迭代次数 = 520）：

....[Hottest Methods (after inlining)]..................................
 40,09%  java.util.stream.ReduceOps$ReduceOp::evaluateSequential
 22,02%  <unknown>
 17,61%  bench.generated.Streams_loop::loop_thrpt_jmhStub
  1,25%  org.openjdk.jmh.infra.Blackhole::consume
  0,74%  java.util.stream.AbstractPipeline::evaluate

看起来慢速案例在未内联的 ReduceOp.evaluateSequential 方法中花费的时间最多。此外，如果我们研究这个方法的汇编代码，我们会发现最长的操作是checkcast。

你知道 HotSpot 编译器是如何工作的：在 JIT 启动之前，一个方法在解释器中执行一段时间以收集 profile data，例如调用了哪些方法，查看了哪些类，采用了哪些分支等。通过分层编译，配置文件也收集在 C1 编译的代码中。然后使用该配置文件生成 C2 优化代码。但是，如果应用程序在中间更改了执行模式，则生成的代码对于修改后的行为可能不是最佳的。

让我们使用-XX:+PrintMethodData（在调试 JVM 中可用）来比较执行配置文件：

----- Fast case -----
java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
  interpreter_invocation_count:    13382 
  invocation_counter:              13382 
  backedge_counter:                    0 
  mdo size: 552 bytes

0 aload_1
1 fast_aload_0
2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 
  0   bci: 2    VirtualCallData     count(0) entries(1)
                                    'java/util/stream/ReduceOps$8'(12870 1.00)
5 aload_2
6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 
  48  bci: 6    VirtualCallData     count(0) entries(1)
                                    'java/util/stream/ReferencePipeline$5'(12870 1.00)
9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
  96  bci: 9    ReceiverTypeData    count(0) entries(1)
                                    'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 
  144 bci: 12   VirtualCallData     count(0) entries(1)
                                    'java/util/stream/ReduceOps$8ReducingSink'(12870 1.00)
17 areturn

----- Slow case -----
java.util.stream.ReduceOps$ReduceOp::evaluateSequential(Ljava/util/stream/PipelineHelper;Ljava/util/Spliterator;)Ljava/lang/Object;
  interpreter_invocation_count:    54751 
  invocation_counter:              54751 
  backedge_counter:                    0 
  mdo size: 552 bytes

0 aload_1
1 fast_aload_0
2 invokevirtual 3 <java/util/stream/ReduceOps$ReduceOp.makeSink()Ljava/util/stream/ReduceOps$AccumulatingSink;> 
  0   bci: 2    VirtualCallData     count(0) entries(2)
                                    'java/util/stream/ReduceOps$2'(16 0.00)
                                    'java/util/stream/ReduceOps$8'(54223 1.00)
5 aload_2
6 invokevirtual 4 <java/util/stream/PipelineHelper.wrapAndCopyInto(Ljava/util/stream/Sink;Ljava/util/Spliterator;)Ljava/util/stream/Sink;> 
  48  bci: 6    VirtualCallData     count(0) entries(2)
                                    'java/util/stream/ReferencePipeline$Head'(16 0.00)
                                    'java/util/stream/ReferencePipeline$5'(54223 1.00)
9 checkcast 5 <java/util/stream/ReduceOps$AccumulatingSink>
  96  bci: 9    ReceiverTypeData    count(0) entries(2)
                                    'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                    'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
12 invokeinterface 6 <java/util/stream/ReduceOps$AccumulatingSink.get()Ljava/lang/Object;> 
  144 bci: 12   VirtualCallData     count(0) entries(2)
                                    'java/util/stream/ReduceOps$2ReducingSink'(16 0.00)
                                    'java/util/stream/ReduceOps$8ReducingSink'(54228 1.00)
17 areturn

您看，初始化循环运行时间过长，以至于其统计信息出现在执行配置文件中：所有虚拟方法都有两个实现，而 checkcast 也有两个不同的条目。在快速情况下，配置文件不会被污染：所有站点都是单态的，JIT 可以轻松内联和优化它们。

对于您的原始基准测试也是如此：init() 方法中较长的流操作污染了配置文件。如果您使用配置文件和分层编译选项，结果可能会大不相同。例如，尝试

-XX:-ProfileInterpreter

-XX:Tier3InvocationThreshold=1000

-XX:-TieredCompilation

最后，这个问题并不是唯一的。由于配置文件污染，已经有多个与性能回归相关的 JVM 错误：JDK-8015416、JDK-8015417、JDK-8059879...希望这将在 Java 9 中得到改进。

【讨论】：

妈的，说的很详细。 你知道 HotSpot 编译器是如何工作的 不是真的，所以谢谢你的解释。使用 -XX:ReservedCodeCacheSize 也会很有趣感谢@apangin 的回答。这是否已经用 Java 9 解决了？

以上是关于Java 8 流不可预测的性能下降，没有明显的原因的主要内容，如果未能解决你的问题，请参考以下文章