Elasticsearch: How off-heap works in Elasticsearch 7.3

Posted 2021-08-01 九师兄


1. Overview

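For a long time, the largest share of resident heap memory in ES has been the FST, that is, the space taken by tip (terms index) files. 1 TB of index data occupies roughly 2 GB of heap or more, so to keep nodes stable the common wisdom has been that a single node should hold no more than 5 TB of open indices. Starting with ES 7.3, tip files are loaded via mmap, which moves the memory used by the FST from the heap to off-heap memory managed by the operating system's page cache.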

From the ES 7.3 release notes:

Also mmap terms index (.tip) files for hybridfs #43150 (issue: #42838)

Let's go through some of the details.

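hybridfs is the default store type for indices. It chooses between nio and mmap for each file according to the file type, that is, its read access pattern. So which files are actually opened with mmap? The reference manual says: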

Currently only the Lucene term dictionary, norms and doc values files are memory mapped. All other files are opened using Lucene NIOFSDirectory.

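In terms of file extensions, this means nvd (norms), dvd (doc values), tim (term dictionary), tip (term index), and cfs (compound) files are loaded with mmap, and the rest with nio. The version of the code shown below also maps dim and doc files: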

File: org.elasticsearch.index.store.FsDirectoryFactory.HybridDirectory#openInput

Method: useDelegate

boolean useDelegate(String name) {
    String extension = FileSwitchDirectory.getExtension(name);
    switch(extension) {
        // Norms, doc values and term dictionaries are typically performance-sensitive and hot in the page
        // cache, so we use mmap, which provides better performance.
        case "nvd":
        case "dvd":
        case "tim":
        // We want to open the terms index and KD-tree index off-heap to save memory, but this only performs
        // well if using mmap.
        case "tip":
        case "dim":
        // Compound files are tricky because they store all the information for the segment. Benchmarks
        // suggested that not mapping them hurts performance.
        case "cfs":
        // MMapDirectory has special logic to read long[] arrays in little-endian order that helps speed
        // up the decoding of postings. The same logic applies to positions (.pos) of offsets (.pay) but we
        // are not mmaping them as queries that leverage positions are more costly and the decoding of postings
        // tends to be less a bottleneck.
        case "doc":
            return true;
        // Other files are either less performance-sensitive (e.g. stored field index, norms metadata)
        // or are large and have a random access pattern and mmap leads to page cache trashing
        // (e.g. stored fields and term vectors).
        default:
            return false;
    }
}

Method: openInput

@Override
public IndexInput openInput(String name, IOContext context) throws IOException {
    if (useDelegate(name)) {
        // we need to do these checks on the outer directory since the inner doesn't know about pending deletes
        ensureOpen();
        ensureCanRead(name);
        // we only use the mmap to open inputs. Everything else is managed by the NIOFSDirectory otherwise
        // we might run into trouble with files that are pendingDelete in one directory but still
        // listed in listAll() from the other. We on the other hand don't want to list files from both dirs
        // and intersect for perf reasons.
        return delegate.openInput(name, context);
    } else {
        return super.openInput(name, context);
    }
}

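You might wonder: why does reading tip files through mmap amount to going off-heap? HBase's off-heap implementation has to move data into dedicated off-heap data structures, so why doesn't ES need to?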

2. The FST lookup process

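In the on-heap case, Lucene reads the contents of the tip file into an array. During an FST lookup it seeks to a position, reads a few bytes, seeks again, and reads again, so in effect it parses the data as it reads it.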

FST::findTargetArc

private Arc<T> findTargetArc(BytesReader in,...)
{
    //....
    in.setPosition(follow.target);
    arc.numArcs = in.readVInt();
    arc.bytesPerArc = in.readVInt();
    arc.posArcsStart = in.getPosition();
    arc.nextArc = arc.posArcsStart;
    //....
}

In the on-heap case, initializing this BytesReader simply means reading the file into an array:

OnHeapFSTStore::init

public void init(DataInput in, long numBytes)
{
    bytesArray = new byte[(int) numBytes];
    in.readBytes(bytesArray, 0, bytesArray.length);
}

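So in the off-heap case, the mmap'ed region can simply be read as if it were an array. To see what percentage of a file is cached in the page cache, you can use tools such as vmtouch (recommended), pcstat, hcache, or fincore.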
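To make the "read it like an array" point concrete, here is a minimal sketch in plain Java NIO. It is not Lucene code: the class MmapVsHeapDemo and the ByteSource interface are hypothetical names invented for this illustration. It only shows that a heap byte[] and a memory-mapped buffer can sit behind the same seek-then-read interface.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; not part of Lucene or Elasticsearch.
final class MmapVsHeapDemo {

    // Minimal random-access reader, loosely modeled on the seek-then-read pattern.
    interface ByteSource {
        byte byteAt(int position);
    }

    // On-heap: the whole file is copied into a byte[] on the JVM heap.
    static ByteSource onHeap(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return position -> bytes[position];
    }

    // Off-heap: the file is memory-mapped; the OS loads pages on first access
    // and keeps them in the page cache, outside the JVM heap.
    static ByteSource offHeap(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // The mapping stays valid after the channel is closed.
            return position -> mapped.get(position);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fst-demo", ".tip");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[] {10, 20, 30, 40, 50});

        ByteSource heap = onHeap(file);
        ByteSource mapped = offHeap(file);
        // Jump around the file the way an FST lookup would.
        for (int pos : new int[] {3, 0, 4}) {
            System.out.println(pos + ": heap=" + heap.byteAt(pos)
                + " mmap=" + mapped.byteAt(pos));
        }
    }
}
```

The calling code is identical in both cases; only where the bytes live differs.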

To confirm whether a particular tip file is being read via mmap, use the pmap command; files mapped with mmap are listed in its output.
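On Linux the same information is exposed in /proc/<pid>/maps, where each line has the form "address perms offset dev inode pathname". The following is a rough sketch of filtering such lines for mapped tip files. The class name MappedFileFinder and the sample paths are made up for illustration, and the parser assumes the standard maps line layout.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; parses lines in the /proc/<pid>/maps format.
final class MappedFileFinder {

    // Returns the distinct mapped file paths that end with the given extension.
    static List<String> mappedFilesWithExtension(List<String> mapsLines, String extension) {
        Set<String> result = new LinkedHashSet<>();
        for (String line : mapsLines) {
            // Fields: address perms offset dev inode [pathname]
            // The limit of 6 keeps a pathname containing spaces in one piece.
            String[] fields = line.trim().split("\\s+", 6);
            if (fields.length == 6 && fields[5].endsWith(extension)) {
                result.add(fields[5]);
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines;
        if (args.length > 0) {
            // e.g. java MappedFileFinder.java /proc/12345/maps
            lines = Files.readAllLines(Paths.get(args[0]));
        } else {
            // Made-up sample lines, used when no maps file is given.
            lines = Arrays.asList(
                "7f2c4c000000-7f2c4c021000 r--s 00000000 08:01 1234567 /data/index/_0.tip",
                "7f2c4c100000-7f2c4c200000 r--s 00000000 08:01 1234568 /data/index/_0.tim",
                "7ffd1a2b3000-7ffd1a2d4000 rw-p 00000000 00:00 0 [stack]");
        }
        for (String path : mappedFilesWithExtension(lines, ".tip")) {
            System.out.println(path);
        }
    }
}
```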

3. Effect of moving tip off-heap

Using the geonames dataset, write 1 TB of index data, then check the segments.memory value through the _cat/segments API to compare memory usage before and after going off-heap:

store.type	segments.memory
niofs	4.7GB
hybridfs	1.06GB

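JVM memory usage dropped by roughly 77% going by the figures above ((4.7 - 1.06) / 4.7 ≈ 0.774). Results differ from one data sample to another, and other datasets may see a larger reduction.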

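The segments.memory metric reported by _cat/segments is somewhat lower than the JVM memory actually used, but the gap is small, so the results above are a reasonable reference.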

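Because the off-heap memory is now managed by the operating system's page cache, the operating system decides when pages are evicted and the process has no control over it. If the contents of a tip file are evicted from the page cache, FST lookups involve disk IO, which has a fairly large impact on query latency.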

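On Linux the page cache is reclaimed in two situations. The first is when free memory runs low: the system automatically reclaims data held in the page cache, which may include mmap'ed tip files.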

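The second is manual reclaim, by writing to /proc/sys/vm/drop_caches or calling posix_fadvise. In this case, if the index is open, the tip data mapped into the page cache through mmap is not reclaimed. If the index is closed, it is reclaimed completely.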

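When an FST lookup involves disk IO, query latency is noticeably higher. At present there is no way to see how much of a query's time was spent on disk IO; the only visible sign is a longer create_weight time in the profile API.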

In the Linux 2.6.34 kernel, page cache reclaim uses a two-list strategy (see Linux Kernel Development, 3rd Edition). The algorithm is roughly as follows:

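The algorithm maintains two lists, an active list and an inactive list. Pages are added at the tail and removed from the head of both lists, and pages are evicted only from the inactive list. For file cache pages, the first access puts the page on the inactive list and a second access promotes it to the active list. When the active list grows larger than the inactive list, pages at the head of the active list are demoted to the inactive list.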
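The description above can be turned into a small simulation. This is a toy model of the two-list rule as stated here, not kernel code: the class TwoListPageCache, its fixed page capacity, and the choice to leave an already-active page in place on re-access are all assumptions made for the sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the active/inactive two-list policy; not the kernel's implementation.
final class TwoListPageCache {
    private final Deque<String> active = new ArrayDeque<>();
    private final Deque<String> inactive = new ArrayDeque<>();
    private final int capacity; // assumed fixed page budget for the simulation

    TwoListPageCache(int capacity) {
        this.capacity = capacity;
    }

    void access(String page) {
        if (active.contains(page)) {
            // Already active: left in place in this sketch (an assumption).
        } else if (inactive.remove(page)) {
            active.addLast(page);      // second access: promote to the active tail
        } else {
            inactive.addLast(page);    // first access: enter at the inactive tail
        }
        // Demote from the active head while the active list is the larger one.
        while (active.size() > inactive.size()) {
            inactive.addLast(active.removeFirst());
        }
        // Evict only from the inactive head when over budget.
        while (active.size() + inactive.size() > capacity && !inactive.isEmpty()) {
            inactive.removeFirst();
        }
    }

    boolean contains(String page) {
        return active.contains(page) || inactive.contains(page);
    }

    boolean isActive(String page) {
        return active.contains(page);
    }

    @Override
    public String toString() {
        return "active=" + active + " inactive=" + inactive;
    }

    public static void main(String[] args) {
        TwoListPageCache cache = new TwoListPageCache(4);
        for (String page : new String[] {"a", "b", "c", "d", "a", "b", "e"}) {
            cache.access(page);
            System.out.println("access " + page + " -> " + cache);
        }
    }
}
```

In this run, pages a and b are touched twice and survive the arrival of e, while c, which was touched only once, is the page that gets evicted. A frequently used tip file benefits from the same effect.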

For more on the page cache, see: https://linux-mm.org/PageReplacementDesign

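With mmap, a file descriptor is mapped to a pointer (in effect a byte array) that the process accesses directly, and the disk is read only when the process touches the corresponding position, so data is loaded on demand. You might therefore expect _open on an index to be faster: nio had to read the whole file into heap memory, whereas mmap returns immediately and nothing is loaded into the page cache until the index is first queried. In practice _open is not any faster, because while opening an index Lucene verifies the file checksum, which reads the entire file once: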

BlockTreeTermsReader:BlockTreeTermsReader()-> CodecUtil.checksumEntireFile(indexIn);

ChecksumIndexInput in = new BufferedChecksumIndexInput(clone);
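// Reads the file up to the target position and updates the checksum along the way.
// Naming this function seek is surprising; even with Lucene's experts behind it, I can't help complaining.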
in.seek(in.length() - footerLength());
return checkFooter(in);
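To illustrate why a checksum pass defeats lazy loading, here is a small standalone sketch using the JDK's CRC32. It is not Lucene's implementation: the class ChecksumWholeFileDemo is a hypothetical name, and using CRC32 over the whole file is an assumption standing in for the footer check. The point is only that computing a checksum requires every byte to be read.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

// Hypothetical demo; shows that a checksum pass touches every byte of a file.
final class ChecksumWholeFileDemo {

    // Returns {checksum value, number of bytes read}.
    static long[] checksum(Path file) throws IOException {
        long bytesRead = 0;
        try (CheckedInputStream in =
                 new CheckedInputStream(Files.newInputStream(file), new CRC32())) {
            byte[] buffer = new byte[8192];
            int n;
            // The checksum is updated as a side effect of reading.
            while ((n = in.read(buffer)) != -1) {
                bytesRead += n;
            }
            return new long[] {in.getChecksum().getValue(), bytesRead};
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("checksum-demo", ".tip");
        file.toFile().deleteOnExit();
        Files.write(file, new byte[100_000]);

        long[] result = checksum(file);
        System.out.println("file size  = " + Files.size(file));
        System.out.println("bytes read = " + result[1]);
        System.out.println("crc32      = " + Long.toHexString(result[0]));
    }
}
```

The number of bytes read equals the file size, so after such a pass the whole file has gone through the page cache regardless of how it will be accessed later.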

4. Should the _id field go off-heap?

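Lucene supports off-heap settings per field. When ES 7.3 moved tip off-heap, the _id field was not included; #52518 mentions that this was out of concern that indexing would slow down. After some testing, however, the impact turned out to be small.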

In general, the indexing rate is only affected if explicit IDs are used, as
otherwise Elasticsearch almost never performs lookups in the terms
dictionary for the purpose of indexing. So it’s quite wasteful to
require the terms index of _id to be loaded on-heap for users who have
append-only workloads. Furthermore I’ve been conducting benchmarks when
indexing with explicit ids on the http_logs dataset that suggest that
the slowdown is low enough that it’s probably not worth forcing the terms
index to be kept on-heap

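As an aside: the passage above says that indexing with externally supplied doc ids requires lookups in the term dictionary. This is because, when a write carries an external id, ES has to determine whether that id already exists in order to perform either an update or an append. It therefore runs a Lucene seekExact lookup on the _id field within the shard to check for the id, which is why indexing with external ids is somewhat slower (around 20%). This is also one reason the _id field needs to be written into an FST.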
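The difference between the two write paths can be sketched as follows. This is a deliberately simplified model, not Elasticsearch code: the class ExplicitIdIndexingSketch is hypothetical, and a TreeMap stands in for the _id terms dictionary. It only shows that explicit ids force an existence lookup per write, while auto-generated ids do not.

```java
import java.util.TreeMap;

// Simplified model of the two indexing paths; not Elasticsearch code.
final class ExplicitIdIndexingSketch {
    // Stand-in for the _id terms dictionary of a shard (an assumption of this sketch).
    private final TreeMap<String, String> docsById = new TreeMap<>();
    private long lookups = 0;
    private long autoIdCounter = 0;

    // Explicit id: an existence check is needed to choose update vs. append.
    String indexWithExplicitId(String id, String doc) {
        lookups++;
        boolean exists = docsById.containsKey(id);
        docsById.put(id, doc);
        return exists ? "updated" : "created";
    }

    // Auto-generated id: known to be new, so no lookup is performed.
    String indexWithAutoId(String doc) {
        String id = "auto-" + (autoIdCounter++);
        docsById.put(id, doc);
        return id;
    }

    long lookups() {
        return lookups;
    }

    public static void main(String[] args) {
        ExplicitIdIndexingSketch shard = new ExplicitIdIndexingSketch();
        System.out.println(shard.indexWithExplicitId("42", "first version"));
        System.out.println(shard.indexWithExplicitId("42", "second version"));
        System.out.println(shard.indexWithAutoId("log line"));
        System.out.println("lookups = " + shard.lookups());
    }
}
```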

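After moving the _id field off-heap, an indexing test using the http_logs dataset with external ids showed that indexing throughput dropped by 1.8% while JVM memory usage dropped by a factor of 100.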

As a result, ES 7.7 moves the _id field off-heap as well.

Closing remarks

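Moving the FST off-heap lets a node hold more data, which significantly raises the data volume an ES cluster can handle. However, keeping tip files in memory matters far more than it does for tim and similar files, and the page cache will always need to reclaim pages at some point, so nobody can guarantee that tip data is never evicted. Overall, this makes query latency less predictable, and such slowdowns are hard to reproduce and diagnose. There is no need to worry too much, though, since this rarely happens.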

Thanks to 军义 and 张鑫刚 (Xiaomi) for discussing several of these questions.

References

https://github.com/elastic/elasticsearch/issues/38390
https://github.com/elastic/elasticsearch/pull/42838
https://github.com/elastic/elasticsearch/pull/43150
https://github.com/elastic/elasticsearch/pull/52518
https://www.elastic.co/guide/en/elasticsearch/reference/current/release-notes-7.3.0.html