Wang Jialin (王家林) on Spark Performance Optimization, Season 9: Spark Tungsten Memory Usage Explained in Depth
Contents:
1. What exactly a Page is;
2. The two concrete implementations of a Page;
3. A source-level walkthrough of how Pages are used;
==========What exactly is a Page in Tungsten?============
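1. Spark does not actually have a class named Page. In essence, a Page is a data structure (in the same sense as a Stack or a List). From the OS point of view, a Page is a block of memory in which data can be stored, and the OS holds many such Pages. To fetch data, you first determine which Page holds it; once that Page is found, you read the data out of it according to a fixed rule, for example the data's offset and length.
2. So what is a Page in Spark? Reading the source, start at TaskMemoryManager.java and follow it into MemoryBlock.java: MemoryBlock is the Page.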
import javax.annotation.Nullable;

import org.apache.spark.unsafe.Platform;

public class MemoryBlock extends MemoryLocation {

  // Size of the block in bytes.
  private final long length;

  /**
   * Optional page number; used when this MemoryBlock represents a page allocated by a
   * TaskMemoryManager. This field is public so that it can be modified by the TaskMemoryManager,
   * which lives in a different package.
   */
  public int pageNumber = -1;

  // obj is the backing object for on-heap memory and null for off-heap memory.
  public MemoryBlock(@Nullable Object obj, long offset, long length) {
    super(obj, offset);
    this.length = length;
  }

  /**
   * Returns the size of the memory block.
   */
  public long size() {
    return length;
  }

  /**
   * Creates a memory block pointing to the memory used by the long array.
   */
  public static MemoryBlock fromLongArray(final long[] array) {
    // 8L keeps the multiplication in long arithmetic, so large arrays cannot overflow an int.
    return new MemoryBlock(array, Platform.LONG_ARRAY_OFFSET, array.length * 8L);
  }
}
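MemoryBlock represents a Page. Its data may live on-heap or off-heap, which is why the first constructor argument above is nullable: an on-heap block has a backing object, while an off-heap block has none.
For on-heap memory, allocation is done by HeapMemoryAllocator: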
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
import javax.annotation.concurrent.GuardedBy;

import org.apache.spark.unsafe.Platform;

/**
 * A simple {@link MemoryAllocator} that can allocate up to 16GB using a JVM long primitive array.
 */
public class HeapMemoryAllocator implements MemoryAllocator {

  // Pools of previously freed blocks, keyed by block size in bytes.
  @GuardedBy("this")
  private final Map<Long, LinkedList<WeakReference<MemoryBlock>>> bufferPoolsBySize =
    new HashMap<>();

  private static final int POOLING_THRESHOLD_BYTES = 1024 * 1024;

  /**
   * Returns true if allocations of the given size should go through the pooling mechanism and
   * false otherwise.
   */
  private boolean shouldPool(long size) {
    // Very small allocations are less likely to benefit from pooling.
    return size >= POOLING_THRESHOLD_BYTES;
  }

  @Override
  public MemoryBlock allocate(long size) throws OutOfMemoryError {
    if (shouldPool(size)) {
      synchronized (this) {
        final LinkedList<WeakReference<MemoryBlock>> pool = bufferPoolsBySize.get(size);
        if (pool != null) {
          // Reuse the first pooled block that has not been garbage collected.
          while (!pool.isEmpty()) {
            final WeakReference<MemoryBlock> blockReference = pool.pop();
            final MemoryBlock memory = blockReference.get();
            if (memory != null) {
              assert (memory.size() == size);
              return memory;
            }
          }
          bufferPoolsBySize.remove(size);
        }
      }
    }
    // Round the byte size up to a whole number of 8-byte longs.
    long[] array = new long[(int) ((size + 7) / 8)];
    return new MemoryBlock(array, Platform.LONG_ARRAY_OFFSET, size);
  }

  // The rest of the class is not shown in this excerpt.
}
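To make the pooling logic above easier to follow, here is a minimal standalone sketch of the same idea. It is not Spark code: the class name PooledHeapSketch and its methods are hypothetical, it hands out raw long[] arrays instead of MemoryBlock objects, and the free() side is an assumption, since the excerpt above only shows allocate().

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical sketch: size-keyed pooling of long[] buffers behind weak references.
final class PooledHeapSketch {
    static final long POOLING_THRESHOLD_BYTES = 1024 * 1024;

    // Freed buffers grouped by requested size in bytes.
    private final Map<Long, LinkedList<WeakReference<long[]>>> pools = new HashMap<>();

    // Rounds a byte count up to a whole number of 8-byte words.
    static int wordsFor(long sizeInBytes) {
        return (int) ((sizeInBytes + 7) / 8);
    }

    synchronized long[] allocate(long sizeInBytes) {
        if (sizeInBytes >= POOLING_THRESHOLD_BYTES) {
            LinkedList<WeakReference<long[]>> pool = pools.get(sizeInBytes);
            if (pool != null) {
                while (!pool.isEmpty()) {
                    long[] cached = pool.pop().get();
                    if (cached != null) {
                        return cached; // reused buffer; contents are NOT zeroed
                    }
                }
                pools.remove(sizeInBytes); // every pooled buffer was already collected
            }
        }
        return new long[wordsFor(sizeInBytes)];
    }

    // Assumed counterpart to allocate(): large buffers go back into the pool.
    synchronized void free(long[] array, long sizeInBytes) {
        if (sizeInBytes >= POOLING_THRESHOLD_BYTES) {
            pools.computeIfAbsent(sizeInBytes, k -> new LinkedList<>())
                 .add(new WeakReference<>(array));
        }
    }

    public static void main(String[] args) {
        PooledHeapSketch allocator = new PooledHeapSketch();
        long size = 2L * 1024 * 1024;
        long[] first = allocator.allocate(size);
        allocator.free(first, size);
        long[] second = allocator.allocate(size);
        System.out.println("reused pooled buffer: " + (first == second));
    }
}
```

Because the pool holds only weak references, a pooled buffer can be reclaimed by the garbage collector at any time, which is why the loop has to skip over references that have already been cleared.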
For off-heap memory, allocation is done by UnsafeMemoryAllocator:
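import org.apache.spark.unsafe.Platform;

/**
 * A simple {@link MemoryAllocator} that uses {@code Unsafe} to allocate off-heap memory.
 */
public class UnsafeMemoryAllocator implements MemoryAllocator {

  @Override
  public MemoryBlock allocate(long size) throws OutOfMemoryError {
    // Off-heap: there is no backing object, so obj is null and offset is the raw address.
    long address = Platform.allocateMemory(size);
    return new MemoryBlock(null, address, size);
  }

  // The rest of the class is not shown in this excerpt.
}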
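The nullable first argument of MemoryBlock follows the (base object, offset) convention of sun.misc.Unsafe: with a non-null base the offset is relative to that object, and with a null base the offset is an absolute address. The sketch below illustrates that convention directly. It is not Spark code, the class name BaseOffsetSketch and its methods are hypothetical, and it assumes a JDK on which sun.misc.Unsafe is still reachable through reflection (recent JDK releases deprecate these memory-access methods and may warn about them or disable them).

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical sketch: one (base, offset) addressing scheme for on-heap and off-heap memory.
final class BaseOffsetSketch {
    private static final Unsafe UNSAFE = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available", e);
        }
    }

    // On-heap: base is the long[] itself, offset is relative to the start of the object.
    static long onHeapRoundTrip(long value) {
        long[] array = new long[4];
        long offset = Unsafe.ARRAY_LONG_BASE_OFFSET + 2L * Long.BYTES; // third element
        UNSAFE.putLong(array, offset, value);
        return array[2]; // the write landed in the ordinary Java array
    }

    // Off-heap: base is null, offset is an absolute native address.
    static long offHeapRoundTrip(long value) {
        long address = UNSAFE.allocateMemory(4L * Long.BYTES);
        try {
            long offset = address + 2L * Long.BYTES;
            UNSAFE.putLong(null, offset, value);
            return UNSAFE.getLong(null, offset);
        } finally {
            UNSAFE.freeMemory(address); // off-heap memory must be released by hand
        }
    }

    public static void main(String[] args) {
        System.out.println("on-heap:  " + onHeapRoundTrip(42L));
        System.out.println("off-heap: " + offHeapRoundTrip(42L));
    }
}
```

The point of the shared convention is that code reading or writing a Page does not need separate paths for the two memory modes; only the allocator differs.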
==========How are Pages used?============
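1. TaskMemoryManager wraps Pages in order to locate data. For on-heap memory it first finds the backing object and then uses the offset within that object to reach the exact address; for off-heap memory it locates the address directly.
2. A key question is how a given piece of data is identified. This requires a concrete addressing scheme.
TaskMemoryManager is already written with a future in mind in which a single machine has about 32 TB of memory.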
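One way to picture the addressing scheme described above is to pack a page number and an offset within that page into a single 64-bit value, then resolve it through a page table. The sketch below is an illustration under stated assumptions, not Spark's implementation: the class PageAddressSketch and its methods are hypothetical, pages are plain long[] arrays (the on-heap case only), and the split of 13 bits for the page number and 51 bits for the offset is taken here as an assumption about how TaskMemoryManager divides the address.

```java
// Hypothetical sketch: encode (page number, offset) in one long and resolve it via a page table.
final class PageAddressSketch {
    static final int PAGE_NUMBER_BITS = 13;                 // assumed: up to 8192 pages
    static final int OFFSET_BITS = 64 - PAGE_NUMBER_BITS;   // remaining 51 bits
    static final long OFFSET_MASK = (1L << OFFSET_BITS) - 1;

    // Slot i holds the array backing page i, or null if the slot is free.
    private final long[][] pageTable = new long[1 << PAGE_NUMBER_BITS][];

    static long encode(int pageNumber, long offsetInPage) {
        return (((long) pageNumber) << OFFSET_BITS) | (offsetInPage & OFFSET_MASK);
    }

    static int decodePageNumber(long address) {
        return (int) (address >>> OFFSET_BITS); // unsigned shift: the top bit may be set
    }

    static long decodeOffset(long address) {
        return address & OFFSET_MASK;
    }

    // Allocates an on-heap page in the first free slot and returns its page number.
    int allocatePage(int sizeInWords) {
        for (int i = 0; i < pageTable.length; i++) {
            if (pageTable[i] == null) {
                pageTable[i] = new long[sizeInWords];
                return i;
            }
        }
        throw new IllegalStateException("page table is full");
    }

    // Offsets are byte offsets from the start of the page and must be multiples of 8 here.
    void putLong(long address, long value) {
        pageTable[decodePageNumber(address)][(int) (decodeOffset(address) / 8)] = value;
    }

    long getLong(long address) {
        return pageTable[decodePageNumber(address)][(int) (decodeOffset(address) / 8)];
    }

    public static void main(String[] args) {
        PageAddressSketch manager = new PageAddressSketch();
        int page = manager.allocatePage(16);
        long address = encode(page, 24);
        manager.putLong(address, 12345L);
        System.out.println("page=" + decodePageNumber(address)
            + " offset=" + decodeOffset(address)
            + " value=" + manager.getLong(address));
    }
}
```

Storing a page number rather than a raw object address keeps the encoded value valid even if the garbage collector moves the backing array, because the lookup always goes through the page table first.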
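Wang Jialin's contact card:
China's number-one Spark expert
Sina Weibo: http://weibo.com/ilovepains
WeChat official account: DT_Spark
Blog: http://blog.sina.com.cn/ilovepains
Mobile: 18610086859
QQ: 1740415547
This article comes from the “一枝花傲寒” blog; reposting is not permitted.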