06 - LevelDB Implementation: sstable

Posted by anda0109

The sstable file format is as follows:

<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1]
...
[meta block K]
[metaindex block]
[index block]
[Footer]        (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>

The file contains internal pointers. Each such pointer is called a BlockHandle and contains the following information:

offset:   varint64
size:     varint64

See varints for an explanation of varint64 format.
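
To make the BlockHandle encoding concrete, here is a minimal sketch of varint64 coding (7 payload bits per byte, low-order group first, high bit set on every byte except the last) and of a handle built from two such values. The helpers and the Handle struct are written from scratch for illustration and are only assumed to match the on-disk format described above; they are not copied from LevelDB's sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append v to *dst as a varint64: 7 payload bits per byte, low-order group
// first; a set high bit means "more bytes follow".
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Decode one varint64 starting at src[*pos] and advance *pos past it.
// Returns false if the input is truncated or longer than 10 bytes.
bool GetVarint64(const std::string& src, size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;
}

// Illustrative stand-in for a BlockHandle: where a block starts and how long it is.
struct Handle {
  uint64_t offset;
  uint64_t size;
};

std::string EncodeHandle(const Handle& h) {
  std::string s;
  PutVarint64(&s, h.offset);
  PutVarint64(&s, h.size);
  assert(s.size() <= 20);  // two varint64s, at most 10 bytes each
  return s;
}

bool DecodeHandle(const std::string& s, Handle* h) {
  size_t pos = 0;
  return GetVarint64(s, &pos, &h->offset) && GetVarint64(s, &pos, &h->size);
}
```

Small offsets and sizes therefore cost only a few bytes: a handle for a block at offset 4096 with size 1234 encodes to 4 bytes, while the worst case is 20 bytes.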

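  1. The key/value pairs in the file are stored in sorted order and
    partitioned into a sequence of data blocks. These blocks come one
    after another at the beginning of the file. Each data block is
    formatted according to the code in block_builder.cc, and then
    optionally compressed.

  2. After the data blocks come the meta blocks. The supported meta
    block types are described below. More meta block types may be added
    in the future. Each meta block is again formatted using
    block_builder.cc and then optionally compressed.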

  3. A “metaindex” block. It contains one entry for every other meta
    block where the key is the name of the meta block and the value is a
    BlockHandle pointing to that meta block.

  4. An “index” block. This block contains one entry per data block,
    where the key is a string >= last key in that data block and before
    the first key in the successive data block. The value is the
    BlockHandle for the data block.

  5. At the very end of the file is a fixed length footer that contains
    the BlockHandle of the metaindex and index blocks as well as a magic number.

     metaindex_handle: char[p];     // Block handle for metaindex
     index_handle:     char[q];     // Block handle for index
     padding:          char[40-p-q];// zeroed bytes to make fixed length
                                    // (40==2*BlockHandle::kMaxEncodedLength)
     magic:            fixed64;     // == 0xdb4775248b80fb57 (little-endian)
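
The footer layout can be sketched as follows: two varint-encoded handles, zero padding up to 40 bytes, then the magic number as 8 little-endian bytes, for 48 bytes in total. All function names here are illustrative assumptions; only the layout and the magic value come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr size_t kMaxHandleLength = 20;  // two varint64s, 10 bytes each
constexpr size_t kFooterLength = 2 * kMaxHandleLength + 8;  // 48 bytes
constexpr uint64_t kMagic = 0xdb4775248b80fb57ull;

void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Append a 64-bit value as 8 bytes, least significant byte first.
void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

uint64_t GetFixed64(const std::string& src, size_t pos) {
  assert(pos + 8 <= src.size());
  uint64_t v = 0;
  for (int i = 0; i < 8; i++) {
    v |= static_cast<uint64_t>(static_cast<unsigned char>(src[pos + i])) << (8 * i);
  }
  return v;
}

// Encode the metaindex handle, the index handle, padding, and the magic.
std::string EncodeFooter(uint64_t meta_offset, uint64_t meta_size,
                         uint64_t index_offset, uint64_t index_size) {
  std::string s;
  PutVarint64(&s, meta_offset);
  PutVarint64(&s, meta_size);
  PutVarint64(&s, index_offset);
  PutVarint64(&s, index_size);
  s.resize(2 * kMaxHandleLength, '\0');  // zeroed padding up to 40 bytes
  PutFixed64(&s, kMagic);
  assert(s.size() == kFooterLength);
  return s;
}

// A reader can locate the footer from the file size alone and check the
// magic stored in the last 8 bytes before trusting the handles.
bool HasTableMagic(const std::string& file) {
  return file.size() >= kFooterLength &&
         GetFixed64(file, file.size() - 8) == kMagic;
}
```

Because the footer has a fixed length, a reader never needs to scan the file: it seeks to file_size - 48, validates the magic, and then decodes the two handles to find the metaindex and index blocks.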
    

“filter” Meta Block

If a FilterPolicy was specified when the database was opened, a
filter block is stored in each table. The “metaindex” block contains
an entry that maps from filter.<N> to the BlockHandle for the filter
block where <N> is the string returned by the filter policy’s
Name() method.

The filter block stores a sequence of filters, where filter i contains
the output of FilterPolicy::CreateFilter() on all keys that are stored
in a block whose file offset falls within the range

[ i*base ... (i+1)*base-1 ]

Currently, “base” is 2KB. So for example, if blocks X and Y start in
the range [ 0KB .. 2KB-1 ], all of the keys in X and Y will be
converted to a filter by calling FilterPolicy::CreateFilter(), and the
resulting filter will be stored as the first filter in the filter
block.
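
The mapping from a data block's file offset to its filter index is a single shift. The following minimal sketch assumes the 2KB base stated above; the constant and function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kFilterBaseLg = 11;  // lg(2KB)
constexpr uint64_t kFilterBase = 1ull << kFilterBaseLg;

// Index of the filter that covers a data block starting at block_offset.
uint64_t FilterIndexFor(uint64_t block_offset) {
  return block_offset >> kFilterBaseLg;  // same as block_offset / kFilterBase
}

// Worked example: two blocks starting inside [0KB .. 2KB-1] share filter 0,
// while a block starting at exactly 2KB falls into filter 1.
void FilterIndexExample() {
  assert(FilterIndexFor(0) == 0);
  assert(FilterIndexFor(1500) == 0);
  assert(FilterIndexFor(kFilterBase - 1) == 0);
  assert(FilterIndexFor(kFilterBase) == 1);
}
```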

The filter block is formatted as follows:

[filter 0]
[filter 1]
[filter 2]
...
[filter N-1]

[offset of filter 0]                  : 4 bytes
[offset of filter 1]                  : 4 bytes
[offset of filter 2]                  : 4 bytes
...
[offset of filter N-1]                : 4 bytes

[offset of beginning of offset array] : 4 bytes
lg(base)                              : 1 byte

The offset array at the end of the filter block allows efficient
mapping from a data block offset to the corresponding filter.
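
As a rough illustration of that lookup, the sketch below builds a filter block in the layout shown above and then finds the filter for a given data block offset. It assumes the 4-byte fields are little-endian; the function names are hypothetical and the filters are plain byte strings rather than real FilterPolicy output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

void PutFixed32(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

uint32_t GetFixed32(const std::string& src, size_t pos) {
  assert(pos + 4 <= src.size());
  uint32_t v = 0;
  for (int i = 0; i < 4; i++) {
    v |= static_cast<uint32_t>(static_cast<unsigned char>(src[pos + i])) << (8 * i);
  }
  return v;
}

// Lay out the filters, then the offset array, then the offset of that array
// and lg(base) as the final byte.
std::string BuildFilterBlock(const std::vector<std::string>& filters,
                             uint8_t base_lg) {
  std::string block;
  std::vector<uint32_t> offsets;
  for (const std::string& f : filters) {
    offsets.push_back(static_cast<uint32_t>(block.size()));
    block.append(f);
  }
  const uint32_t array_offset = static_cast<uint32_t>(block.size());
  for (uint32_t off : offsets) PutFixed32(&block, off);
  PutFixed32(&block, array_offset);
  block.push_back(static_cast<char>(base_lg));
  return block;
}

// Copy the filter covering the data block at block_offset into *filter.
// Returns false if the block is malformed or the index is out of range.
bool FindFilter(const std::string& block, uint64_t block_offset,
                std::string* filter) {
  const size_t n = block.size();
  if (n < 5) return false;  // need at least the array offset and lg(base)
  const unsigned base_lg = static_cast<unsigned char>(block[n - 1]);
  if (base_lg > 63) return false;
  const size_t array_offset = GetFixed32(block, n - 5);
  if (array_offset > n - 5) return false;
  const size_t num_filters = (n - 5 - array_offset) / 4;
  const uint64_t index = block_offset >> base_lg;
  if (index >= num_filters) return false;
  const size_t slot = array_offset + 4 * static_cast<size_t>(index);
  const uint32_t start = GetFixed32(block, slot);
  // Filter i ends where filter i+1 starts. For the last filter, the next
  // 4-byte slot is the array offset itself, which is also where it ends.
  const uint32_t limit = GetFixed32(block, slot + 4);
  if (start > limit || limit > array_offset) return false;
  filter->assign(block, start, limit - start);
  return true;
}
```

The lookup touches only the last five bytes and two array slots, so its cost does not depend on how many filters the block holds.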

“stats” Meta Block

This meta block contains a bunch of stats. The key is the name
of the statistic. The value contains the statistic.

TODO(postrelease): record following stats.

data size
index size
key size (uncompressed)
value size (uncompressed)
number of entries
number of data blocks

The main code for building an sstable is as follows:

class LEVELDB_EXPORT TableBuilder {
 public:
  // Create a builder that will store the contents of the table it is
  // building in *file.  Does not close the file.  It is up to the
  // caller to close the file after calling Finish().
  TableBuilder(const Options& options, WritableFile* file);

  TableBuilder(const TableBuilder&) = delete;
  TableBuilder& operator=(const TableBuilder&) = delete;

  // REQUIRES: Either Finish() or Abandon() has been called.
  ~TableBuilder();

  // Change the options used by this builder.  Note: only some of the
  // option fields can be changed after construction.  If a field is
  // not allowed to change dynamically and its value in the structure
  // passed to the constructor is different from its value in the
  // structure passed to this method, this method will return an error
  // without changing any fields.
  Status ChangeOptions(const Options& options);

  // Add key,value to the table being constructed.
  // REQUIRES: key is after any previously added key according to comparator.
  // REQUIRES: Finish(), Abandon() have not been called
  void Add(const Slice& key, const Slice& value);

  // Advanced operation: flush any buffered key/value pairs to file.
  // Can be used to ensure that two adjacent entries never live in
  // the same data block.  Most clients should not need to use this method.
  // REQUIRES: Finish(), Abandon() have not been called
  void Flush();

  // Return non-ok iff some error has been detected.
  Status status() const;

  // Finish building the table.  Stops using the file passed to the
  // constructor after this function returns.
  // REQUIRES: Finish(), Abandon() have not been called
  Status Finish();

  // Indicate that the contents of this builder should be abandoned.  Stops
  // using the file passed to the constructor after this function returns.
  // If the caller is not going to call Finish(), it must call Abandon()
  // before destroying this builder.
  // REQUIRES: Finish(), Abandon() have not been called
  void Abandon();

  // Number of calls to Add() so far.
  uint64_t NumEntries() const;

  // Size of the file generated so far.  If invoked after a successful
  // Finish() call, returns the size of the final generated file.
  uint64_t FileSize() const;

 private:
  bool ok() const { return status().ok(); }
  void WriteBlock(BlockBuilder* block, BlockHandle* handle);
  void WriteRawBlock(const Slice& data, CompressionType, BlockHandle* handle);

  struct Rep;
  Rep* rep_;
};

The method void Add(const Slice& key, const Slice& value) adds a key/value pair to the table. When the table is complete, Finish() is called to finish building the sstable. The code is as follows:

Status TableBuilder::Finish() {
  Rep* r = rep_;
  Flush();  // Write any buffered data block
  assert(!r->closed);
  r->closed = true;

  BlockHandle filter_block_handle, metaindex_block_handle, index_block_handle;

  // Write filter meta block
  if (ok() && r->filter_block != nullptr) {
    WriteRawBlock(r->filter_block->Finish(), kNoCompression,
                  &filter_block_handle);
  }

  // Write metaindex block
  if (ok()) {
    BlockBuilder meta_index_block(&r->options);
    if (r->filter_block != nullptr) {
      // Add mapping from "filter.Name" to location of filter data
      std::string key = "filter.";
      key.append(r->options.filter_policy->Name());
      std::string handle_encoding;
      filter_block_handle.EncodeTo(&handle_encoding);
      meta_index_block.Add(key, handle_encoding);
    }

    // TODO(postrelease): Add stats and other meta blocks
    WriteBlock(&meta_index_block, &metaindex_block_handle);
  }

  // Write index block
  if (ok()) {
    if (r->pending_index_entry) {
      // Emit the index entry for the last data block, using a short key
      // that is >= the last key in that block.
      r->options.comparator->FindShortSuccessor(&r->last_key);
      std::string handle_encoding;
      r->pending_handle.EncodeTo(&handle_encoding);
      r->index_block.Add(r->last_key, Slice(handle_encoding));
      r->pending_index_entry = false;
    }
    WriteBlock(&r->index_block, &index_block_handle);
  }

  // Write footer
  if (ok()) {
    Footer footer;
    footer.set_metaindex_handle(metaindex_block_handle);
    footer.set_index_handle(index_block_handle);
    std::string footer_encoding;
    footer.EncodeTo(&footer_encoding);
    r->status = r->file->Append(footer_encoding);
    if (r->status.ok()) {
      r->offset += footer_encoding.size();
    }
  }
  return r->status;
}

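As the code shows, Finish() first writes the remaining data block (data blocks are also written earlier, from Add(), whenever the current block reaches options.block_size, which defaults to 4KB). It then writes the filter meta block, followed by the metaindex block, then the index block, and finally the footer. At that point the sstable is complete.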
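
The write order is forced by the pointers between sections: each section that stores a handle has to be written after the section it points to. The toy model below shows only that dependency. It is not LevelDB code: ToyTableWriter, BuildToyTable, and the string payloads are hypothetical, and compression, checksums, and the real block formats are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct Handle {
  uint64_t offset;
  uint64_t size;
};

// Appends named sections to an in-memory "file" and records where each landed.
class ToyTableWriter {
 public:
  Handle Append(const std::string& name, const std::string& payload) {
    const Handle h{static_cast<uint64_t>(file_.size()),
                   static_cast<uint64_t>(payload.size())};
    file_.append(payload);
    sections_.emplace_back(name, h);
    return h;
  }

  const std::string& file() const { return file_; }
  const std::vector<std::pair<std::string, Handle>>& sections() const {
    return sections_;
  }

 private:
  std::string file_;
  std::vector<std::pair<std::string, Handle>> sections_;
};

// Mirrors the order used by Finish(): remaining data block, filter block,
// metaindex block, index block, footer.
ToyTableWriter BuildToyTable(const std::string& last_data_block,
                             const std::string& filter_block) {
  ToyTableWriter w;
  const Handle data = w.Append("data", last_data_block);
  const Handle filter = w.Append("filter", filter_block);
  // The metaindex stores the filter's handle, so it must follow the filter.
  const Handle metaindex =
      w.Append("metaindex", "filter.toy@" + std::to_string(filter.offset));
  // The index stores the data block handles, which are all known by now.
  const Handle index =
      w.Append("index", "data@" + std::to_string(data.offset));
  assert(index.offset == metaindex.offset + metaindex.size);  // contiguous
  // The footer stores the metaindex and index handles, so it comes last.
  w.Append("footer", std::to_string(metaindex.offset) + "," +
                         std::to_string(index.offset));
  return w;
}
```

Calling BuildToyTable("kv-pairs", "bloom") yields five sections in file order, with the filter section starting at offset 8 and the metaindex at offset 13.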
