04 - LevelDB Internals: Log

Posted by anda0109

A log file consists of a sequence of 32KB blocks. Each block contains a sequence of records:

block := record* trailer?
record :=
  checksum: uint32     // crc32c of type and data[] ; little-endian
  length: uint16       // little-endian
  type: uint8          // One of FULL, FIRST, MIDDLE, LAST
  data: uint8[length]
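
To make the header layout concrete, here is a minimal sketch that builds the 7-byte header for one physical record. The names Crc32c and EncodeHeader are hypothetical helpers written for this illustration and are not LevelDB's API. The CRC is a plain bitwise CRC-32C; LevelDB's writer is understood to additionally mask the CRC before storing it, and that step is left out here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Header layout from the grammar above: checksum (4) + length (2) + type (1).
constexpr std::size_t kHeaderSize = 4 + 2 + 1;

// Bitwise CRC-32C (Castagnoli) using the reflected polynomial 0x82F63B78.
// Slow but dependency-free; production code would use a table or hardware CRC.
uint32_t Crc32c(const std::string& bytes) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char c : bytes) {
    crc ^= c;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Hypothetical helper: builds the header for one physical record.
// The checksum covers the type byte followed by the data bytes.
std::array<uint8_t, kHeaderSize> EncodeHeader(uint8_t type,
                                              const std::string& data) {
  assert(data.size() <= 0xFFFF);  // length is stored as a uint16
  const uint32_t crc = Crc32c(std::string(1, static_cast<char>(type)) + data);
  const std::size_t n = data.size();
  std::array<uint8_t, kHeaderSize> h{};
  // checksum, little-endian
  h[0] = static_cast<uint8_t>(crc & 0xFF);
  h[1] = static_cast<uint8_t>((crc >> 8) & 0xFF);
  h[2] = static_cast<uint8_t>((crc >> 16) & 0xFF);
  h[3] = static_cast<uint8_t>((crc >> 24) & 0xFF);
  // length, little-endian
  h[4] = static_cast<uint8_t>(n & 0xFF);
  h[5] = static_cast<uint8_t>((n >> 8) & 0xFF);
  // record type
  h[6] = type;
  return h;
}
```

Calling EncodeHeader(1, "abc") yields a header whose last three bytes are 03 00 01: length 3 in little-endian order, then the FULL type.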

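If six or fewer bytes remain at the end of a block, they are filled with zeros as a trailer and no new record starts there, because the record header alone needs seven bytes. Readers skip these zero bytes.
However, if exactly seven bytes remain at the end of a block, a new record can still be added: it is marked FIRST and carries zero bytes of data (checksum + length + type is exactly seven bytes). The rest of the user data is written to the following blocks.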

The record types are defined as follows:

FULL == 1
FIRST == 2
MIDDLE == 3
LAST == 4

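A FULL record contains the entire contents of a user record.
FIRST, MIDDLE, and LAST are used when a user record has to be split into several records because of the block size limit. FIRST marks the first fragment of the user record, LAST marks the last fragment, and every fragment in between is marked MIDDLE.
So to read one complete user record, the reader either reads a single FULL record, or it reads a FIRST record, any MIDDLE records, and a LAST record and concatenates them.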
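
The reassembly rule can be sketched as follows. This is an illustration only: PhysicalRecord and Reassemble are hypothetical names, the input is assumed to be already parsed and checksum-verified, and stray MIDDLE or LAST fragments without a preceding FIRST are silently dropped, which is a simplification of real error reporting.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

enum RecordType : uint8_t { kFull = 1, kFirst = 2, kMiddle = 3, kLast = 4 };

// Hypothetical in-memory form of one physical record (header already parsed).
struct PhysicalRecord {
  RecordType type;
  std::string data;
};

// Concatenates fragments back into complete user records.
std::vector<std::string> Reassemble(const std::vector<PhysicalRecord>& records) {
  std::vector<std::string> out;
  std::string scratch;         // accumulates FIRST/MIDDLE/LAST payloads
  bool in_fragmented = false;  // true between a FIRST and its LAST
  for (const PhysicalRecord& r : records) {
    switch (r.type) {
      case kFull:
        // Complete on its own; any unfinished fragment run is discarded.
        scratch.clear();
        in_fragmented = false;
        out.push_back(r.data);
        break;
      case kFirst:
        scratch = r.data;
        in_fragmented = true;
        break;
      case kMiddle:
        if (in_fragmented) scratch += r.data;
        break;
      case kLast:
        if (in_fragmented) {
          scratch += r.data;
          out.push_back(scratch);
          scratch.clear();
          in_fragmented = false;
        }
        break;
    }
  }
  return out;
}
```

Feeding it FULL "a", FIRST "b", MIDDLE "c", LAST "d" produces the two user records "a" and "bcd".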

Example: consider user records of the following lengths:

A: length 1000
B: length 97270
C: length 8000

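A is stored as a single FULL record in the first block.

B is split into three fragments: the first fragment fills the rest of the first block, the second fragment fills the entire second block, and the third fragment occupies the front of the third block. That leaves exactly six bytes free in the third block; no new data is written there and they are filled with zeros.

C is stored as a single FULL record in the fourth block.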
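
The arithmetic behind this example can be checked with a small planner that applies the rules above: 32768-byte blocks, a 7-byte header per physical record, and zero padding when fewer than 7 bytes remain. Fragment and PlanFragments are hypothetical names for this sketch; it only computes the layout and writes no bytes. Block indexes are zero-based, so index 3 is the fourth block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockSize = 32768;
constexpr std::size_t kHeaderSize = 7;

enum RecordType : uint8_t { kFull = 1, kFirst = 2, kMiddle = 3, kLast = 4 };

// One planned physical record: which block it lands in and its payload size.
struct Fragment {
  std::size_t block;  // zero-based block index
  RecordType type;
  std::size_t length;  // payload bytes, header excluded
};

// Plans how user records of the given lengths are laid out across blocks.
// *padding receives the total number of zero trailer bytes.
std::vector<Fragment> PlanFragments(const std::vector<std::size_t>& lengths,
                                    std::size_t* padding) {
  std::vector<Fragment> plan;
  std::size_t block = 0, offset = 0;
  *padding = 0;
  for (std::size_t left : lengths) {
    bool begin = true;
    do {
      const std::size_t leftover = kBlockSize - offset;
      if (leftover < kHeaderSize) {  // no room for a header: pad and move on
        *padding += leftover;
        ++block;
        offset = 0;
      }
      const std::size_t avail = kBlockSize - offset - kHeaderSize;
      const std::size_t frag = std::min(left, avail);
      const bool end = (frag == left);
      const RecordType type = begin && end ? kFull
                              : begin      ? kFirst
                              : end        ? kLast
                                           : kMiddle;
      plan.push_back({block, type, frag});
      offset += kHeaderSize + frag;
      left -= frag;
      begin = false;
    } while (left > 0);
  }
  return plan;
}
```

For lengths 1000, 97270, and 8000, the plan is FULL 1000 in block 0, then FIRST 31754 in block 0, MIDDLE 32761 in block 1, LAST 32755 in block 2, six padding bytes, and FULL 8000 in block 3, which matches the description of A, B, and C.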


Some benefits over the recordio format:

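  1. We do not need any heuristics for resyncing: just go to the next block boundary and scan. If there is corruption, skip to the next block. As a side benefit, we do not get confused when part of the contents of one log file is embedded as a record inside another log file.

  2. Splitting at approximate boundaries (for example, for mapreduce) is simple: find the next block boundary and skip records until we hit a FULL or FIRST record.

  3. We do not need extra buffering for large records.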
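
Points 1 and 2 both reduce to two small steps, sketched below under the same 32768-byte block size. NextBlockBoundary and SkipToRecordStart are hypothetical helper names; the second one works on a list of already-parsed record types rather than on raw file bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t kBlockSize = 32768;

enum RecordType : uint8_t { kFull = 1, kFirst = 2, kMiddle = 3, kLast = 4 };

// Rounds an approximate file offset up to the next block boundary.
// An offset that is already on a boundary is returned unchanged.
uint64_t NextBlockBoundary(uint64_t approx_offset) {
  return (approx_offset + kBlockSize - 1) / kBlockSize * kBlockSize;
}

// Returns the index of the first record that can begin a user record
// (FULL or FIRST), or types.size() if there is none.
std::size_t SkipToRecordStart(const std::vector<RecordType>& types) {
  std::size_t i = 0;
  while (i < types.size() && types[i] != kFull && types[i] != kFirst) {
    ++i;
  }
  return i;
}
```

A reader that starts at an arbitrary offset would first seek to NextBlockBoundary(offset) and then discard leading MIDDLE and LAST records, since they belong to a user record that began before the split point.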

Some downsides compared to recordio format:

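  1. No packing of tiny records. This could be fixed by adding a new record type, so it is a shortcoming of the current implementation, not necessarily of the format.

  2. No compression. Again, this could be fixed by adding new record types.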

Code implementation - LogWriter

With the log file format understood, the code is easy to follow. The implementation is outlined briefly below:
log_writer.h

class Writer {
 public:
  // Create a writer that will append data to "*dest".
  // "*dest" must be initially empty.
  // "*dest" must remain live while this Writer is in use.
  explicit Writer(WritableFile* dest);

  // Create a writer that will append data to "*dest".
  // "*dest" must have initial length "dest_length".
  // "*dest" must remain live while this Writer is in use.
  Writer(WritableFile* dest, uint64_t dest_length);

  Writer(const Writer&) = delete;
  Writer& operator=(const Writer&) = delete;

  ~Writer();
  // Add one user record. This is the key function: it splits the record
  // internally and writes each piece to the file via EmitPhysicalRecord.
  Status AddRecord(const Slice& slice);

 private:
  // Write one physical record, i.e. one piece of a user record after
  // splitting, tagged FULL, FIRST, MIDDLE, or LAST.
  Status EmitPhysicalRecord(RecordType type, const char* ptr, size_t length);

  WritableFile* dest_;
  int block_offset_;  // Current offset in block

  // crc32c values for all supported record types.  These are
  // pre-computed to reduce the overhead of computing the crc of the
  // record type stored in the header.
  uint32_t type_crc_[kMaxRecordType + 1];
};

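AddRecord is the main function of the Writer class above. Following the log format described earlier, it either writes the user record as a single physical record or splits it at 32KB block boundaries into several, and it calls EmitPhysicalRecord to write each physical record to the file.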
log_reader.h

class Reader {
 public:
  // Interface for reporting errors.
  class Reporter {
   public:
    virtual ~Reporter();

    // Some corruption was detected.  "bytes" is the approximate number
    // of bytes dropped due to the corruption.
    virtual void Corruption(size_t bytes, const Status& status) = 0;
  };

  // Create a reader that will return log records from "*file".
  // "*file" must remain live while this Reader is in use.
  //
  // If "reporter" is non-null, it is notified whenever some data is
  // dropped due to a detected corruption.  "*reporter" must remain
  // live while this Reader is in use.
  //
  // If "checksum" is true, verify checksums if available.
  //
  // The Reader will start reading at the first record located at physical
  // position >= initial_offset within the file.
  Reader(SequentialFile* file, Reporter* reporter, bool checksum,
         uint64_t initial_offset);

  Reader(const Reader&) = delete;
  Reader& operator=(const Reader&) = delete;

  ~Reader();

  // Read the next record into *record.  Returns true if read
  // successfully, false if we hit end of the input.  May use
  // "*scratch" as temporary storage.  The contents filled in *record
  // will only be valid until the next mutating operation on this
  // reader or the next mutation to *scratch.
  bool ReadRecord(Slice* record, std::string* scratch);

  // Returns the physical offset of the last record returned by ReadRecord.
  //
  // Undefined before the first call to ReadRecord.
  uint64_t LastRecordOffset();

 private:
  // Extend record types with the following special values
  enum {
    kEof = kMaxRecordType + 1,
    // Returned whenever we find an invalid physical record.
    // Currently there are three situations in which this happens:
    // * The record has an invalid CRC (ReadPhysicalRecord reports a drop)
    // * The record is a 0-length record (No drop is reported)
    // * The record is below constructor's initial_offset (No drop is reported)
    kBadRecord = kMaxRecordType + 2
  };

  // Skips all blocks that are completely before "initial_offset_".
  //
  // Returns true on success. Handles reporting.
  bool SkipToInitialBlock();

  // Return type, or one of the preceding special values
  unsigned int ReadPhysicalRecord(Slice* result);

  // Reports dropped bytes to the reporter.
  // buffer_ must be updated to remove the dropped bytes prior to invocation.
  void ReportCorruption(uint64_t bytes, const char* reason);
  void ReportDrop(uint64_t bytes, const Status& reason);

  SequentialFile* const file_;
  Reporter* const reporter_;
  bool const checksum_;
  char* const backing_store_;
  Slice buffer_;
  bool eof_;  // Last Read() indicated EOF by returning < kBlockSize

  // Offset of the last record returned by ReadRecord.
  uint64_t last_record_offset_;
  // Offset of the first location past the end of buffer_.
  uint64_t end_of_buffer_offset_;

  // Offset at which to start looking for the first record to return
  uint64_t const initial_offset_;

  // True if we are resynchronizing after a seek (initial_offset_ > 0). In
  // particular, a run of kMiddleType and kLastType records can be silently
  // skipped in this mode
  bool resyncing_;
};

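In the code above, ReadRecord reads one complete user record. Internally it stitches physical records together: a user record is either a single FULL physical record or a sequence of FIRST/MIDDLE/LAST physical records. ReadPhysicalRecord reads a single physical record.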
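
As a rough illustration of what reading one physical record involves, the sketch below decodes the 7-byte header at a given position in a block and extracts the payload. ParsedRecord and DecodePhysicalRecord are hypothetical names, not the actual Reader internals, and checksum verification is deliberately omitted; the stored checksum is only returned to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Header layout: checksum (4) + length (2) + type (1), all little-endian.
constexpr std::size_t kHeaderSize = 4 + 2 + 1;

// Hypothetical result type for one decoded physical record.
struct ParsedRecord {
  uint32_t checksum;  // stored checksum, not verified in this sketch
  uint8_t type;       // FULL, FIRST, MIDDLE, or LAST
  std::string data;   // payload bytes
};

// Decodes the record that starts at `pos` in `block`.
// Returns false if the header or the payload would run past the block's end.
bool DecodePhysicalRecord(const std::string& block, std::size_t pos,
                          ParsedRecord* out) {
  assert(out != nullptr);
  if (pos > block.size() || block.size() - pos < kHeaderSize) return false;
  const unsigned char* p =
      reinterpret_cast<const unsigned char*>(block.data()) + pos;
  const uint32_t checksum = static_cast<uint32_t>(p[0]) |
                            (static_cast<uint32_t>(p[1]) << 8) |
                            (static_cast<uint32_t>(p[2]) << 16) |
                            (static_cast<uint32_t>(p[3]) << 24);
  const std::size_t length = static_cast<std::size_t>(p[4]) |
                             (static_cast<std::size_t>(p[5]) << 8);
  if (block.size() - pos - kHeaderSize < length) return false;
  out->checksum = checksum;
  out->type = p[6];
  out->data.assign(block, pos + kHeaderSize, length);
  return true;
}
```

A fuller reader would also recompute the CRC over the type byte and payload, compare it with the stored value, and report a drop on mismatch, as the comments on kBadRecord describe.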
