LightDB: distributed physical backup script

Posted 紫无之紫



LightDB distributed physical backup script v1.0

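Starting with LightDB 22.4, LightDB supports distributed physical backup. The lt_distributed_probackup.py script backs up a whole distributed cluster with a single command, so lt_probackup no longer has to be run on each node separately. However, it does not guarantee that the data of the different nodes within one backup is consistent across nodes (a restore does not recover to the latest state, only to a particular backup).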

Overview

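Given the coordinator (CN) node's connection information, lt_distributed_probackup.py backs up the entire distributed cluster. If the original cluster fails, it can be restored in place directly; restoring to a new cluster requires specifying the new CN and DN (datanode) hosts.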

Backup and restore workflow

This workflow relies on continuous archiving, with backups taken in ARCHIVE WAL mode.

Environment

backup server: 192.168.247.126;
backup dir: /home/lightdb/backup

coordinator info: 192.168.247.127:54332
coordinator data dir: /home/lightdb/data
datanode1 info: 192.168.247.128:54332
datanode2 info: 192.168.247.129:54332

new cluster (cn;dn1;dn2): 192.168.247.130;192.168.247.131;192.168.247.132

Steps

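  1. Configure passwordless SSH between the hosts. Depending on how the script is used, passwordless SSH to the local machine itself may also be required, both via its own IP address and via 127.0.0.1.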

  2. Initialize the backup directory

    lt_distributed_probackup.py init -B /home/lightdb/backup
    
  3. Add the instance

    lt_distributed_probackup.py add-instance -B /home/lightdb/backup -D /home/lightdb/data --instance cn -h192.168.247.127 -p54332
    
  4. Configure continuous archiving on each node

    # cn
    archive_mode=on
    archive_command='/home/lightdb/lightdb-x/13.8-22.4/bin/lt_probackup archive-push -B /home/lightdb/backup --instance=cn --wal-file-name=%f --remote-host=192.168.247.126'
    # dn1: the --instance value is reported by add-instance above, or can be derived from the naming rule (see the notes below)
    archive_mode=on
    archive_command='/home/lightdb/lightdb-x/13.8-22.4/bin/lt_probackup archive-push -B /home/lightdb/backup --instance=cn_dn_1 --wal-file-name=%f --remote-host=192.168.247.126'
    
    
  5. Back up the distributed cluster

    lt_distributed_probackup.py backup -B /home/lightdb/backup -D /home/lightdb/data --instance cn -h192.168.247.127 -p54332 -b full --parallel-num=1
    
  6. Restore (to the original cluster or to a new cluster)

    # original cluster
    lt_distributed_probackup.py restore  -B /home/lightdb/backup --instance cn --parallel-num=1
    
    # new cluster: hosts are listed in the order cn;dn1;dn2, where dn1 is the DN with the smaller node_id (in pg_dist_node)
    lt_distributed_probackup.py restore  -B /home/lightdb/backup --instance cn --parallel-num=1 --remote-host='192.168.247.130;192.168.247.131;192.168.247.132'
    
  7. checkdb: verify the correctness of the database's physical files

    lt_distributed_probackup.py checkdb -B /home/lightdb/backup  --instance cn
    lt_distributed_probackup.py checkdb -h192.168.247.127 -p54332
    
  8. Set configuration options

    lt_distributed_probackup.py set-config -B /home/lightdb/backup  --instance cn  --archive-timeout=6min
    
  9. Show the configuration

    lt_distributed_probackup.py show-config -B /home/lightdb/backup  --instance cn
    
  10. Validate backups

    lt_distributed_probackup.py validate -B /home/lightdb/backup  --instance cn
    
  11. Show backups

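    # Shows the overall state of each distributed backup: Status is the overall status across nodes, ID is the distributed backup ID, and the remaining columns are attributes of the CN's backup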
    lt_distributed_probackup.py show -B /home/lightdb/backup  --instance cn
    [lightdb@lightdb bin]$ lt_distributed_probackup.py show -B /data/lightdb/chuhx/backup  --instance cn
    =================================================================================================================================================
    Instance  Version  ID                  Recovery Time           Mode  WAL Mode  TLI  Time    Data    WAL  Zratio  Start LSN   Stop LSN    Status  
    =================================================================================================================================================
    cn        13       DISTRIBUTED_RO1W1T  2023-01-06 14:15:40+08  PAGE  ARCHIVE   1/1   14s    60MB  512MB    1.00  A/20000028  A/4001CD38  ERROR   
    cn        13       DISTRIBUTED_RO03Y6  2023-01-05 15:10:59+08  PAGE  ARCHIVE   1/1    7s  3698kB  512MB    1.00  9/60000120  9/8000B3D0  OK      
    cn        13       DISTRIBUTED_RO03UR  2023-01-05 15:09:02+08  FULL  ARCHIVE   1/0   13s   938MB  512MB    1.00  8/A00000B8  8/C0064920  OK   
    
    # Use the --detail option to also show the attributes of the DN backups
    [lightdb@lightdb bin]$ lt_distributed_probackup.py show -B /data/lightdb/chuhx/backup  --instance cn --detail
    
    ********Summary*********
    BACKUP INSTANCE 'cn'
    ----------------
    =====================================================================================================================================
     Instance  Version  ID      Recovery Time           Mode  WAL Mode  TLI  Time    Data    WAL  Zratio  Start LSN   Stop LSN    Status 
    =====================================================================================================================================
     cn        13       RO1W1T  2023-01-06 14:15:40+08  PAGE  ARCHIVE   1/1   14s    60MB  512MB    1.00  A/20000028  A/4001CD38  OK     
     cn        13       RO03Y6  2023-01-05 15:10:59+08  PAGE  ARCHIVE   1/1    7s  3698kB  512MB    1.00  9/60000120  9/8000B3D0  OK     
     cn        13       RO03UR  2023-01-05 15:09:02+08  FULL  ARCHIVE   1/0   13s   937MB  512MB    1.00  8/A00000B8  8/C0064920  OK     
    
    BACKUP INSTANCE 'cn_dn_2'
    ----------------
    =====================================================================================================================================
     Instance  Version  ID      Recovery Time           Mode  WAL Mode  TLI  Time    Data    WAL  Zratio  Start LSN   Stop LSN    Status 
    =====================================================================================================================================
     cn_dn_2   13       RO1W27  ----                    PAGE  ARCHIVE   1/1    3s       0      0    1.00  8/E0000028  0/0         ERROR  
     cn_dn_2   13       RO03YD  2023-01-05 15:11:09+08  PAGE  ARCHIVE   1/1   10s  3706kB  512MB    1.00  8/20000028  8/4000C468  OK     
     cn_dn_2   13       RO03V5  2023-01-05 15:09:16+08  FULL  ARCHIVE   1/0   13s   937MB  512MB    1.00  7/600000F8  7/8001C8D0  OK     
    
    BACKUP INSTANCE 'cn_dn_3'
    ----------------
    =====================================================================================================================================
     Instance  Version  ID      Recovery Time           Mode  WAL Mode  TLI  Time    Data    WAL  Zratio  Start LSN   Stop LSN    Status 
    =====================================================================================================================================
     cn_dn_3   13       RO1W2A  2023-01-06 14:15:56+08  PAGE  ARCHIVE   1/1   13s    61MB  512MB    1.00  8/A0000028  8/C00158E8  OK     
     cn_dn_3   13       RO03YN  2023-01-05 15:11:17+08  PAGE  ARCHIVE   1/1    9s  3930kB  512MB    1.00  7/E0000028  8/124C0     OK     
     cn_dn_3   13       RO03VJ  2023-01-05 15:09:28+08  FULL  ARCHIVE   1/0   13s   937MB  512MB    1.00  7/200000F8  7/4001B920  OK 
    
  12. Delete backups

    lt_distributed_probackup.py delete -B /home/lightdb/backup  --instance cn  --delete-expired --retention-redundancy=1
    
  13. Delete the instance

    lt_distributed_probackup.py del-instance -B /home/lightdb/backup  --instance cn
    
  14. merge and set-backup act only on one specific backup of the given instance, exactly as in lt_probackup

    lt_distributed_probackup.py merge -B /home/lightdb/backup  --instance cn -i RMICH6 -j 2 --progress --no-validate --no-sync
    lt_distributed_probackup.py set-backup -B /home/lightdb/backup  --instance cn_dn_2  -i RMI6YR --note='cn_dn_2'
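
Step 1 above requires passwordless SSH between the hosts, including to the local machine. The POSIX shell sketch below checks that requirement before any backup command is run. It assumes plain OpenSSH, where `BatchMode=yes` makes `ssh` fail instead of prompting for a password. The function `check_passwordless_ssh` and the `SSH_CMD` override are hypothetical helpers, not part of LightDB.

```sh
#!/bin/sh
# Hypothetical helper: verify passwordless SSH to every host the backup
# scripts will contact (including the local IP and 127.0.0.1).
# SSH_CMD is a hypothetical override hook (defaults to ssh) so the loop can
# be dry-run without real hosts; it may also carry options, e.g. "ssh -p 2222".

check_passwordless_ssh() {
    ssh_cmd=${SSH_CMD:-ssh}
    failed=0
    for host in "$@"; do
        # BatchMode=yes: fail immediately instead of asking for a password.
        if $ssh_cmd -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failed=1
        fi
    done
    # Non-zero if at least one host needs a password or is unreachable.
    return "$failed"
}
```

For the example environment it might be invoked as `check_passwordless_ssh 127.0.0.1 192.168.247.126 192.168.247.127 192.168.247.128 192.168.247.129`, adding the local machine's own IP.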
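
The archiving settings in step 4 differ between nodes only in the `--instance` value. The sketch below prints them for one instance so the same template can be reused on every node. The function `archive_settings` and the variables `LT_BIN`, `BACKUP_DIR` and `BACKUP_HOST` are hypothetical; their defaults are simply the paths and addresses of the example environment. String values in a PostgreSQL-style configuration file must be single-quoted, which is why the command is wrapped in `'...'`.

```sh
#!/bin/sh
# Hypothetical helper: print archive settings for one backup instance.
# Defaults below are the example environment's values; adjust as needed.
LT_BIN=${LT_BIN:-/home/lightdb/lightdb-x/13.8-22.4/bin}
BACKUP_DIR=${BACKUP_DIR:-/home/lightdb/backup}
BACKUP_HOST=${BACKUP_HOST:-192.168.247.126}

archive_settings() {
    instance=$1   # e.g. cn, cn_dn_1 (see the instance naming note)
    printf '%s\n' "archive_mode = on"
    # %%f becomes a literal %f, which the server replaces with the WAL file name.
    printf "archive_command = '%s/lt_probackup archive-push -B %s --instance=%s --wal-file-name=%%f --remote-host=%s'\n" \
        "$LT_BIN" "$BACKUP_DIR" "$instance" "$BACKUP_HOST"
}
```

As in PostgreSQL, a change to `archive_mode` only takes effect after a server restart, while `archive_command` can be changed with a configuration reload.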
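
Restoring to a new cluster (step 6) needs the `--remote-host` list in the order cn;dn1;dn2, with the DNs ordered by their node_id in the original cluster's pg_dist_node. The sketch below builds that string from `node_id new_host` pairs on stdin, sorting numerically so that node 10 comes after node 9. `remote_host_list` is a hypothetical helper; which new host replaces which original DN is your own decision and is not looked up here.

```sh
#!/bin/sh
# Hypothetical helper: build the --remote-host value "cn;dn1;dn2;...".
# $1    : host of the new CN
# stdin : one "node_id new_host" pair per DN (node_id from the ORIGINAL
#         cluster's pg_dist_node); the mapping is supplied by the user.
remote_host_list() {
    cn_host=$1
    # Numeric sort on node_id, keep the host column, join with ';'.
    dn_hosts=$(sort -n -k1,1 | awk '{ print $2 }' | paste -s -d ';' -)
    if [ -n "$dn_hosts" ]; then
        printf '%s;%s\n' "$cn_host" "$dn_hosts"
    else
        printf '%s\n' "$cn_host"
    fi
}
```

The node_ids could be read on the original coordinator with a query such as `SELECT nodeid, nodename FROM pg_dist_node WHERE groupid <> 0 ORDER BY nodeid;`, assuming the Citus-style pg_dist_node columns (`nodeid`, `nodename`, and `groupid` 0 for the coordinator).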
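
In the summary view of step 11, Status is the overall state across nodes. The sample suggests that a distributed backup is marked ERROR as soon as one node's backup fails: RO1W1T on cn is OK, but RO1W27 on cn_dn_2 is ERROR, so DISTRIBUTED_RO1W1T is ERROR. The awk filter below lists such backups from the `show` output. It depends only on the column layout shown above (ID in the third column, Status in the last one); `failed_distributed_backups` is a hypothetical name.

```sh
#!/bin/sh
# Hypothetical helper: read `lt_distributed_probackup.py show` output on
# stdin and print "ID STATUS" for every distributed backup not marked OK.
# Assumes the summary layout: ID in column 3, Status in the last column.
failed_distributed_backups() {
    awk '$3 ~ /^DISTRIBUTED_/ && $NF != "OK" { print $3, $NF }'
}
```

Usage: `lt_distributed_probackup.py show -B /home/lightdb/backup --instance cn | failed_distributed_backups`.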
    

Notes

  1. A DN's instance name is generated from the instance name specified for the CN, in the format <cn_name>_dn_<id>, where id is the node's node_id in pg_dist_node (for example, cn_dn_2).

  2. When a new node is added, register it as an instance by running add-instance with --no_distribution.

  3. --backup-id accepts a distributed backup ID, such as DISTRIBUTED_RO03UR; distributed backup IDs are listed by the show command.

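  4. If the error below is reported, increase MaxStartups in /etc/ssh/sshd_config and reload sshd. MaxStartups limits the number of concurrent SSH connections that are still authenticating; it counts authentications, not logins, so sessions that have already logged in do not count. The value 10:30:100 means that once 10 connections are awaiting authentication, new ones are refused with a probability of 30%, and all are refused once 100 are waiting.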

    ERROR: Agent error: kex_exchange_identification: read: Connection reset by peer
    

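    The error occurs when the -j or --parallel-num option is set too high: many SSH connections are opened at the same time, so too many authentications run in parallel.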

  5. If the init command is run without permission to create the directory, it still reports success, but the directory is not actually created.
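
Note 1 reduces to a one-line naming rule. The sketch below derives DN instance names from the CN instance name and node_ids read from stdin; it reproduces the names seen in the sample output (cn_dn_2, cn_dn_3). Both function names are hypothetical.

```sh
#!/bin/sh
# Hypothetical helpers implementing note 1: <cn_name>_dn_<node_id>.
dn_instance_name() {
    printf '%s_dn_%s\n' "$1" "$2"
}

# $1: CN instance name; stdin: one pg_dist_node node_id per line.
dn_instance_names() {
    cn_instance=$1
    while read -r node_id; do
        if [ -n "$node_id" ]; then
            dn_instance_name "$cn_instance" "$node_id"
        fi
    done
}
```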
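
The MaxStartups rule in note 4 is sshd's random early drop: below start nothing is refused; from start on, the refusal probability begins at rate percent and rises linearly to 100% at full (10:30:100 is also OpenSSH's default). The sketch below computes that percentage with integer arithmetic, which can help judge whether a given -j or --parallel-num value is likely to hit the limit. `maxstartups_drop_pct` is a hypothetical helper, not an sshd tool.

```sh
#!/bin/sh
# Hypothetical helper: percent chance that sshd refuses the next connection.
# $1: MaxStartups value, "start:rate:full" (a plain number N also works:
#     0% below N, 100% at N)
# $2: number of connections currently waiting for authentication
maxstartups_drop_pct() {
    spec=$1
    pending=$2
    start=${spec%%:*}
    rest=${spec#*:}
    rate=${rest%%:*}
    full=${rest#*:}
    if [ "$pending" -lt "$start" ]; then
        echo 0
    elif [ "$pending" -ge "$full" ]; then
        echo 100
    else
        # Linear ramp from rate% at start to 100% at full.
        echo $(( rate + (100 - rate) * (pending - start) / (full - start) ))
    fi
}
```

For example, with 10:30:100 and 55 connections already waiting, `maxstartups_drop_pct 10:30:100 55` prints 65.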
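
Because init can report success without creating the directory (note 5), a wrapper can verify the result itself. The sketch below runs init and then checks that the directory exists and is writable. `init_backup_dir` and the `PROBACKUP` override are hypothetical; `PROBACKUP` defaults to lt_distributed_probackup.py and lets the wrapper be dry-run with `true`.

```sh
#!/bin/sh
# Hypothetical wrapper around "init" that does not trust its exit status
# alone. PROBACKUP is a hypothetical override hook for the script path.
init_backup_dir() {
    backup_dir=$1
    ${PROBACKUP:-lt_distributed_probackup.py} init -B "$backup_dir" || return 1
    # init may succeed without creating the directory (e.g. no permission).
    if [ ! -d "$backup_dir" ] || [ ! -w "$backup_dir" ]; then
        echo "init reported success but $backup_dir is missing or not writable" >&2
        return 1
    fi
}
```

Usage: `init_backup_dir /home/lightdb/backup || exit 1`.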
