How to replicate, for DR purposes, a Greenplum DB to another data centre?


【Posted】: 2014-09-22 07:52:22 【Question】:

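We are planning to build a large Greenplum database (growing from 10 TB to 100 TB over the first 18 months). Traditional backup and restore tools will not help, because we have to meet a 24-hour RPO/RTO. Is there a way to replicate the database to our DR site without using block replication (i.e. placing the segments on a SAN and mirroring it)?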
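To put the 24-hour target in perspective, here is a rough back-of-the-envelope sketch of the sustained throughput a full copy would need. It assumes decimal units (1 TB = 1000 GB) and that the whole 24 hours is available for moving data, which is optimistic; the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600


def required_throughput_gb_per_s(size_tb, window_hours):
    """Sustained rate (decimal GB/s) needed to move size_tb within window_hours."""
    return size_tb * 1000 / (window_hours * SECONDS_PER_HOUR)


# The two database sizes mentioned in the question, against a 24-hour window.
for size_tb in (10, 100):
    rate = required_throughput_gb_per_s(size_tb, 24)
    print(f"{size_tb} TB in 24 h needs {rate:.2f} GB/s ({rate * 8:.1f} Gbit/s)")
```

At 100 TB a full copy needs roughly 1.16 GB/s sustained for the whole day, which is why anything that moves only changed data is attractive here.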


【Answer 1】:

You have a number of options:

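    1. Dual ETL. Replicate the input data and run the same ETL on both sites. Synchronize the two sites with a backup-restore every week or so.
    2. Backup-restore. A simple backup-restore might not be efficient enough. But if you use DataDomain, it can deduplicate at the block level and store only the changed blocks. It can offload the deduplication work to run on the Greenplum cluster (DDBoost). When replicating to a remote site it also copies only the changed blocks, which greatly reduces replication time. In my experience, if a clean backup to DD takes 12 hours, a subsequent DDBoost backup takes 4 hours, plus another 4 hours to replicate the data.
    3. Custom solution. I know of a case where replication to the remote site was built into the ETL process. For each ETL job you know which tables have changed; they are added to a replication queue and moved to the remote site using external tables. Analysts work in a dedicated sandbox, and that sandbox is replicated daily with backup-restore.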
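The replication queue in the custom solution could look roughly like the sketch below. This is only an illustration, not the implementation used in that case: `ReplicationQueue`, `replicate` and the `copy_table` callback are hypothetical names, and the actual transfer (for example an export through external tables) is left to the callback.

```python
from collections import OrderedDict


class ReplicationQueue:
    """Ordered, de-duplicated queue of tables changed by ETL jobs."""

    def __init__(self):
        self._pending = OrderedDict()

    def mark_changed(self, table):
        # Re-marking a table that is already queued keeps its original position.
        self._pending[table] = None

    def drain(self):
        # Return the queued tables in order and empty the queue.
        tables = list(self._pending)
        self._pending.clear()
        return tables


def replicate(queue, copy_table):
    """Call copy_table (hypothetical transfer callback) once per changed table."""
    copied = []
    for table in queue.drain():
        copy_table(table)
        copied.append(table)
    return copied
```

Each ETL job would call `mark_changed` for the tables it touched, and a separate step would call `replicate` once the load has finished, so a table changed by several jobs is only shipped once per cycle.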

At the moment Greenplum does not have a built-in WAN replication solution, so these are pretty much all the options to choose from.


【Answer 2】:

I did some investigation on this. Here are my results.

I. Using EMC Symmetrix VMAX SAN (Storage Area Network) mirroring and SRDF (Symmetrix Remote Data Facility) remote replication software

Please refer to h12079-vnx-replication-technologies-overview-wp.pdf for details.

Preconditions
1. An EMC Symmetrix VMAX SAN is installed
2. SRDF software is available

Advantages of the 3 different modes
1. Symmetrix Remote Data Facility / Synchronous (SRDF/S)
    Provides a no-data-loss solution (zero RPO).
    No server resource contention for the remote mirroring operation.
    Can restore the primary site with minimal impact on application performance at the remote site.
    Enterprise disaster recovery solution.
    Supports replicating over IP and Fibre Channel protocols.

2. Symmetrix Remote Data Facility / Asynchronous (SRDF/A)
    Extended-distance data replication that supports longer distances than SRDF/S.
    Does not affect host performance, because host activity is decoupled from the remote copy process.
    Efficient link utilization that results in lower link-bandwidth requirements.
    Facilities to invoke failover and restore operations.
    Supports replicating over IP and Fibre Channel protocols.

3. Symmetrix Remote Data Facility / Data Mobility (SRDF/DM)

II. Using Backup Tools

Please refer to http://gpdb.docs.pivotal.io/4350/admin_guide/managing/backup.html for details.

Parallel Backup
Use the parallel backup utility gpcrondump.

Non-Parallel Backup
Not recommended. It is intended for migrating PostgreSQL databases to Greenplum.

Parallel Restore
Supports restoring to a system with the same configuration as the source Greenplum database, or to one with a different configuration.

Non-Parallel Restore
pg_restore requires modifying the CREATE TABLE statements to add a DISTRIBUTED BY clause.
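
As an illustration of that rewriting step for the non-parallel restore, here is a minimal sketch that appends a DISTRIBUTED BY clause to CREATE TABLE statements in a plain-SQL dump. It assumes the dump has already been converted to SQL text and that you supply the distribution columns yourself; `add_distribution_clause` and the `keys` mapping are hypothetical, and the regular expression is deliberately simple (a string literal containing ");" inside a column default would confuse it).

```python
import re

# Matches "CREATE TABLE <name> ( ... );" across lines, up to the closing ");".
CREATE_TABLE = re.compile(
    r"(CREATE TABLE\s+(?P<name>[\w.\"]+)\s*\(.*?\))\s*;",
    re.DOTALL | re.IGNORECASE,
)


def add_distribution_clause(sql, keys):
    """Append DISTRIBUTED BY (...) to each CREATE TABLE listed in keys.

    keys maps a table name (as written in the dump, without quotes)
    to its list of distribution columns. Other tables are left unchanged.
    """
    def rewrite(match):
        name = match.group("name").replace('"', "")
        cols = keys.get(name)
        if not cols:
            return match.group(0)
        return f"{match.group(1)} DISTRIBUTED BY ({', '.join(cols)});"

    return CREATE_TABLE.sub(rewrite, sql)
```

For example, calling it with `{"public.orders": ["id"]}` turns the closing `);` of the `public.orders` definition into `) DISTRIBUTED BY (id);` and leaves every other statement in the dump untouched.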

 
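Disadvantages
1. The backup process takes an EXCLUSIVE lock on the pg_class table. While that lock is held, pg_class can only be read, so statements that change the catalog (for example CREATE, ALTER or DROP TABLE) have to wait.
2. After releasing the EXCLUSIVE lock on pg_class, it takes an ACCESS SHARE lock on all the tables. That lock conflicts only with ACCESS EXCLUSIVE, so statements such as DROP TABLE, TRUNCATE and most forms of ALTER TABLE are blocked while the backup runs; ordinary reads and writes are not.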


III. Replay logged DML and DDL statements
PostgreSQL has a parameter that logs all SQL statements to a file.
In data/postgresql.conf, set log_statement to 'all'.
Write an application that extracts the DML and DDL statements from the log and runs them on the DR servers.
Advantages
1. Easy to configure and maintain
2. No decrease in performance
Disadvantages
1. Needs additional storage for the statement log
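
The extraction step of such an application might look like the sketch below. It assumes a plain-text log in which each statement appears on a single line as `LOG:  statement: ...`; the real log format depends on the logging configuration (Greenplum may write CSV logs, and long statements can span several lines), so treat the pattern and the `extract_replayable` name as placeholders.

```python
import re

# Assumed line shape: "<timestamp> LOG:  statement: <sql>"
STATEMENT = re.compile(r"LOG:\s+statement:\s+(?P<sql>.*)$")

# Statement types worth replaying on the DR side; plain SELECTs are skipped.
REPLAY_KEYWORDS = ("INSERT", "UPDATE", "DELETE", "TRUNCATE", "CREATE", "ALTER", "DROP")


def extract_replayable(log_lines):
    """Return the DML/DDL statements found in log_lines, in log order."""
    statements = []
    for line in log_lines:
        match = STATEMENT.search(line.rstrip("\n"))
        if not match:
            continue
        sql = match.group("sql").strip()
        first_word = sql.split(None, 1)[0].upper() if sql else ""
        if first_word in REPLAY_KEYWORDS:
            statements.append(sql)
    return statements
```

The keyword list is a guess at what needs replaying; bulk loads through COPY, for instance, would need separate handling because the data itself is not in the statement log.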

IV. Parse the PostgreSQL WAL
Parse the WAL to extract the DDL statements, then run the generated DDL statements on the DR Greenplum cluster.
Advantages
1. Does not impact the source Greenplum database
Disadvantages
1. You have to write code to parse the WAL
2. The WAL is not easy to parse, and there is not much documentation about its format
3. I don't know whether this is feasible for Greenplum, since it is a solution for PostgreSQL

