MatrixDB Is 25.8x Faster Than Hive and 8.8x Faster Than Impala+Kudu

Posted 2022-05-31 小徐xfg


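Overview

1. The evolution of big data and its pain points

2. Introduction to MatrixDB, a hyper-converged time-series database

3. MatrixDB is 25.8x faster than Hive on TPC-H

4. MatrixDB is 8.8x faster than Impala + Kudu on TPC-H

5. MatrixDB reaches one million TPS on TPC-B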

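The Evolution of Big Data

For a long time, "big data" was almost synonymous with Hadoop. Hadoop traces back to 2003-2004, when Google published the GFS and MapReduce papers (the BigTable paper followed in 2006). From there the lineage ran Lucene -> Nutch -> Hadoop, and in February 2006 it became a complete, standalone piece of software named Hadoop. Hive became a Hadoop subproject in September 2008, Impala joined the Hadoop ecosystem in October 2012, and Kudu, first developed by Cloudera, was donated to the Apache Software Foundation on December 3, 2015.

As the technology matured, more and more ecosystem software grew up around Hadoop, and all of it needed unified management software to keep it running. Cloudera, founded in 2008, was the first company to commercialize Hadoop. Its Cloudera Manager product is a platform for software distribution, management, and monitoring of clusters: it can deploy a Hadoop cluster within a few hours and monitor the cluster's nodes and services in real time.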

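However, as adoption grew, so did the problems users ran into. With Hive as the data warehouse:

(1) Hive does not support record-level insert, update, or delete operations.

(2) Hive does not support transactions. Because it lacks insert, update, and delete, it is used mainly for OLAP scenarios.

(3) Hive has high latency. It is generally used in T+1 or even T+N scenarios and is not suited to real-time analytics.

(4) Hive runs MapReduce jobs, which imposes many restrictions.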

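In November 2017, Apache announced that Impala had graduated to an Apache top-level project, and people gradually began using Impala to query HDFS. Later, Impala was paired with Kudu for data storage, which improved query speed. Kudu, however, has its own limitations:

(1) It does not support multi-row transactions.

(2) Its support for standard SQL is weak.

(3) It does not support data rollback.

(4) A table can have at most 300 columns.

(5) It supports few column data types.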

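Starting in November 2019, Cloudera began charging for its CDH product: all new versions, including updates and maintenance releases of current software, are available only behind a paywall. This makes software upgrades and maintenance considerably harder for a company. Major problems cannot be resolved promptly, and at times this seriously affects the company's ability to provide normal service. Meanwhile, the more data a company accumulates, the more problems it faces: data management becomes chaotic, data governance is time-consuming and laborious, and table queries get slower and slower, going from results being ready at T+1 to only 70% complete at T+2. An HTAP database is urgently needed to manage the data and improve query performance.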

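MatrixDB: A Hyper-Converged Time-Series Database

The rise of hyper-convergence

Over the past decade or so, both distributed systems and database technology have advanced considerably. Many products have built on their original strengths to extend the boundaries of what they can do, with good results. Out of this trend came the hyper-converged database, which combines the strengths of OLTP databases, OLAP databases, and big data/data lake systems into a new form of technology.

Transactional databases (OLTP): support online transaction workloads. A typical query touches few rows, data is frequently inserted, deleted, updated, and read, and the database aims for high concurrency and low latency.

Analytical databases (OLAP): support online analytics workloads. A typical query touches a large number of rows, the workload is mostly inserts and queries, cleaned data is rarely or never updated, and the database aims for performance on complex queries.

Specialized databases: support one particular kind of data processing. Typical products include time-series databases, graph databases, GIS databases, and text search products.

Big data/data lake: big data took off around 2005. The main product at first was Hadoop, which later spawned many other products.

Running these separate systems side by side brings problems of its own: data falls out of sync, efficiency is low, the technology stack is complex, and maintenance costs are high.

Hyper-converged databases are a natural direction for the technology. In 2011, 451 Research proposed NewSQL, a convergence of OLTP and big data. In 2015, Gartner proposed HTAP, a convergence of OLTP and OLAP. In 2020, Databricks proposed the Lakehouse, a convergence of the data warehouse and the data lake. Database development is moving from pairwise convergence toward hyper-convergence, and hyper-converged database technology is expected to be productized and commercialized in 2022.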

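About MatrixDB

MatrixDB is a hyper-converged time-series database built for storing massive volumes of IoT data. It has the following characteristics:

Converged high-performance time-series processing

Converged real-time analytics

Converged data types

Converged enterprise-grade product features

MatrixDB customers and customer testimonials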

MatrixDB vs. Hive: TPC-H Test

Software versions

Software	Version
MatrixDB	MatrixDB 4.4.7
Hive	3.1.0
Tez	0.9.1
YARN + MapReduce2	3.1.1
HDFS	3.1.1.3.1

Test table formats

Software	Table format
MatrixDB	AOCO
Hive	STORED AS TEXTFILE

Server configuration

No.	Architecture	OS	Hostname	CPU cores	Memory	Data disk	RAID	NIC
1	X86-64	CentOS 7.8	mdw	64c	256GB	SAS, 22T	Yes	10000Mb/s
2	X86-64	CentOS 7.8	sdw1	64c	256GB	SAS, 22T	Yes	10000Mb/s
3	X86-64	CentOS 7.8	sdw2	64c	256GB	SAS, 22T	Yes	10000Mb/s
4	X86-64	CentOS 7.8	sdw3	64c	256GB	SAS, 22T	Yes	10000Mb/s

Test results

MatrixDB is 25.8x faster than Hive

Query	MatrixDB execution time (seconds)	Hive TEXTFILE (Tez) execution time (seconds)	Hive ORC (Tez) execution time (seconds)	Hive PARQUET (Tez) execution time (seconds)
SQL1	120	828.75	860.33	527.24
SQL2	56	132.43	142.11	119.07
SQL3	152	1920.80	542.89	539.51
SQL4	62	2544.56	1023.46	1143.06
SQL5	233	1972.58	598.26	574.73
SQL6	4	515.38	225.18	160.37
SQL7	102	5053.18	600.23	568.55
SQL8	59	2016.20	462.6	414.04
SQL9	293	3047.84	2142.53	2323.94
SQL10	133	1679.50	365.12	393.45
SQL11	17	226.12	131.91	213.17
SQL12	47	1749.04	343.39	384.03
SQL13	55	852.39	674.96	529.53
SQL14	6	573.96	331.45	245.94
SQL15	29	1047.79	526.12	599.757
SQL16	23	592.49	704.07	570.75
SQL17	114	6994.56	6022.25	5977.9
SQL18	481	4195.88	1467.66	1641.53
SQL19	26	500.28	304.4	275.34
SQL20	43	2733.27	1596.62	1852.53
SQL21	233	19046.48	13307.95	13354.39
SQL22	18	1375.20	1220.82	1344.34
Total	2306	59598.68	33594.31	33753.167
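
The headline figure can be reproduced from the totals row by dividing each Hive total by the MatrixDB total. The following is a minimal sketch of that arithmetic; the totals are copied from the table above and the variable names are only illustrative.

```python
# Totals in seconds, copied from the last row of the results table above.
matrixdb_total = 2306.0
hive_totals = {
    "TEXTFILE": 59598.68,
    "ORC": 33594.31,
    "PARQUET": 33753.167,
}

# Speedup = Hive total time / MatrixDB total time.
speedups = {fmt: total / matrixdb_total for fmt, total in hive_totals.items()}

for fmt, ratio in speedups.items():
    print(f"Hive {fmt}: {ratio:.1f}x")
```

Computed this way, the 25.8x figure corresponds to the TEXTFILE column; against the ORC and PARQUET columns the same calculation gives roughly 14.6x.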

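Test steps

1. Generate 1024 GB of data with the TPC-H tool, load it into MatrixDB, and run the 22 SQL queries.

2. Generate 1024 GB of test data with the TPC-H tool for Hive and run the 22 SQL queries.

3. The 22 queries include multi-table joins, subqueries, GROUP BY aggregations, and so on.

MatrixDB test results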

mxadmin=# select * from tpch_reports.sql order by id;
 id  | description | tuples |    duration
-----+-------------+--------+-----------------
 101 | tpch.01     |      4 | 00:02:00.120885
 102 | tpch.02     |    100 | 00:00:56.5631
 103 | tpch.03     |     10 | 00:02:32.152663
 104 | tpch.04     |      5 | 00:01:02.62619
 105 | tpch.05     |      5 | 00:03:53.233652
 106 | tpch.06     |      1 | 00:00:04.4446
 107 | tpch.07     |      4 | 00:01:42.102137
 108 | tpch.08     |      2 | 00:00:59.59693
 109 | tpch.09     |    175 | 00:04:53.293484
 110 | tpch.10     |     20 | 00:02:13.133002
 111 | tpch.11     |      0 | 00:00:17.17094
 112 | tpch.12     |      2 | 00:00:47.47593
 113 | tpch.13     |     29 | 00:00:55.55813
 114 | tpch.14     |      1 | 00:00:06.6373
 115 | tpch.15     |      1 | 00:00:29.29671
 116 | tpch.16     |  27840 | 00:00:23.23941
 117 | tpch.17     |      1 | 00:01:54.114318
 118 | tpch.18     |    100 | 00:08:01.481215
 119 | tpch.19     |      1 | 00:00:26.26927
 120 | tpch.20     | 113661 | 00:00:43.43858
 121 | tpch.21     |    100 | 00:03:53.233897
 122 | tpch.22     |      7 | 00:00:18.18489
(22 rows)
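
The whole-second values in the results table appear to come from the HH:MM:SS part of the duration column. Below is a small sketch of that conversion. The helper name `hms_to_seconds` is hypothetical, and the digits after the dot are ignored on the assumption that only whole seconds were reported for this run, since the source does not explain the format of the fractional part.

```python
def hms_to_seconds(duration: str) -> int:
    """Convert the HH:MM:SS part of a duration string to whole seconds.

    Assumption: everything after the first dot is ignored, because the
    source does not document how that part is formatted.
    """
    hms = duration.split(".", 1)[0]
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Durations copied from the query output above.
print(hms_to_seconds("00:02:00.120885"))  # tpch.01
print(hms_to_seconds("00:08:01.481215"))  # tpch.18
```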

Hive TEXTFILE test results

***********************************************
*           PC-H benchmark on Hive            *
***********************************************
Running Hive from
Running Hadoop from
See benchmark.log for more details of query errors.
Executing Trial #1 of 1 trial(s)...
Running Hive query: tpch/q1_pricing_summary_report.hive
Time:828.75
Running Hive query: tpch/q2_minimum_cost_supplier.hive
Time:132.43
Running Hive query: tpch/q3_shipping_priority.hive
Time:1920.80
Running Hive query: tpch/q4_order_priority.hive
Time:2544.56
Running Hive query: tpch/q5_local_supplier_volume.hive
Time:1972.58
Running Hive query: tpch/q6_forecast_revenue_change.hive
Time:515.38
Running Hive query: tpch/q7_volume_shipping.hive
Time:5053.18
Running Hive query: tpch/q8_national_market_share.hive
Time:2016.20
Running Hive query: tpch/q9_product_type_profit.hive
Time:3047.84
Running Hive query: tpch/q10_returned_item.hive
Time:1679.50
Running Hive query: tpch/q11_important_stock.hive
Time:226.12
Running Hive query: tpch/q12_shipping.hive
Time:1749.04
Running Hive query: tpch/q13_customer_distribution.hive
Time:852.39
Running Hive query: tpch/q14_promotion_effect.hive
Time:573.96
Running Hive query: tpch/q15_top_supplier.hive
Time:1047.79
Running Hive query: tpch/q16_parts_supplier_relationship.hive
Time:592.49
Running Hive query: tpch/q17_small_quantity_order_revenue.hive
Time:6994.56
Running Hive query: tpch/q18_large_volume_customer.hive
Time:4195.88
Running Hive query: tpch/q19_discounted_revenue.hive
Time:500.28
Running Hive query: tpch/q20_potential_part_promotion.hive
Time:2733.27
Running Hive query: tpch/q21_suppliers_who_kept_orders_waiting.hive
Time:19046.48
Running Hive query: tpch/q22_global_sales_opportunity.hive
Time:1375.20
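
A log in this shape can be turned into per-query timings with a short script. The sketch below assumes each query name is followed by a `Time:<seconds>` entry, as in the log above. The function name `parse_times` and the three-entry sample are illustrative, not part of the original benchmark tooling.

```python
import re

# Three entries in the same format as the log above (values copied from it).
sample_log = """\
Running Hive query: tpch/q1_pricing_summary_report.hive
Time:828.75
Running Hive query: tpch/q2_minimum_cost_supplier.hive
Time:132.43
Running Hive query: tpch/q6_forecast_revenue_change.hive
Time:515.38
"""

# \s+ between the name and "Time:" tolerates either a newline or a run of spaces.
PATTERN = re.compile(r"Running Hive query:\s*(\S+)\s+Time:\s*([\d.]+)")


def parse_times(log_text: str) -> dict[str, float]:
    """Map each query script name to its reported time in seconds."""
    return {name: float(seconds) for name, seconds in PATTERN.findall(log_text)}


times = parse_times(sample_log)
total = sum(times.values())
print(f"{len(times)} queries, {total:.2f} s in total")
```

Run against the full log, the same function should yield the 22 values in the Hive TEXTFILE column of the results table.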

MatrixDB vs. Impala+Kudu: TPC-H Test

Software versions

Software	Version
MatrixDB	MatrixDB 4.4.7
Impala	3.2.0
Kudu	1.10.0
HDFS	3.1.1.3.1
Hive	3.1.0

Test table formats

Software	Table type
MatrixDB	AOCO
Impala	STORED AS KUDU

Server configuration

No.	Architecture	OS	Hostname	CPU cores	Memory	Data disk	RAID	NIC
1	X86-64	CentOS 7.8	mdw	64c	256GB	SAS, 22T	Yes	10000Mb/s
2	X86-64	CentOS 7.8	sdw1	64c	256GB	SAS, 22T	Yes	10000Mb/s
3	X86-64	CentOS 7.8	sdw2	64c	256GB	SAS, 22T	Yes	10000Mb/s
4	X86-64	CentOS 7.8	sdw3	64c	256GB	SAS, 22T	Yes	10000Mb/s

Test results

MatrixDB is 8.8x faster than Impala+Kudu

Query	MatrixDB execution time (seconds)	Impala + Kudu execution time (seconds)
SQL1	76.76	974.98
SQL2	31.31	159.64
SQL3	54.54	282.6
SQL4	17.17	1210.03
SQL5	39.39	480.66
SQL6	2.22	75.46
SQL7	26.26	412.55
SQL8	29.29	567.42
SQL9	596.59	2557.13
SQL10	236.23	408.3
SQL11	32.32	189.21
SQL12	191.19	249.67
SQL13	27.27	171.95
SQL14	6.6	120.94
SQL15	14.14	109.37
SQL16	25.25	858.21
SQL17	110.11	949.23
SQL18	187.18	962.18
SQL19	47.47	651.57
SQL20	28.28	504.72
SQL21	120.12	4932.08
SQL22	13.13	143.36
Total	1912.82	16971.26
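
The overall ratio hides a wide spread between individual queries. The sketch below recomputes the totals from the rows above and then looks at per-query ratios. The data is copied from the table; the variable names are illustrative.

```python
# (query, MatrixDB seconds, Impala + Kudu seconds), copied from the table above.
results = [
    ("SQL1", 76.76, 974.98), ("SQL2", 31.31, 159.64), ("SQL3", 54.54, 282.6),
    ("SQL4", 17.17, 1210.03), ("SQL5", 39.39, 480.66), ("SQL6", 2.22, 75.46),
    ("SQL7", 26.26, 412.55), ("SQL8", 29.29, 567.42), ("SQL9", 596.59, 2557.13),
    ("SQL10", 236.23, 408.3), ("SQL11", 32.32, 189.21), ("SQL12", 191.19, 249.67),
    ("SQL13", 27.27, 171.95), ("SQL14", 6.6, 120.94), ("SQL15", 14.14, 109.37),
    ("SQL16", 25.25, 858.21), ("SQL17", 110.11, 949.23), ("SQL18", 187.18, 962.18),
    ("SQL19", 47.47, 651.57), ("SQL20", 28.28, 504.72), ("SQL21", 120.12, 4932.08),
    ("SQL22", 13.13, 143.36),
]

matrixdb_total = sum(m for _, m, _ in results)
impala_total = sum(i for _, _, i in results)
overall = impala_total / matrixdb_total

# Per-query ratio: Impala + Kudu time divided by MatrixDB time.
per_query = {q: i / m for q, m, i in results}
smallest_gain = min(per_query, key=per_query.get)
largest_gain = max(per_query, key=per_query.get)

print(f"overall: {overall:.2f}x")
print(f"smallest gain: {smallest_gain} ({per_query[smallest_gain]:.1f}x)")
print(f"largest gain: {largest_gain} ({per_query[largest_gain]:.1f}x)")
```

On these numbers the per-query ratio ranges from about 1.3x (SQL12) to about 70x (SQL4), so the headline figure is best read as a ratio of total run times rather than a typical per-query speedup.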

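Test steps

1. Run the TPC-H benchmark against Impala + Kudu to measure query performance.

2. Sync the data used in the Impala + Kudu test into MatrixDB and run the TPC-H benchmark there.

MatrixDB test results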

mxadmin=# select * from tpch_reports.sql order by id;
 id  | description | tuples |    duration
-----+-------------+--------+-----------------
 101 | tpch.01     |      4 | 00:01:16.7637
 102 | tpch.02     |    100 | 00:00:31.31441
 103 | tpch.03     |     10 | 00:00:54.54473
 104 | tpch.04     |      5 | 00:00:17.17539
 105 | tpch.05     |      5 | 00:00:39.39999
 106 | tpch.06     |      1 | 00:00:02.228
 107 | tpch.07     |      4 | 00:00:26.26793
 108 | tpch.08     |      2 | 00:00:29.29319
 109 | tpch.09     |    175 | 00:09:56.596787
 110 | tpch.10     |     20 | 00:03:56.236166
 111 | tpch.11     |      0 | 00:00:32.32528
 112 | tpch.12     |      2 | 00:03:11.191703
 113 | tpch.13     |     12 | 00:00:27.27915
 114 | tpch.14     |      1 | 00:00:06.6014
 115 | tpch.15     |      1 | 00:00:14.14125
 116 | tpch.16     |  27840 | 00:00:25.25445
 117 | tpch.17     |      1 | 00:01:50.110183
 118 | tpch.18     |    100 | 00:03:07.187142
 119 | tpch.19     |      1 | 00:00:47.47913
 120 | tpch.20     |  79490 | 00:00:28.28746
 121 | tpch.21     |    100 | 00:02:00.120883
 122 | tpch.22     |      7 | 00:00:13.1314
(22 rows)

Impala+Kudu test results

# cat tpch_benchmark_impala.log
***********************************************
*          PC-H benchmark on Impala-shell     *
***********************************************
Running Impala-shell ....
See benchmark_impala.log for more details of query errors.
Executing Trial #1 of 1 trial(s)...
Running Impala-shell query: tpch_impala/textfile_to_impala.impala
Time:5719.93
Running Impala-shell query: tpch_impala/q1_pricing_summary_report.impala
Time:974.98
Running Impala-shell query: tpch_impala/q2_minimum_cost_supplier.impala
Time:159.64
Running Impala-shell query: tpch_impala/q3_shipping_priority.impala
Time:282.60
Running Impala-shell query: tpch_impala/q4_order_priority.impala
Time:1210.03
Running Impala-shell query: tpch_impala/q5_local_supplier_volume.impala
Time:480.66
Running Impala-shell query: tpch_impala/q6_forecast_revenue_change.impala
Time:75.46
Running Impala-shell query: tpch_impala/q7_volume_shipping.impala
Time:412.55
Running Impala-shell query: tpch_impala/q8_national_market_share.impala
Time:567.42
Running Impala-shell query: tpch_impala/q9_product_type_profit.impala
Time:2557.13
Running Impala-shell query: tpch_impala/q10_returned_item.impala
Time:408.30
Running Impala-shell query: tpch_impala/q11_important_stock.impala
Time:189.21
Running Impala-shell query: tpch_impala/q12_shipping.impala
Time:249.67
Running Impala-shell query: tpch_impala/q13_customer_distribution.impala
Time:171.95
Running Impala-shell query: tpch_impala/q14_promotion_effect.impala
Time:120.94
Running Impala-shell query: tpch_impala/q15_top_supplier.impala
Time:109.37
Running Impala-shell query: tpch_impala/q16_parts_supplier_relationship.impala
Time:858.21
Running Impala-shell query: tpch_impala/q17_small_quantity_order_revenue.impala
Time:949.23
Running Impala-shell query: tpch_impala/q18_large_volume_customer.impala
Time:962.18
Running Impala-shell query: tpch_impala/q19_discounted_revenue.impala
Time:651.5
Running Impala-shell query: tpch_impala/q20_potential_part_promotion.impala
Time:504.72
Running Impala-shell query: tpch_impala/q21_suppliers_who_kept_orders_waiting.impala
Time:4932.08
Running Impala-shell query: tpch_impala/q22_global_sales_opportunity.impala
Time:143.36
***********************************************

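MatrixDB Reaches One Million TPS on TPC-B

What is TPC-B?

TPC-B is one of the most widely used and most important benchmarks in the database industry. It is provided by the TPC (Transaction Processing Performance Council) and is mainly used to measure how many concurrent transactions a system can process per second. TPC-B is often used to stress-test the transaction performance of database systems. Its metric is the number of transactions processed per second, or TPS (Transactions per Second).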
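
To make the TPS metric concrete, here is a deliberately simplified, single-client sketch of a TPC-B-style transaction loop. It uses Python's built-in SQLite as a stand-in database. The table names, row counts, and the function `run_tpcb_like` are assumptions made for illustration; this is not the official TPC-B kit, and the number it prints says nothing about MatrixDB.

```python
import random
import sqlite3
import time


def run_tpcb_like(n_tx: int, n_accounts: int = 1000, seed: int = 0) -> float:
    """Run n_tx TPC-B-style transactions on in-memory SQLite and return TPS."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE branches (bid INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE tellers  (tid INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE accounts (aid INTEGER PRIMARY KEY, balance INTEGER);
        CREATE TABLE history  (aid INTEGER, tid INTEGER, bid INTEGER, delta INTEGER);
        INSERT INTO branches VALUES (1, 0);
        """
    )
    conn.executemany("INSERT INTO tellers VALUES (?, 0)", [(t,) for t in range(1, 11)])
    conn.executemany(
        "INSERT INTO accounts VALUES (?, 0)", [(a,) for a in range(1, n_accounts + 1)]
    )
    conn.commit()

    start = time.perf_counter()
    for _ in range(n_tx):
        aid = rng.randint(1, n_accounts)
        tid = rng.randint(1, 10)
        delta = rng.randint(-5000, 5000)
        # One transaction: update three balances and append a history row.
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE aid = ?", (delta, aid))
            conn.execute("UPDATE tellers SET balance = balance + ? WHERE tid = ?", (delta, tid))
            conn.execute("UPDATE branches SET balance = balance + ? WHERE bid = 1", (delta,))
            conn.execute("INSERT INTO history VALUES (?, ?, 1, ?)", (aid, tid, delta))
    elapsed = time.perf_counter() - start

    # Each committed transaction leaves exactly one history row.
    committed = conn.execute("SELECT COUNT(*) FROM history").fetchone()[0]
    conn.close()
    return committed / elapsed


print(f"{run_tpcb_like(2000):.0f} TPS (single client, in-memory SQLite)")
```

A real TPC-B run drives many concurrent clients against a much larger, scaled dataset, which is what a figure like one million TPS refers to; the loop above only shows how committed transactions and elapsed time combine into a TPS number.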

Test results