ClickHouse Performance Test
Posted 走在大数据架构路上的笔
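“ Try to learn about things beyond the scope of what you already know; it may bring you some unexpected growth. ”
Preparation
Creating the table in the cluster environment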
CREATE TABLE test.ontime ON CLUSTER yc_clickhouse \
( Year UInt16, \
Quarter UInt8, \
Month UInt8, \
DayofMonth UInt8, \
DayOfWeek UInt8, \
FlightDate Date, \
UniqueCarrier FixedString(7), \
AirlineID Int32, \
Carrier FixedString(2), \
TailNum String, \
FlightNum String, \
OriginAirportID Int32, \
OriginAirportSeqID Int32, \
OriginCityMarketID Int32, \
Origin FixedString(5), \
OriginCityName String, \
OriginState FixedString(2), \
OriginStateFips String, \
OriginStateName String, \
OriginWac Int32, \
DestAirportID Int32, \
DestAirportSeqID Int32, \
DestCityMarketID Int32, \
Dest FixedString(5), \
DestCityName String, \
DestState FixedString(2), \
DestStateFips String, \
DestStateName String, \
DestWac Int32, \
CRSDepTime Int32, \
DepTime Int32, \
DepDelay Int32, \
DepDelayMinutes Int32, \
DepDel15 Int32, \
DepartureDelayGroups String, \
DepTimeBlk String, \
TaxiOut Int32, \
WheelsOff Int32, \
WheelsOn Int32, \
TaxiIn Int32, \
CRSArrTime Int32, \
ArrTime Int32, \
ArrDelay Int32, \
ArrDelayMinutes Int32, \
ArrDel15 Int32, \
ArrivalDelayGroups Int32, \
ArrTimeBlk String, \
Cancelled UInt8, \
CancellationCode FixedString(1), \
Diverted UInt8, \
CRSElapsedTime Int32, \
ActualElapsedTime Int32, \
AirTime Int32, \
Flights Int32, \
Distance Int32, \
DistanceGroup UInt8, \
CarrierDelay Int32, \
WeatherDelay Int32, \
NASDelay Int32, \
SecurityDelay Int32, \
LateAircraftDelay Int32, \
FirstDepTime String, \
TotalAddGTime String, \
LongestAddGTime String, \
DivAirportLandings String, \
DivReachedDest String, \
DivActualElapsedTime String, \
DivArrDelay String, \
DivDistance String, \
Div1Airport String, \
Div1AirportID Int32, \
Div1AirportSeqID Int32, \
Div1WheelsOn String, \
Div1TotalGTime String, \
Div1LongestGTime String, \
Div1WheelsOff String, \
Div1TailNum String, \
Div2Airport String, \
Div2AirportID Int32, \
Div2AirportSeqID Int32, \
Div2WheelsOn String, \
Div2TotalGTime String, \
Div2LongestGTime String, \
Div2WheelsOff String, \
Div2TailNum String, \
Div3Airport String, \
Div3AirportID Int32, \
Div3AirportSeqID Int32, \
Div3WheelsOn String, \
Div3TotalGTime String, \
Div3LongestGTime String, \
Div3WheelsOff String, \
Div3TailNum String, \
Div4Airport String, \
Div4AirportID Int32, \
Div4AirportSeqID Int32, \
Div4WheelsOn String, \
Div4TotalGTime String, \
Div4LongestGTime String, \
Div4WheelsOff String, \
Div4TailNum String, \
Div5Airport String, \
Div5AirportID Int32, \
Div5AirportSeqID Int32, \
Div5WheelsOn String, \
Div5TotalGTime String, \
Div5LongestGTime String, \
Div5WheelsOff String, \
Div5TailNum String \
) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192);
Distributed table
CREATE TABLE test.ontimetest ON CLUSTER yc_clickhouse AS test.ontime ENGINE = Distributed(yc_clickhouse, test, ontime, rand());
Downloading the data
wget http://transtats.bts.gov/PREZIP/On_Time_Reporting_Carrier_On_Time_Performance_1987_present_1987_10.zip
# Unpack the archive; the loop below expects the extracted CSV file
unzip On_Time_Reporting_Carrier_On_Time_Performance_1987_present_1987_10.zip
# Only the 1987-10 file is downloaded. Test data for every other year and month
# is produced by copying that file and rewriting the year.
src='On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_1987_10.csv'
for s in $(seq 1987 2017)
do
for m in $(seq 1 12)
do
dst="On_Time_Reporting_Carrier_On_Time_Performance_(${s}_present)_${s}_${m}.csv"
# Skip the source file itself (cp refuses to copy a file onto itself)
[ "$dst" = "$src" ] && continue
cp "$src" "$dst"
# Replace every occurrence of 1987 with the target year; the month values are left unchanged
sed -i "s/1987/${s}/g" "$dst"
done
done
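As a quick sanity check on the copy loop above, the following sketch lists the file names that the loop is expected to produce and counts them. The helper names `ontime_csv_name` and `expected_ontime_files` are made up for this illustration and are not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper: print the file name the copy loop creates for a given year and month.
ontime_csv_name() {
    printf 'On_Time_Reporting_Carrier_On_Time_Performance_(%s_present)_%s_%s.csv\n' "$1" "$1" "$2"
}

# Hypothetical helper: list every expected file name (years 1987..2017, months 1..12).
expected_ontime_files() {
    for year in $(seq 1987 2017); do
        for month in $(seq 1 12); do
            ontime_csv_name "$year" "$month"
        done
    done
}

# Print the number of expected files (31 years x 12 months).
expected_ontime_files | wc -l
```

Comparing this count with `ls *.csv | wc -l` in the data directory shows whether every copy was created before the import starts.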
Script for importing the data into ClickHouse
for i in *.csv
do
echo $i
cat $i | sed 's/\.00//g' | clickhouse-client --input_format_with_names_use_header=0 --host=10.218.1.22 --query="INSERT INTO test.ontime FORMAT CSVWithNames";
done
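The `sed 's/\.00//g'` step in the import pipeline strips a trailing `.00` so that values such as `12.00` can be parsed into the integer columns. The sketch below applies the same substitution to a made-up sample row (not taken from the real data set); the function name `strip_zero_decimals` is an illustrative assumption.

```shell
#!/bin/sh
# Same substitution as in the import pipeline: delete every ".00" occurrence.
strip_zero_decimals() {
    sed 's/\.00//g'
}

# Made-up sample row, for illustration only.
printf '%s\n' '1987,4,10,12.00,"AA",-3.00' | strip_zero_decimals

# The pattern is not anchored to the end of a field, so ".00" is also removed
# from the middle of a value such as 1.005.
printf '%s\n' '1.005' | strip_zero_decimals
```

Because the substitution matches anywhere in the line, it is only safe when no field contains `.00` as part of a longer decimal; the post does not say whether such values occur in this data set.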
Configuration
Hardware: 3 machines, each with 64 GB of memory and a 1.8 TB disk
Data volume: 67 GB, 170 million rows, 109 fields per row
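To put the data volume in perspective, the average row size can be estimated from the figures above. This is a rough sketch that assumes "67G" means 67 x 10^9 bytes of raw data; the post does not say whether the figure is decimal or binary, or whether it refers to the CSV files or the on-disk size. The function name `avg_bytes_per_row` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: average bytes per row, given total bytes and row count.
avg_bytes_per_row() {
    awk -v bytes="$1" -v rows="$2" 'BEGIN { printf "%.0f\n", bytes / rows }'
}

# Assumption: 67G = 67 * 10^9 bytes; 170 million rows as stated in the post.
avg_bytes_per_row 67000000000 170000000
```

Dividing the result by the 109 fields gives only a few bytes per field on average, which is consistent with a table made up mostly of small integers and short strings.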
count performance test
select count(1) from (select ArrTime,CRSArrTime,FlightDate from test.ontimetest limit 100000000) t1 ;
10 million rows: 0.071 s, 21.47 MB
100 million rows: 0.335 s, 201.67 MB
170 million rows: 14.826 s, 347.23 MB
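The timings above are easier to compare as throughput. The following sketch converts each measurement into rows per second; the function name `rows_per_sec` is an illustrative assumption, and the input numbers are the ones reported in the post.

```shell
#!/bin/sh
# Hypothetical helper: rows processed per second, rounded to an integer.
rows_per_sec() {
    awk -v rows="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", rows / secs }'
}

# Row counts and elapsed times as reported for the count test.
for pair in "10000000 0.071" "100000000 0.335" "170000000 14.826"; do
    set -- $pair
    printf '%s rows in %s s: %s rows/s\n' "$1" "$2" "$(rows_per_sec "$1" "$2")"
done
```

Seen this way, the 170 million row run is more than twenty times slower per row than the 100 million row run, so the slowdown is far from linear in the row count.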
group by performance test
SELECT OriginCityName,DestCityName,count(*) AS flights FROM (select OriginCityName,DestCityName from ontimetest limit 1000000) t1 GROUP BY OriginCityName, DestCityName ORDER BY flights DESC LIMIT 20 ;
1 million rows: 0.256 s, 53.30 MB
10 million rows: 1.419 s, 455.16 MB
100 million rows: 31.068 s, 4.45 GB
170 million rows: 36.267 s, 7.71 GB
join performance test
select * from (select ArrTime,CRSArrTime,FlightDate from ontimetest) t1 ALL INNER JOIN (select ArrTime,CRSArrTime,FlightDate from ontimetest limit 10000000) t2 on t1.ArrTime=t2.CRSArrTime limit 100 ;
10 million join 100 thousand: 0.412 s, 22.94 MB
100 million join 100 thousand: 0.433 s, 23.59 MB
170 million join 100 thousand: 0.413 s, 24.25 MB
10 million join 1 million: 4.414 s, 34.16 MB
100 million join 1 million: 4.080 s, 33.69 MB
170 million join 1 million: 4.349 s, 34.16 MB
10 million join 10 million: 61.153 s, 114.60 MB. Memory limit (for query) exceeded: would use 10.40 GiB (attempt to allocate chunk of 4295356872 bytes), maximum: 9.31 GiB.
100 million join 10 million: 58.289 s, 114.60 MB. Memory limit (for query) exceeded: would use 10.40 GiB (attempt to allocate chunk of 4295356872 bytes), maximum: 9.31 GiB.
170 million join 10 million: 61.385 s, 115.12 MB. Memory limit (for query) exceeded: would use 10.40 GiB (attempt to allocate chunk of 4295356872 bytes), maximum: 9.31 GiB.
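The error messages above mix raw byte counts with GiB values. The sketch below converts bytes to GiB so the numbers can be compared directly; the function name `bytes_to_gib` is a hypothetical helper introduced for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a byte count to GiB (2^30 bytes), two decimals.
bytes_to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.2f\n", bytes / (1024 * 1024 * 1024) }'
}

# Chunk size taken from the error message.
bytes_to_gib 4295356872

# 10^10 bytes, for comparison with the reported maximum.
bytes_to_gib 10000000000
```

The failed allocation is a single chunk of about 4 GiB, and the reported maximum of 9.31 GiB equals 10,000,000,000 bytes. That would be consistent with a per-query `max_memory_usage` limit of 10^10 bytes, but this is an assumption: the post does not show the server settings.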
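Performance test summary:
count: very fast, but memory consumption is fairly high.
group by: very fast, but memory consumption is very high.
join: comparatively weak (very fast compared with Spark, but still slow for a real-time query system), with high memory and CPU consumption.
Other observations:
Concurrency is low: the official site recommends about 100 queries per second, so ClickHouse is not suited to high-concurrency, business-facing queries.
Columnar storage: values within a column have very similar formats, so they compress well and reduce I/O (Hadoop uses row storage, so loading data is time-consuming). Queries that read only a few columns are fast, but queries that touch many fields are slower.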