Performance Comparison of Flume and Kafka Connect
Posted by 一木呈
1 System Environment
| Hostname | CPU | NIC | Memory |
| Hadoop001 | Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz × 2 | Speed: 10000Mb/s | 16384 MB × 8 DIMMs, Speed: 2133 MHz |

Note: /sys/kernel/mm/transparent_hugepage/enabled is 'never'; /sys/kernel/mm/transparent_hugepage/defrag is 'never'.
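For reference, the transparent hugepage state quoted above can be read straight from sysfs:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ cat /sys/kernel/mm/transparent_hugepage/defrag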
2 Test Environment
2.1 Flume:
Flume 1.8.0
Kafka 0.11.0.1
$ more /opt/beh/core/flume/conf/flume-env.sh | grep JAVA_OPTS
export JAVA_OPTS="-Xms10240m -Xmx15000m -Dcom.sun.management.jmxremote"
2.2 Kafka-connect-fs:
Kafka-connect-fs 0.3
Kafka 0.11.0.1
$ more /opt/beh/core/kafka/bin/kafka-run-class.sh | grep KAFKA_HEAP_OPTS=
KAFKA_HEAP_OPTS="-Xmx15000M"
3 Test Data
The test data totals 680 MB across 10 files, each containing 3,000,000 records:
[hadoop@hadoop001 testfile]$ cat test0.txt | head -3
my test message 0
my test message 1
my test message 2
[hadoop@hadoop001 testfile]$ ll -h
total 677M
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test0.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test1.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test2.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test3.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test4.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test5.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test6.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test7.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:50 test8.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:50 test9.txt
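The original generator is not shown in the post; a minimal bash sketch that would produce files matching the pattern and size above (the file names and message format come from the listing, everything else is an assumption):

#!/bin/bash
# Hypothetical generator: 10 files x 3,000,000 numbered lines each (~68 MB per file)
dir=/home/hadoop/cdy/testfile
mkdir -p $dir
for i in `seq 0 9`
do
  seq 0 2999999 | sed 's/^/my test message /' > $dir/test$i.txt
done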
4 Test Procedure
4.1 Consuming with Flume
Create the topic that Flume produces to:
kafka-topics.sh --create --topic flume_log --zookeeper 172.16.40.116:2181,172.16.40.114:2181,172.16.40.115:2181/kafka011 --replication-factor 1 --partitions 3
Flume-ng configuration
[hadoop@hadoop001 workplace]$ more flume_log.conf
# Name the components on this agent
agent.sources = r1
agent.sinks = k1
agent.channels = c1
# Describe/configure the source
agent.sources.r1.type = spooldir
agent.sources.r1.spoolDir = /home/hadoop/cdy/testfile
agent.sources.r1.channels = c1
# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 2000
# Describe/configure the sink
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.topic = flume_log
agent.sinks.k1.channel = c1
agent.sinks.k1.kafka.bootstrap.servers = 172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092
agent.sinks.k1.kafka.flumeBatchSize = 20
agent.sinks.k1.kafka.producer.acks = 1
Flume test script
[hadoop@hadoop001 workplace]$ more flume_test.sh
#!/bin/bash
# Strip the .COMPLETED suffix Flume left on the files during a previous run
dir=/home/hadoop/cdy/testfile
for file in `ls $dir`
do
  new=`echo $file | sed 's/.COMPLETED//g'`
  mv $dir/$file $dir/$new
done
START=`date +%s`
# Start the Flume agent in the background
$FLUME_HOME/bin/flume-ng agent --conf conf --conf-file /home/hadoop/cdy/workplace/flume_log.conf --name agent -Dflume.root.logger=DEBUG,console >/dev/null 2>&1 &
# Poll until all 10 files carry the .COMPLETED marker
while true
do
  file=`ls /home/hadoop/cdy/testfile | grep -i COMPLETED | wc -l`
  if [ $file -eq 10 ]; then
    END=`date +%s`
    break
  fi
  sleep 0.1
done
echo $END $START | awk '{print $1-$2}'
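As a sanity check (not part of the original test), the end-of-log offsets of the topic can be summed to confirm that all 30,000,000 records arrived:

# Sum the latest offsets across the 3 partitions of flume_log
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092 --topic flume_log --time -1 | awk -F: '{sum+=$3} END {print sum}'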
4.2 Consuming with Kafka-connect-fs
Create the topic that kafka-connect-fs produces to:
kafka-topics.sh --create --topic connect-test --zookeeper 172.16.40.116:2181,172.16.40.114:2181,172.16.40.115:2181/kafka011 --replication-factor 1 --partitions 3
Kafka-connect-fs parallel configuration
[hadoop@hadoop001 workplace]$ more kafka-connect-fs.properties
name=FsSourceConnector
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
tasks.max=10
fs.uris=file:///home/hadoop/cdy/testfile/test0.txt,file:///home/hadoop/cdy/testfile/test1.txt,file:///home/hadoop/cdy/testfile/test2.txt,file:///home/hadoop/cdy/testfile/test3.txt,file:///home/hadoop/cdy/testfile/test4.txt,file:///home/hadoop/cdy/testfile/test5.txt,file:///home/hadoop/cdy/testfile/test6.txt,file:///home/hadoop/cdy/testfile/test7.txt,file:///home/hadoop/cdy/testfile/test8.txt,file:///home/hadoop/cdy/testfile/test9.txt
topic=connect-test
policy.class=com.github.mmolimar.kafka.connect.fs.policy.SimplePolicy
policy.recursive=true
policy.regexp=.*
file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader
Kafka-connect-fs serial configuration
[hadoop@hadoop001 workplace]$ more kafka-connect1-fs.properties
name=FsSourceConnector
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
tasks.max=1
fs.uris=file:///home/hadoop/cdy/testfile/test0.txt,file:///home/hadoop/cdy/testfile/test1.txt,file:///home/hadoop/cdy/testfile/test2.txt,file:///home/hadoop/cdy/testfile/test3.txt,file:///home/hadoop/cdy/testfile/test4.txt,file:///home/hadoop/cdy/testfile/test5.txt,file:///home/hadoop/cdy/testfile/test6.txt,file:///home/hadoop/cdy/testfile/test7.txt,file:///home/hadoop/cdy/testfile/test8.txt,file:///home/hadoop/cdy/testfile/test9.txt
topic=connect-test1
policy.class=com.github.mmolimar.kafka.connect.fs.policy.SimplePolicy
policy.recursive=true
policy.regexp=.*
file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader
Kafka-connect-fs test script
[hadoop@hadoop001 workplace]$ more kafka-connect-test.sh
#!/bin/bash
# Reset the offset file the standalone Connect worker writes to
rm -rf /home/hadoop/cdy/source-offset
touch /home/hadoop/cdy/source-offset
START=`date +%s`
# Strip the .COMPLETED suffix left on the files by a previous Flume run
dir=/home/hadoop/cdy/testfile
for file in `ls $dir`
do
  new=`echo $file | sed 's/.COMPLETED//g'`
  mv $dir/$file $dir/$new
done
# Start the standalone Connect worker in the background
connect-standalone.sh /opt/beh/core/kafka/config/connect-standalone.properties /home/hadoop/cdy/workplace/kafka-connect-fs.properties > ~/kafka.log 2>&1 &
# Poll the offset file until all 10 files report offset 3000000
while true
do
  sleep 0.1
  line=`strings /home/hadoop/cdy/source-offset | grep 3000000 | wc -l`
  if [ $line -eq 10 ]; then
    END=`date +%s`
    break
  fi
done
echo $END $START | awk '{print $1-$2}'
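The script polls /home/hadoop/cdy/source-offset because the standalone worker persists source offsets to the file named by offset.storage.file.filename. The actual connect-standalone.properties is not shown in the post; a minimal sketch consistent with the test script (the converter choices and flush interval are assumptions) might look like:

# Assumed worker config; only offset.storage.file.filename is implied by the test script
bootstrap.servers=172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
offset.storage.file.filename=/home/hadoop/cdy/source-offset
offset.flush.interval.ms=1000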
5 Test Results
Elapsed time per run (s):

| Tool | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
| Kafka-connect-fs (parallel) | 151 | 142 | 212 | 141 | 131 |
| Kafka-connect-fs (serial) | 321 | 346 | 340 | 355 | 356 |
| Flume | 555 | 554 | 551 | 556 | 485 |
(Bar chart of the test results omitted)
Each run moves 30,000,000 records (10 files × 3,000,000), so the average throughputs work out to:
connect parallel: ~193,000 records/s (155.4 s average)
connect serial: ~87,000 records/s (343.6 s average)
Flume: ~56,000 records/s (540.2 s average)
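These figures follow directly from the elapsed times in the table; for example, for the parallel runs:

# 30,000,000 records divided by the average of the five parallel run times
echo 151 142 212 141 131 | awk '{for(i=1;i<=NF;i++)s+=$i; avg=s/NF; printf "avg=%.1fs throughput=%.0f records/s\n", avg, 30000000/avg}'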
6 Analysis and Summary
Kafka-connect-fs
When consuming multiple specified files or directories, tasks.max sets the maximum degree of parallelism, so the data files are consumed in parallel, which is very efficient;
The worker maintains the file named by offset.storage.file.filename, which records the offset within each consumed file, so a service restarted after an interruption resumes from where it stopped;
Parallel consumption of large files puts heavy pressure on host resources and can easily run out of memory, so take care;
The serial mode is comparatively stable.
Flume
To consume multiple files or directories in parallel, multiple agent tasks must be built by hand; there is no setting for the number of parallel tasks;
Each fully consumed data file is marked complete (the spooling directory source renames it with a .COMPLETED suffix); if the task is interrupted, files that were only partially consumed are rolled back on restart and that batch of data is re-sent.