Performance Comparison of Flume and Kafka Connect
Posted by 一木呈
1 System Environment
| Hostname | CPU | NIC | Memory |
| Hadoop001 | Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz × 2 | Speed: 10000Mb/s | 16384 MB × 8 DIMMs, Speed: 2133 MHz |

Note: /sys/kernel/mm/transparent_hugepage/enabled is 'never'; /sys/kernel/mm/transparent_hugepage/defrag is 'never'.
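For reference, the transparent hugepage state quoted above can be read straight from sysfs:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ cat /sys/kernel/mm/transparent_hugepage/defrag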
2 Test Environment
2.1 Flume:
Flume 1.8.0
Kafka 0.11.0.1
$ more /opt/beh/core/flume/conf/flume-env.sh | grep JAVA_OPTS
export JAVA_OPTS="-Xms10240m -Xmx15000m -Dcom.sun.management.jmxremote"
2.2 Kafka-connect-fs:
Kafka-connect-fs 0.3
Kafka 0.11.0.1
$ more /opt/beh/core/kafka/bin/kafka-run-class.sh | grep KAFKA_HEAP_OPTS=
KAFKA_HEAP_OPTS="-Xmx15000M"
3 Test Data
The test data totals 680 MB across 10 files, each containing 3,000,000 records:
[hadoop@hadoop001 testfile]$ cat test0.txt | head -3
my test message 0
my test message 1
my test message 2
[hadoop@hadoop001 testfile]$ ll -h
total 677M
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test0.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test1.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test2.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test3.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test4.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test5.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test6.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:49 test7.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:50 test8.txt
-rw-rw-r-- 1 kafka kafka 68M Jan 6 12:50 test9.txt
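The original generator is not shown in the post; a minimal bash sketch that would produce files matching the pattern and size above (the file names and message format come from the listing, everything else is an assumption):

#!/bin/bash
# Hypothetical generator: 10 files x 3,000,000 numbered lines each (~68 MB per file)
dir=/home/hadoop/cdy/testfile
mkdir -p $dir
for i in `seq 0 9`
do
  seq 0 2999999 | sed 's/^/my test message /' > $dir/test$i.txt
done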
4 Test Procedure
4.1 Consuming with Flume
Create the topic that Flume produces to:
kafka-topics.sh --create --topic flume_log --zookeeper 172.16.40.116:2181,172.16.40.114:2181,172.16.40.115:2181/kafka011 --replication-factor 1 --partitions 3
Flume-ng configuration
[hadoop@hadoop001 workplace]$ more flume_log.conf
# Name the components on this agent
agent.sources = r1
agent.sinks = k1
agent.channels = c1
# Describe/configure the source
agent.sources.r1.type = spooldir
agent.sources.r1.spoolDir = /home/hadoop/cdy/testfile
agent.sources.r1.channels = c1
# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 2000
# Describe/configure the sink
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.topic = flume_log
agent.sinks.k1.channel = c1
agent.sinks.k1.kafka.bootstrap.servers = 172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092
agent.sinks.k1.kafka.flumeBatchSize = 20
agent.sinks.k1.kafka.producer.acks = 1
Flume test script
[hadoop@hadoop001 workplace]$ more flume_test.sh
#!/bin/bash
# Strip the .COMPLETED suffix Flume left on the files during a previous run
dir=/home/hadoop/cdy/testfile
for file in `ls $dir`
do
  new=`echo $file | sed 's/.COMPLETED//g'`
  mv $dir/$file $dir/$new
done
START=`date +%s`
# Start the Flume agent in the background
$FLUME_HOME/bin/flume-ng agent --conf conf --conf-file /home/hadoop/cdy/workplace/flume_log.conf --name agent -Dflume.root.logger=DEBUG,console >/dev/null 2>&1 &
# Poll until all 10 files carry the .COMPLETED marker
while true
do
  file=`ls /home/hadoop/cdy/testfile | grep -i COMPLETED | wc -l`
  if [ $file -eq 10 ]; then
    END=`date +%s`
    break
  fi
  sleep 0.1
done
echo $END $START | awk '{print $1-$2}'
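As a sanity check (not part of the original test), the end-of-log offsets of the topic can be summed to confirm that all 30,000,000 records arrived:

# Sum the latest offsets across the 3 partitions of flume_log
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092 --topic flume_log --time -1 | awk -F: '{sum+=$3} END {print sum}'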
4.2 Consuming with Kafka-connect-fs
Create the topic that kafka-connect-fs produces to:
kafka-topics.sh --create --topic connect-test --zookeeper 172.16.40.116:2181,172.16.40.114:2181,172.16.40.115:2181/kafka011 --replication-factor 1 --partitions 3
Kafka-connect-fs parallel configuration
[hadoop@hadoop001 workplace]$ more kafka-connect-fs.properties
name=FsSourceConnector
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
tasks.max=10
fs.uris=file:///home/hadoop/cdy/testfile/test0.txt,file:///home/hadoop/cdy/testfile/test1.txt,file:///home/hadoop/cdy/testfile/test2.txt,file:///home/hadoop/cdy/testfile/test3.txt,file:///home/hadoop/cdy/testfile/test4.txt,file:///home/hadoop/cdy/testfile/test5.txt,file:///home/hadoop/cdy/testfile/test6.txt,file:///home/hadoop/cdy/testfile/test7.txt,file:///home/hadoop/cdy/testfile/test8.txt,file:///home/hadoop/cdy/testfile/test9.txt
topic=connect-test
policy.class=com.github.mmolimar.kafka.connect.fs.policy.SimplePolicy
policy.recursive=true
policy.regexp=.*
file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader
Kafka-connect-fs serial configuration
[hadoop@hadoop001 workplace]$ more kafka-connect1-fs.properties
name=FsSourceConnector
connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector
tasks.max=1
fs.uris=file:///home/hadoop/cdy/testfile/test0.txt,file:///home/hadoop/cdy/testfile/test1.txt,file:///home/hadoop/cdy/testfile/test2.txt,file:///home/hadoop/cdy/testfile/test3.txt,file:///home/hadoop/cdy/testfile/test4.txt,file:///home/hadoop/cdy/testfile/test5.txt,file:///home/hadoop/cdy/testfile/test6.txt,file:///home/hadoop/cdy/testfile/test7.txt,file:///home/hadoop/cdy/testfile/test8.txt,file:///home/hadoop/cdy/testfile/test9.txt
topic=connect-test1
policy.class=com.github.mmolimar.kafka.connect.fs.policy.SimplePolicy
policy.recursive=true
policy.regexp=.*
file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader
Kafka-connect-fs test script
[hadoop@hadoop001 workplace]$ more kafka-connect-test.sh
#!/bin/bash
# Reset the offset file the standalone Connect worker writes to
rm -rf /home/hadoop/cdy/source-offset
touch /home/hadoop/cdy/source-offset
START=`date +%s`
# Strip the .COMPLETED suffix left on the files by a previous Flume run
dir=/home/hadoop/cdy/testfile
for file in `ls $dir`
do
  new=`echo $file | sed 's/.COMPLETED//g'`
  mv $dir/$file $dir/$new
done
# Start the standalone Connect worker in the background
connect-standalone.sh /opt/beh/core/kafka/config/connect-standalone.properties /home/hadoop/cdy/workplace/kafka-connect-fs.properties > ~/kafka.log 2>&1 &
# Poll the offset file until all 10 files report offset 3000000
while true
do
  sleep 0.1
  line=`strings /home/hadoop/cdy/source-offset | grep 3000000 | wc -l`
  if [ $line -eq 10 ]; then
    END=`date +%s`
    break
  fi
done
echo $END $START | awk '{print $1-$2}'
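The script polls /home/hadoop/cdy/source-offset because the standalone worker persists source offsets to the file named by offset.storage.file.filename. The actual connect-standalone.properties is not shown in the post; a minimal sketch consistent with the test script (the converter choices and flush interval are assumptions) might look like:

# Assumed worker config; only offset.storage.file.filename is implied by the test script
bootstrap.servers=172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
offset.storage.file.filename=/home/hadoop/cdy/source-offset
offset.flush.interval.ms=1000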
5 Test Results
Elapsed time per run (s):

| Tool | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
| Kafka-connect-fs (parallel) | 151 | 142 | 212 | 141 | 131 |
| Kafka-connect-fs (serial) | 321 | 346 | 340 | 355 | 356 |
| Flume | 555 | 554 | 551 | 556 | 485 |
(Bar chart of the test results omitted)
Each run moves 30,000,000 records (10 files × 3,000,000), so the average throughputs work out to:
connect parallel: ~193,000 records/s (155.4 s average)
connect serial: ~87,000 records/s (343.6 s average)
Flume: ~56,000 records/s (540.2 s average)
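These figures follow directly from the elapsed times in the table; for example, for the parallel runs:

# 30,000,000 records divided by the average of the five parallel run times
echo 151 142 212 141 131 | awk '{for(i=1;i<=NF;i++)s+=$i; avg=s/NF; printf "avg=%.1fs throughput=%.0f records/s\n", avg, 30000000/avg}'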
6 Analysis and Summary
Kafka-connect-fs
When consuming multiple specified files or directories, tasks.max sets the maximum degree of parallelism, so the data files are consumed in parallel, which is very efficient;
The worker maintains the file named by offset.storage.file.filename, which records the offset within each consumed file, so a service restarted after an interruption resumes from where it stopped;
Parallel consumption of large files puts heavy pressure on host resources and can easily run out of memory, so take care;
The serial mode is comparatively stable.
Flume
To consume multiple files or directories in parallel, multiple agent tasks must be built by hand; there is no setting for the number of parallel tasks;
Each fully consumed data file is marked complete (the spooling directory source renames it with a .COMPLETED suffix); if the task is interrupted, files that were only partially consumed are rolled back on restart and that batch of data is re-sent.