Flume vs. Kafka-connect Performance Comparison

Posted by 一木呈




1 System Environment

Hostname: Hadoop001
CPU:      Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz * 2
NIC:      Speed: 10000Mb/s
Memory:   Size: 16384 MB * 8, Speed: 2133 MHz

/sys/kernel/mm/transparent_hugepage/enabled is 'never'
/sys/kernel/mm/transparent_hugepage/defrag is 'never'
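
Transparent huge pages were disabled on the test host. A quick way to confirm the active setting on Linux (not shown in the original post) is:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ cat /sys/kernel/mm/transparent_hugepage/defrag
# the value in square brackets is the active one, e.g. "always madvise [never]"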

 



2 Test Environment

2.1 Flume

Flume 1.8.0

Kafka 0.11.0.1

$ more /opt/beh/core/flume/conf/flume-env.sh|grep JAVA_OPTS
export JAVA_OPTS="-Xms10240m -Xmx15000m -Dcom.sun.management.jmxremote"

2.2 Kafka-connect-fs:

Kafka-connect-fs 0.3

Kafka 0.11.0.1

$ more /opt/beh/core/kafka/bin/kafka-run-class.sh|grep KAFKA_HEAP_OPTS=
KAFKA_HEAP_OPTS="-Xmx15000M"
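
While a run is in progress, the JVM options actually applied to the Flume agent and the Connect standalone worker can be double-checked with jps (an optional sanity check, not part of the original setup):

$ jps -lvm | grep -Ei 'flume|connect'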



3 Test Data

The test data totals 680 MB; each of the 10 files contains 3,000,000 records:

[hadoop@hadoop001 testfile]$ cat test0.txt|head -3

my test message 0

my test message 1

my test message 3

 

 

[hadoop@hadoop001 testfile]$ ll -h

total 677M

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test0.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test1.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test2.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test3.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test4.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test5.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test6.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:49 test7.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:50 test8.txt

-rw-rw-r-- 1 kafka kafka 68M Jan  6 12:50 test9.txt
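
For reference, files of this shape can be generated with a one-line loop (a sketch of how such data might be produced; the original post does not show the generation step):

$ for i in $(seq 0 9); do seq 0 2999999 | sed 's/^/my test message /' > test$i.txt; done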



4 Test Procedure

4.1 Consuming with Flume

  • Create the topic that Flume produces to

kafka-topics.sh --create --topic flume_log --zookeeper 172.16.40.116:2181,172.16.40.114:2181,172.16.40.115:2181/kafka011 --replication-factor 1 --partitions 3
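
The topic layout can be verified right after creation (an optional check, not part of the original procedure):

kafka-topics.sh --describe --topic flume_log --zookeeper 172.16.40.116:2181,172.16.40.114:2181,172.16.40.115:2181/kafka011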

  • Flume-ng configuration

[hadoop@hadoop001 workplace]$ more flume_log.conf

#Name the components on this agent 

agent.sources= r1 

agent.sinks= k1 

agent.channels= c1 

  

#Describe/configure the source 

agent.sources.r1.type= spooldir 

agent.sources.r1.spoolDir=/home/hadoop/cdy/testfile

agent.sources.r1.channels= c1

  

#Use a channel which buffers events in memory 

agent.channels.c1.type= memory 

agent.channels.c1.capacity= 10000

agent.channels.c1.transactionCapacity= 2000

  

#Describe/configure the sink 

agent.sinks.k1.type=org.apache.flume.sink.kafka.KafkaSink

agent.sinks.k1.kafka.topic = flume_log

agent.sinks.k1.channel= c1 

agent.sinks.k1.kafka.bootstrap.servers =172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092

agent.sinks.k1.kafka.flumeBatchSize = 20

agent.sinks.k1.kafka.producer.acks = 1

  • Flume test script

[hadoop@hadoop001 workplace]$ more flume_test.sh
#!/bin/bash
dir=/home/hadoop/cdy/testfile

# restore the original file names by stripping the .COMPLETED suffix left by a previous run
for file in `ls $dir`
  do new=`echo $file|sed 's/.COMPLETED//g'`
     mv $dir/$file $dir/$new
done

START=`date +%s`
$FLUME_HOME/bin/flume-ng agent --conf conf --conf-file /home/hadoop/cdy/workplace/flume_log.conf --name agent -Dflume.root.logger=DEBUG,console 2>&1 >/dev/null &

# wait until the spooldir source has renamed all 10 files to *.COMPLETED
while true
  do
  file=`ls /home/hadoop/cdy/testfile|grep -i COMPLETED|wc -l`
  if [ $file -eq 10 ];then
       END=`date +%s`
       break
  fi
  sleep 0.1
  done
echo $END $START|awk '{print $1-$2}'
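
After a run, the number of messages that actually landed in the topic can be cross-checked with GetOffsetShell, which prints the latest offset of each of the 3 partitions (an optional verification step, not part of the original procedure):

kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list 172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092 --topic flume_log --time -1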

4.2 Consuming with Kafka-connect-fs

  • Create the topic used by kafka-connect-fs

kafka-topics.sh --create --topic connect-test --zookeeper 172.16.40.116:2181,172.16.40.114:2181,172.16.40.115:2181/kafka011 --replication-factor 1 --partitions 3

  • Kafka-connect-fs parallel-consumption configuration

[hadoop@hadoop001 workplace]$ more kafka-connect-fs.properties

name=FsSourceConnector

connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector

tasks.max=10

fs.uris=file:///home/hadoop/cdy/testfile/test0.txt,file:///home/hadoop/cdy/testfile/test1.txt,file:///home/hadoop/cdy/testfile/test2.txt,file:///home/hadoop/cdy/testfile/test3.txt,file:///home/hadoop/cdy/testfile/test4.txt,file:///home/hadoop/cdy/testfile/test5.txt,file:///home/hadoop/cdy/testfile/test6.txt,file:///home/hadoop/cdy/testfile/test7.txt,file:///home/hadoop/cdy/testfile/test8.txt,file:///home/hadoop/cdy/testfile/test9.txt

topic=connect-test

policy.class=com.github.mmolimar.kafka.connect.fs.policy.SimplePolicy

policy.recursive=true

policy.regexp=.*

file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader
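
The connector is launched through connect-standalone.sh with /opt/beh/core/kafka/config/connect-standalone.properties (see the test script below). That worker file is not reproduced in the post; since the test script polls /home/hadoop/cdy/source-offset, it presumably looks roughly like this (a sketch; the converter and flush settings are assumptions):

$ more /opt/beh/core/kafka/config/connect-standalone.properties
bootstrap.servers=172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
# standalone mode keeps source offsets in a local file; the test script greps this file
offset.storage.file.filename=/home/hadoop/cdy/source-offset
offset.flush.interval.ms=10000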

  • Kafka-connect-fs serial-consumption configuration

[hadoop@hadoop001 workplace]$ more kafka-connect1-fs.properties

name=FsSourceConnector

connector.class=com.github.mmolimar.kafka.connect.fs.FsSourceConnector

tasks.max=1

fs.uris=file:///home/hadoop/cdy/testfile/test0.txt,file:///home/hadoop/cdy/testfile/test1.txt,file:///home/hadoop/cdy/testfile/test2.txt,file:///home/hadoop/cdy/testfile/test3.txt,file:///home/hadoop/cdy/testfile/test4.txt,file:///home/hadoop/cdy/testfile/test5.txt,file:///home/hadoop/cdy/testfile/test6.txt,file:///home/hadoop/cdy/testfile/test7.txt,file:///home/hadoop/cdy/testfile/test8.txt,file:///home/hadoop/cdy/testfile/test9.txt

topic=connect-test1

policy.class=com.github.mmolimar.kafka.connect.fs.policy.SimplePolicy

policy.recursive=true

policy.regexp=.*

file_reader.class=com.github.mmolimar.kafka.connect.fs.file.reader.TextFileReader

  • Kafka-connect-fs test script

[hadoop@hadoop001 workplace]$ more kafka-connect-test.sh
#!/bin/bash
# reset the standalone worker's offset file so each run starts from scratch
rm -rf /home/hadoop/cdy/source-offset
touch /home/hadoop/cdy/source-offset

START=`date +%s`
dir=/home/hadoop/cdy/testfile

# restore the original file names by stripping the .COMPLETED suffix left by a previous Flume run
for file in `ls $dir`
  do new=`echo $file|sed 's/.COMPLETED//g'`
     mv $dir/$file $dir/$new
done

connect-standalone.sh /opt/beh/core/kafka/config/connect-standalone.properties /home/hadoop/cdy/workplace/kafka-connect-fs.properties 2>&1 > ~/kafka.log &

# wait until the offset file records offset 3000000 for all 10 files
while true
  do
  sleep 0.1
  line=`strings /home/hadoop/cdy/source-offset|grep 3000000|wc -l`
  if [ $line -eq 10 ];then
    END=`date +%s`
    break
  fi
  done
echo $END $START|awk '{print $1-$2}'

 



5 Test Results

Elapsed time per run (s)

Tool                         Run 1   Run 2   Run 3   Run 4   Run 5
Kafka-connect-fs (parallel)   151     142     212     141     131
Kafka-connect-fs (serial)     321     346     340     355     356
Flume                         555     554     551     556     485

 

Bar chart of the test result data (chart image not reproduced)

 

connect in parallel mode consumed on average about 193,000 records/s
connect in serial mode consumed on average about 17,000 records/s
Flume consumed on average about 11,000 records/s
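
As a sanity check on the parallel figure: 10 files x 3,000,000 records = 30,000,000 records, and the five parallel runs average 155.4 s, which works out to roughly 193,000 records/s. The same arithmetic in shell:

$ echo "151 142 212 141 131" | awk '{t=($1+$2+$3+$4+$5)/5; printf "avg=%.1fs rate=%d records/s\n", t, 30000000/t}'
avg=155.4s rate=193050 records/s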



6 Analysis and Conclusions

  • The Kafka-connect-fs approach

When consuming multiple files or directories, tasks.max sets the maximum parallelism, so the data files can be read in parallel, which is very efficient.
During consumption it maintains the file named by offset.storage.file.filename, which stores the offset of every file being consumed, so if the process is interrupted the restarted service resumes from where it stopped.
Parallel consumption of large files puts heavy demands on host resources and can easily lead to out-of-memory errors, so it needs care.
The serial mode is comparatively stable.

  • The Flume approach

To consume multiple files or directories in parallel, multiple agent tasks have to be built by hand; there is no setting for the number of parallel tasks (a multi-source sketch follows this list).

Every time a data file has been fully consumed, a completion marker is appended to the file name; if the task is interrupted, a partially consumed file is rolled back when the task restarts, and that batch of data is sent again.
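
For completeness, one way to get some file-level parallelism out of Flume without running several agents is to declare multiple spooldir sources on the same agent, each watching its own directory. This is a hypothetical sketch (the _part1/_part2 directories and the split of the test files are assumptions, not part of the original test):

[hadoop@hadoop001 workplace]$ more flume_log_parallel.conf
agent.sources = r1 r2
agent.sinks = k1
agent.channels = c1

# two spooling-directory sources, each reading its own slice of the test files
agent.sources.r1.type = spooldir
agent.sources.r1.spoolDir = /home/hadoop/cdy/testfile_part1
agent.sources.r1.channels = c1
agent.sources.r2.type = spooldir
agent.sources.r2.spoolDir = /home/hadoop/cdy/testfile_part2
agent.sources.r2.channels = c1

agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000
agent.channels.c1.transactionCapacity = 2000

agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.channel = c1
agent.sinks.k1.kafka.topic = flume_log
agent.sinks.k1.kafka.bootstrap.servers = 172.16.40.116:9092,172.16.40.114:9092,172.16.40.115:9092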

