Flume 1.8 + Hadoop 2.0

Posted 2021-04-13 by Spark推荐系统 (Spark Recommender System)

Environment

Hadoop: three nodes (master, slave1, slave2)

Flume: apache-flume-1.8.0

Goal: collect Tomcat logs from the agent server into HDFS on the Hadoop cluster

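Deploying Flume

Package: apache-flume-1.8.0-bin.tar.gz

Extract it (the paths below assume it is unpacked under /usr/local/src):

tar -zxvf apache-flume-1.8.0-bin.tar.gz

cd /usr/local/src/apache-flume-1.8.0-bin/conf

cp flume-env.sh.template flume-env.sh

Edit flume-env.sh and set JAVA_HOME (no space after the equals sign):

export JAVA_HOME=/usr/local/src/jdk1.8.0_31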
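The copy-and-edit step can also be done without opening an editor. The following is only a sketch: CONF_DIR and JDK_DIR are variable names made up for this example, and CONF_DIR defaults to a scratch directory so that trying the snippet does not touch a real installation.

```shell
# Sketch: create flume-env.sh from the template and set JAVA_HOME non-interactively.
# CONF_DIR and JDK_DIR are illustrative names; CONF_DIR defaults to a scratch directory.
# Set CONF_DIR=/usr/local/src/apache-flume-1.8.0-bin/conf to apply it to the real install.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
JDK_DIR="${JDK_DIR:-/usr/local/src/jdk1.8.0_31}"

# The scratch directory has no template, so create an empty stand-in there.
[ -f "$CONF_DIR/flume-env.sh.template" ] || : > "$CONF_DIR/flume-env.sh.template"

cp "$CONF_DIR/flume-env.sh.template" "$CONF_DIR/flume-env.sh"

# No space after '=': "export JAVA_HOME= /path" would leave JAVA_HOME empty.
echo "export JAVA_HOME=$JDK_DIR" >> "$CONF_DIR/flume-env.sh"

# Show the line that was written.
grep '^export JAVA_HOME=' "$CONF_DIR/flume-env.sh"
```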

Configuring Flume to write to HDFS

Add a configuration file:

cd $FLUME_HOME/conf

vim single_agent.conf

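# Name the components on this agent (the names are arbitrary)

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

# Source type. TAILDIR tails the files matched below as they grow; it requires Flume 1.7 or later.
a1.sources.r1.type = TAILDIR

# File groups to monitor (here a single group, f1)
a1.sources.r1.filegroups = f1

# Files that make up group f1. Only the file name part is treated as a regular expression.
a1.sources.r1.filegroups.f1 = /tmp/liuye/.*log.*

# Whether to add a header holding the path of the tailed file
a1.sources.r1.fileHeader = false

# Attach interceptor i1 to the source
a1.sources.r1.interceptors = i1

# Interceptor type: regex_filter filters events by matching the body against a regular expression
a1.sources.r1.interceptors.i1.type = regex_filter

# The regular expression to match. With the default excludeEvents = false, only matching events are kept.
a1.sources.r1.interceptors.i1.regex = .*images.*

# Optional timestamp interceptor (i2 would also have to be added to the interceptors list)
#a1.sources.r1.interceptors.i2.type = timestamp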
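To get a feel for what the regex_filter interceptor lets through, the same pattern can be tried with grep on a few lines. This is only an approximation: the sample lines are invented for illustration, and grep -E stands in for the Java regular expression engine that Flume uses (for a pattern this simple the two behave the same).

```shell
# Sample lines standing in for Tomcat log events (made up for illustration).
sample_events='GET /images/logo.png 200
GET /index.html 200
GET /static/images/banner.jpg 304'

# Default behaviour (excludeEvents = false): only events matching the regex are kept.
kept=$(printf '%s\n' "$sample_events" | grep -E '.*images.*')

# With excludeEvents = true the matching events would be dropped instead.
kept_when_excluding=$(printf '%s\n' "$sample_events" | grep -Ev '.*images.*')

printf 'kept by default:\n%s\n\n' "$kept"
printf 'kept with excludeEvents = true:\n%s\n' "$kept_when_excluding"
```

With the configuration above, only lines containing "images" would reach HDFS. If the intent is to discard image requests, setting a1.sources.r1.interceptors.i1.excludeEvents = true is the property to look at.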

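# Describe the sink

# Sink type: write events to HDFS
a1.sinks.k1.type = hdfs

# Target directory in HDFS; %y-%m-%d expands to the two-digit year, month and day
a1.sinks.k1.hdfs.path = hdfs://master:9000/icrm/%y-%m-%d/

# Prefix of the created file names, taken from the event header named fileName.
# Nothing in the source section sets that header, so it has to be supplied
# (for example via a1.sources.r1.headers.f1.fileName) for the prefix to be non-empty.
a1.sinks.k1.hdfs.filePrefix = %{fileName}

# Alternative suffix when the files are stored LZO-compressed
#a1.sinks.k1.hdfs.fileSuffix = .lzo

# Suffix of the created file names
a1.sinks.k1.hdfs.fileSuffix = .log

# Round the timestamp down before it is used in hdfs.path
a1.sinks.k1.hdfs.round = true

# Roll the temporary file into the final file when it reaches this size in bytes; 0 disables size-based rolling
a1.sinks.k1.hdfs.rollSize = 0

# Output file format: DataStream writes the events as plain, uncompressed data
a1.sinks.k1.hdfs.fileType = DataStream

# Record format for sequence files: Text or Writable (the default)
a1.sinks.k1.hdfs.writeFormat = Text

# Roll the temporary file into the final file after this many events; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount = 0

# Interval, in seconds, after which the HDFS sink rolls the temporary file into the final file.
# Rolling means that the sink renames the temporary file to its final name and opens a new temporary file for writing.
a1.sinks.k1.hdfs.rollInterval = 600

# Use the agent's local time, not a timestamp header on the event, when expanding %y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Minimum number of replicas per HDFS block. Set to 1 here; otherwise the sink produces many small files.
a1.sinks.k1.hdfs.minBlockReplicas = 1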
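With size-based and count-based rolling both disabled, rollInterval alone decides how many files are produced. A rough upper bound can be worked out with shell arithmetic; the variable names below are made up for this example, and the real number will be lower whenever no events arrive, because the sink only opens a file when it has something to write.

```shell
# hdfs.rollInterval from the configuration, in seconds.
roll_interval=600

seconds_per_day=$((24 * 60 * 60))

# One file is closed per interval, so this is the most files a single
# file prefix can produce in one day's directory.
max_files_per_day=$((seconds_per_day / roll_interval))

echo "at most about $max_files_per_day files per day for each file prefix"
```

If that is too many small files for the cluster, a longer rollInterval or a non-zero rollSize are the settings to adjust.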

# Use a channel which buffers events in memory

# Channel type. memory gives high throughput, but buffered events can be lost.
# The file channel is the alternative; it uses a write-ahead log (WAL).
a1.channels.c1.type = memory

# Maximum number of events held in the channel
a1.channels.c1.capacity = 1000

# Maximum number of events per transaction, whether taken from a source or given to a sink
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1
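Flume loads the agent configuration as a Java properties file, in which # starts a comment only at the beginning of a line; a # placed after a value becomes part of the value. A quick check for such lines might look like the sketch below. The function name check_inline_comments and the sample file are made up for this example, and a value that legitimately contains # would be reported as well.

```shell
# Scratch file with one clean property and one carrying a trailing comment (sample data only).
sample_conf=$(mktemp)
cat > "$sample_conf" <<'EOF'
# comment on its own line: fine
a1.sources.r1.type = TAILDIR
a1.sinks.k1.type = hdfs # sink type
EOF

# Print, with line numbers, every property line whose value contains '#'.
# Lines that are comments from the first non-blank character are skipped.
check_inline_comments() {
    grep -n -v '^[[:space:]]*#' "$1" | grep '=.*#'
}

if check_inline_comments "$sample_conf"; then
    echo "move the comments listed above onto their own lines"
else
    echo "no trailing comments found"
fi
```

To check the real file, call check_inline_comments "$FLUME_HOME/conf/single_agent.conf" instead of passing the sample.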

Running and testing

flume-ng agent --conf conf --conf-file ./conf/single_agent.conf --name a1 -Dflume.root.logger=INFO,console >> ./logs/flume.log 2>&1 &
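The start command redirects into ./logs/flume.log, which fails if the logs directory does not exist yet. A slightly more defensive start script could look like this sketch. It assumes the install location used earlier in this post and the single_agent.conf file created above; AGENT_NAME and CONF_FILE are just variable names chosen for the example.

```shell
# Assumed install location (from the deployment step); override FLUME_HOME if yours differs.
FLUME_HOME="${FLUME_HOME:-/usr/local/src/apache-flume-1.8.0-bin}"
AGENT_NAME=a1
CONF_FILE="$FLUME_HOME/conf/single_agent.conf"

if [ -x "$FLUME_HOME/bin/flume-ng" ]; then
    # Make sure the redirect target exists before starting the agent.
    mkdir -p "$FLUME_HOME/logs"
    nohup "$FLUME_HOME/bin/flume-ng" agent \
        --conf "$FLUME_HOME/conf" \
        --conf-file "$CONF_FILE" \
        --name "$AGENT_NAME" \
        -Dflume.root.logger=INFO,console \
        >> "$FLUME_HOME/logs/flume.log" 2>&1 &
    echo "started agent $AGENT_NAME with pid $!"
else
    echo "flume-ng not found under $FLUME_HOME/bin"
fi
```

The value passed to --name has to match the prefix used in the configuration file (a1 here); otherwise the agent starts with no components.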

Check whether the agent started successfully:

ps -ef | grep flume

Then check whether log files are being created in HDFS.
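One way to do that check from the command line is sketched here. It assumes the /icrm/%y-%m-%d layout from hdfs.path and relies on %y, %m and %d meaning the same thing to date as they do to the HDFS sink; target_dir is a name made up for the example.

```shell
# Directory the sink should be writing to today (same %y-%m-%d pattern as hdfs.path).
target_dir="/icrm/$(date +%y-%m-%d)"
echo "looking in $target_dir"

if command -v hdfs >/dev/null 2>&1; then
    # Files that are still being written carry a .tmp suffix until they are rolled.
    hdfs dfs -ls "$target_dir" || echo "nothing under $target_dir yet"
else
    echo "hdfs command not found; run this on a node with the Hadoop client installed"
fi
```

Because rollInterval is 600 seconds, a freshly started agent may show only a .tmp file for the first ten minutes.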
