Flume: Log Processing in the Enterprise
Enterprise log layout, scenario 1
Logs are laid out as 201611/20161112.log.tmp; the next day the file is renamed to 20161112.log and a new 20161113.log.tmp is started.

Copy flume-conf.properties.template to dir-mem-hdfs.properties. The goal: watch a directory, upload every new file to HDFS, and filter out the .tmp files still being written.

dir-mem-hdfs.properties:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined the source
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/data/log_hive/20161109
a1.sources.s1.includePattern = ([^ ]*\.log$)   # only pick up files matching this regex
a1.sources.s1.ignorePattern = ([^ ]*\.tmp$)    # skip files matching this regex

# defined the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

# defined the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = /flume/spdir
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 20480
a1.sinks.k1.hdfs.rollCount = 0

# bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Run from the Flume directory:

bin/flume-ng agent -c conf/ -n a1 -f conf/dir-mem-hdfs.properties -Dflume.root.logger=INFO,console

A memory channel is used here; a file channel is safer, since buffered events survive an agent crash.
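A rough Python sketch of how the two regexes above select files (the filenames are examples; this only illustrates the include/ignore semantics, not Flume's actual scanning code):

```python
import re

# Patterns taken verbatim from dir-mem-hdfs.properties
include = re.compile(r"([^ ]*\.log$)")
ignore = re.compile(r"([^ ]*\.tmp$)")

files = ["20161112.log", "20161113.log.tmp", "notes.txt"]

# A file is collected when it matches includePattern and not ignorePattern
collected = [f for f in files if include.match(f) and not ignore.match(f)]
print(collected)  # ['20161112.log']
```

The in-progress .tmp file is left alone until it is renamed to .log, at which point the spooling directory source picks it up.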
Enterprise log layout, scenario 2
Logs are laid out as 201611/20161112.log, and the next day writes continue into the same 20161112.log. This needs the behavior of both exec (tailing a growing file) and spooldir (watching a directory), so we build the taildir source from Flume 1.7 and integrate it into our existing Flume install.

1. Download and install Git on Windows.
2. Create an empty folder (avoid non-ASCII characters in the path), e.g. GitHub.
3. Common git commands:

$ pwd
$ ls
$ cd /C/Users/Administrator/Desktop/GitHub
$ git clone https://github.com/apache/flume.git
$ cd flume
$ git branch -r                   # list the remote branches
$ git branch                      # show the current branch
$ git checkout origin/flume-1.7   # switch to the flume-1.7 branch

Copy flume\flume-ng-sources\flume-taildir-source and import the flume-taildir-source project into Eclipse. Modify pom.xml:

<repositories>
  <repository>
    <id>cloudera</id>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<modelVersion>4.0.0</modelVersion>
<groupId>org.apache.flume.flume-ng-sources</groupId>
<artifactId>flume-taildir-source</artifactId>
<version>1.5.0-cdh5.3.6</version>
<name>Flume Taildir Source</name>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.3.2</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>

<dependencies>
  <dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.5.0-cdh5.3.6</version>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.10</version>
    <scope>test</scope>
  </dependency>
</dependencies>

4. Run the Maven build, take the resulting jar, and put it into the current Flume install's lib directory.
Create the directories and files:

$ mkdir -p /opt/cdh-5.6.3/apache-flume-1.5.0-cdh5.3.6-bin/position
$ mkdir -p /opt/data/tail/hadoop-dir/
$ echo "" > /opt/data/tail/hadoop.log

Copy flume-conf.properties.template to tail-mem-hdfs.properties. The required parameters can be read from the source code:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# defined the source
a1.sources.s1.type = org.apache.flume.source.taildir.TaildirSource
a1.sources.s1.positionFile = /opt/cdh-5.6.3/apache-flume-1.5.0-cdh5.3.6-bin/position/taildir_position.json
a1.sources.s1.filegroups = f1 f2
a1.sources.s1.filegroups.f1 = /opt/data/tail/hadoop.log
a1.sources.s1.filegroups.f2 = /opt/data/tail/hadoop-dir/.*
a1.sources.s1.headers.f1.headerKey1 = value1
a1.sources.s1.headers.f2.headerKey1 = value2-1
a1.sources.s1.headers.f2.headerKey2 = value2-2
a1.sources.s1.fileHeader = true

# defined the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000

# defined the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.path = /flume/spdir
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 20480
a1.sinks.k1.hdfs.rollCount = 0

# bind the source and sink to the channel
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Run from the Flume directory:

bin/flume-ng agent -c conf/ -n a1 -f conf/tail-mem-hdfs.properties -Dflume.root.logger=INFO,console

Then append to the tracked files, or drop new files into the directory, to test.
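The positionFile is how the taildir source resumes after a restart: it keeps a JSON record per tracked file with the byte offset read so far. A small Python sketch of that layout (the inode and offset values here are made-up illustrations):

```python
import json
import os
import tempfile

# Illustrative records in the shape TaildirSource keeps in taildir_position.json:
# one entry per tracked file, with the byte offset (`pos`) consumed so far.
records = [
    {"inode": 2496272, "pos": 166, "file": "/opt/data/tail/hadoop.log"},
]

path = os.path.join(tempfile.mkdtemp(), "taildir_position.json")
with open(path, "w") as f:
    json.dump(records, f)

# On restart, the source reloads these offsets and resumes tailing from `pos`,
# so no line is re-sent and no line is lost.
with open(path) as f:
    restored = json.load(f)
print(restored[0]["pos"])  # 166
```

This is also why the position directory is created up front in the steps above.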
A common enterprise architecture: Flume with multiple sinks
The same data is collected once and delivered to different frameworks for processing. Source: one copy of the data. Channels: several. Sinks: several. If multiple sinks pull from a single channel, they compete for events and each sink gets only part of the data; to give every sink a complete copy, the source replicates each event into a separate channel per sink.

Design: source = hive.log, channels = file, sinks = hdfs (different paths).

Copy flume-conf.properties.template to hive-file-sinks.properties.

hive-file-sinks.properties:

a1.sources = s1
a1.channels = c1 c2
a1.sinks = k1 k2

# defined the source
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /opt/cdh-5.6.3/hive-0.13.1-cdh5.3.6/logs/hive.log
a1.sources.s1.shell = /bin/sh -c

# defined the channel 1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/cdh-5.6.3/apache-flume-1.5.0-cdh5.3.6-bin/datas/checkp1
a1.channels.c1.dataDirs = /opt/cdh-5.6.3/apache-flume-1.5.0-cdh5.3.6-bin/datas/data1

# defined the channel 2
a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/cdh-5.6.3/apache-flume-1.5.0-cdh5.3.6-bin/datas/checkp2
a1.channels.c2.dataDirs = /opt/cdh-5.6.3/apache-flume-1.5.0-cdh5.3.6-bin/datas/data2

# defined the sink 1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/hdfs/sink1
a1.sinks.k1.hdfs.fileType = DataStream

# defined the sink 2
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /flume/hdfs/sink2
a1.sinks.k2.hdfs.fileType = DataStream

# bind the source to both channels, and each sink to its own channel
a1.sources.s1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Run from the Flume directory:

bin/flume-ng agent -c conf/ -n a1 -f conf/hive-file-sinks.properties -Dflume.root.logger=INFO,console

Then generate some log lines from the Hive directory:

bin/hive -e "show databases"
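The fan-out above can be modeled in a few lines of Python. This toy sketch (not Flume code) shows why replicating into one channel per sink gives every sink a complete copy, which a single shared channel would not:

```python
from queue import Queue

# Toy model of the config above: one source replicating every event
# into two independent channels, each drained by its own sink.
channels = {"c1": Queue(), "c2": Queue()}

def source_put(event):
    # Replicate the event into every channel (Flume's default
    # channel selector for a multi-channel source is "replicating")
    for ch in channels.values():
        ch.put(event)

for line in ["log line 1", "log line 2"]:
    source_put(line)

# Each sink drains only its own channel, yet both see the full stream
sink1 = [channels["c1"].get() for _ in range(channels["c1"].qsize())]
sink2 = [channels["c2"].get() for _ in range(channels["c2"].qsize())]
print(sink1 == sink2 == ["log line 1", "log line 2"])  # True
```

Had both sinks drained one shared queue, each event would go to whichever sink happened to take it first, splitting the stream instead of duplicating it.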