Flume学习之路 Flume的Source类型

Posted

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Flume学习之路 Flume的Source类型相关的知识,希望对你有一定的参考价值。

一、概述

官方文档介绍
http://flume.apache.org/FlumeUserGuide.html#flume-sources

二、Flume Sources 描述

2.1 Avro Source

2.1.1 介绍

Avro端口监听并接收来自外部的Avro客户流的事件。当内置Avro去Sinks另一个配对Flume代理,它就可以创建分层采集的拓扑结构。官网说的比较绕,当然我的翻译也很弱,其实就是flume可以多级代理,然后代理与代理之间用Avro去连接。==字体加粗的属性必须进行设置==。

Property Name Default Description
channels
type The component type name, needs to be avro
bind hostname or IP address to listen on
port Port # to bind to
threads Maximum number of worker threads to spawn
selector.type
selector.*
interceptors Space-separated list of interceptors
interceptors.*
compression-type none This can be “none” or “deflate”. The compression-type must match the compression-type of matching AvroSource
ssl false Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”.
keystore This is the path to a Java keystore file. Required for SSL.
keystore-password The password for the Java keystore. Required for SSL.
keystore-type JKS The type of the Java keystore. This can be “JKS” or “PKCS12”.
exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
ipFilter false Set this to true to enable ipFiltering for netty
ipFilterRules Define N netty ipFilter pattern rules with this config.

2.1.2 示例

示例请参考官方文档

进入flume文件中的conf目录下,创建一个a1.conf文件。定义:sinks,channels,sources

#a1.conf:单节点Flume配置

# 命名此代理上的组件
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#配置sources
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

#配置sinks
a1.sinks.k1.type = logger

#配置channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

#为sources和sinks绑定channels
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

启动 Flume

[[email protected] flume]# bin/flume-ng agent --conf conf --conf-file conf/a1.conf --name a1 -Dflume.root.logger=INFO,console
或者
[[email protected] flume]# bin/flume-ng agent -c conf -f conf/a1.conf -n a1 -Dflume.root.logger=INFO,console

测试 Flume

重新打开一个终端,我们可以telnet端口44444并向Flume发送一个事件:

[[email protected] ~]# telnet localhost 44444
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is ‘^]‘.
Hello world! <ENTER>  # 输入的内容
OK

原始的Flume终端将在日志消息中输出事件:

2018-11-02 15:29:47,203 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:155)] Source starting
2018-11-02 15:29:47,214 (lifecycleSupervisor-1-0) [INFO - org.apache.flume.source.NetcatSource.start(NetcatSource.java:166)] CreatedserverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]
2018-11-02 15:29:58,507 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 48 65 6C 6C 6F 20 57 6F 72 6C 64 21 0D          Hello World!. }

2.2 Thrift Source

ThriftSource 与Avro Source 基本一致。只要把source的类型改成thrift即可,例如a1.sources.r1.type = thrift,比较简单,不做赘述。

Property Name Default Description
channels
type The component type name, needs to be thrift
bind hostname or IP address to listen on
port Port # to bind to
threads Maximum number of worker threads to spawn
selector.type
selector.*
interceptors Space separated list of interceptors
interceptors.*
ssl false Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”.
keystore This is the path to a Java keystore file. Required for SSL.
keystore-password The password for the Java keystore. Required for SSL.
keystore-type JKS The type of the Java keystore. This can be “JKS” or “PKCS12”.
exclude-protocols SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols specified.
kerberos false Set to true to enable kerberos authentication. In kerberos mode, agent-principal and agent-keytab are required for successful authentication. The Thrift source in secure mode, will accept connections only from Thrift clients that have kerberos enabled and are successfully authenticated to the kerberos KDC.
agent-principal The kerberos principal used by the Thrift Source to authenticate to the kerberos KDC.
agent-keytab The keytab location used by the Thrift Source in combination with the agent-principal to authenticate to the kerberos KDC.

2.3 Exec Source

2.3.1 介绍

ExecSource的配置就是设定一个Unix(linux)命令,然后通过这个命令不断输出数据。如果进程退出,Exec Source也一起退出,不会产生进一步的数据。

下面是官网给出的source的配置,加粗的参数是必选,描述就不解释了。

Property Name Default Description
channels
type The component type name, needs to be exec
command The command to execute
shell A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards, back ticks, pipes etc.
restartThrottle 10000 Amount of time (in millis) to wait before attempting a restart
restart false Whether the executed cmd should be restarted if it dies
logStdErr false Whether the command’s stderr should be logged
batchSize 20 The max number of lines to read and send to the channel at a time
batchTimeout 3000 Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream
selector.type replicating replicating or multiplexing
selector.* Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*

2.3.2 示例

创建一个a2.conf文件

#配置文件
#Name the components on this agent  
a1.sources= s1
a1.sinks= k1
a1.channels= c1

#配置sources
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/flume/test.log
a1.sources.s1.channels = c1

#配置sinks 
a1.sinks.k1.type= logger
a1.sinks.k1.channel= c1

#配置channel
a1.channels.c1.type= memory

启动 Flume

[[email protected] flume]# ./bin/flume-ng agent --conf conf --conf-file ./conf/a2.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true

测试 Flume

重新打开一个终端,我们往监听的日志里添加数据:

[[email protected] ~]# echo "hello world" >> test.log

原始的Flume终端将在日志消息中输出事件:

2018-11-03 03:47:32,508 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world }

2.4 JMS Source

2.4.1 介绍

从JMS系统(消息、主题)中读取数据,ActiveMQ已经测试过

Property Name Default Description
channels
type The component type name, needs to be jms
initialContextFactory Inital Context Factory, e.g: org.apache.activemq.jndi.ActiveMQInitialContextFactory
connectionFactory The JNDI name the connection factory should appear as
providerURL The JMS provider URL
destinationName Destination name
destinationType Destination type (queue or topic)
messageSelector Message selector to use when creating the consumer
userName Username for the destination/provider
passwordFile File containing the password for the destination/provider
batchSize 100 Number of messages to consume in one batch
converter.type DEFAULT Class to use to convert messages to flume events. See below.
converter.* Converter properties.
converter.charset UTF-8 Default converter only. Charset to use when converting JMS TextMessages to byte arrays.
createDurableSubscription false Whether to create durable subscription. Durable subscription can only be used with destinationType topic. If true, “clientId” and “durableSubscriptionName” have to be specified.
clientId JMS client identifier set on Connection right after it is created. Required for durable subscriptions.
durableSubscriptionName Name used to identify the durable subscription. Required for durable subscriptions.

2.4.2 官网示例

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = jms
a1.sources.r1.channels = c1
a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
a1.sources.r1.connectionFactory = GenericConnectionFactory
a1.sources.r1.providerURL = tcp://mqserver:61616
a1.sources.r1.destinationName = BUSINESS_DATA
a1.sources.r1.destinationType = QUEUE

2.5 Spooling Directory Source

2.5.1 介绍

Spooling Directory Source监测配置的目录下新增的文件,并将文件中的数据读取出来。其中,Spool Source有2个注意地方,第一个是拷贝到spool目录下的文件不可以再打开编辑,第二个是spool目录下不可包含相应的子目录。这个主要用途作为对日志的准实时监控。

下面是官网给出的source的配置,加粗的参数是必选。可选项太多,这边就介绍一个fileSuffix,即文件读取后添加的后缀名,这个是可以更改。

Property Name Default Description
channels
type The component type name, needs to be spooldir.
spoolDir The directory from which to read files from.
fileSuffix .COMPLETED Suffix to append to completely ingested files

2.5.2 示例

创建一个a3.conf文件

a1.sources = s1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.s1.type =spooldir
a1.sources.s1.spoolDir =/opt/flume/logs
a1.sources.s1.fileHeader= true
a1.sources.s1.channels =c1

# Describe the sink
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

# Use a channel which buffers events inmemory
a1.channels.c1.type = memory

启动 Flume

[[email protected] flume]# ./bin/flume-ng agent --conf conf --conf-file ./conf/a3.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true

重新打开一个终端,我们将test.log移动到logs目录:

[[email protected] flume]# cp test.log logs/

原始的Flume终端将在日志消息中输出事件:

2018-11-03 03:54:54,207 (pool-3-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(ReliableSpoolingFileEventReader.java:324)] Last read took us just up to a file boundary. Rolling to the next file, if there is one.
2018-11-03 03:54:54,207 (pool-3-thread-1) [INFO - org.apache.flume.client.avro.ReliableSpoolingFileEventReader.rollCurrentFile(ReliableSpoolingFileEventReader.java:433)] Preparing to move file /opt/flume/logs/test.log to /opt/flume/logs/test.log.COMPLETED

2.6 NetCat Source

2.6.1 介绍

Netcat source 在某一端口上进行侦听,它将每一行文字变成一个事件源,也就是数据是基于换行符分隔。它的工作就像命令nc?-k?-l?[host]?[port] 换句话说,它打开一个指定端口,侦听数据将每一行文字变成Flume事件,并通过连接通道发送。

下面是官网给出的source的配置,加粗的参数是必选。

Property Name Default Description
channels
type The component type name, needs to be?netcat
bind Host name or IP address to bind to
port Port # to bind to
max-line-length 512 Max line length per event body (in bytes)
ack-every-event TRUE Respond with an “OK” for every event received
selector.type replicating replicating or multiplexing
selector.* Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*

2.6.2 示例

实际例子,见 2.3.2 例子就是 Netcat source,这里不演示了。

2.7 Sequence Generator Source

一个简单的序列发生器,不断产成与事件计数器0和1的增量开始。主要用于测试(官网说),这里也不做赘述。

2.8 Syslog Sources

读取syslog数据,并生成Flume 事件。 这个Source分成三类SyslogTCP Source、

Multiport Syslog TCP Source(多端口)与SyslogUDP Source。其中TCP Source为每一个用回车( n)来分隔的字符串创建一个新的事件。而UDP Source将整个消息作为一个单一的事件。

下面是官网给出的source的配置,加粗的参数是必选。

Property Name Default Description
channels
type The component type name, needs to be syslogtcp
host Host name or IP address to bind to
port Port # to bind to
eventSize 2500 Maximum size of a single event line, in bytes
keepFields none Setting this to ‘all’ will preserve the Priority, Timestamp and Hostname in the body of the event. A spaced separated list of fields to include is allowed as well. Currently, the following fields can be included: priority, version, timestamp, hostname. The values ‘true’ and ‘false’ have been deprecated in favor of ‘all’ and ‘none’.
selector.type replicating or multiplexing
selector.* replicating Depends on the selector.type value
interceptors Space-separated list of interceptors
interceptors.*

2.8.1 Syslog TCPSource

2.8.1.1 介绍

这个是最初的Syslog Sources

下面是官网给出的source的配置,加粗的参数是必选,这里可选我省略了。

Property Name Default Description
channels
type The component type name, needs to be syslogtcp
host Host name or IP address to bind to
port Port # to bind to

2.8.1.2 示例

官方配置

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

创建一个a4.conf文件

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = syslogtcp
a1.sources.r1.port = 50000
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = logger
 a1.sinks.k1.channel = c1

# Use a channel which buffers events inmemory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

这里我们设置的侦听端口为localhost 50000
启动 Flume

[[email protected] flume]# ./bin/flume-ng agent --conf conf --conf-file ./conf/a4.conf --name a1 -Dflume.root.logger=INFO,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true

测试 Flume

重新打开一个终端,我们往监听端口发送数据:

[[email protected] ~]# echo "hello world" | nc localhost 50000

原始的Flume终端将在日志消息中输出事件:

2018-11-03 04:47:34,518 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world }

2.8.2 Multiport Syslog TCP Source

2.8.2.1 介绍

这是一个更新,更快,支持多端口版本的SyslogTCP Source。他不仅仅监控一个端口,还可以监控多个端口。官网配置基本差不多,就是可选配置比较多。

Property Name Default Description
channels
type The component type name, needs to be?multiport_syslogtcp
host Host name or IP address to bind to.
ports Space-separated list (one or more) of ports to bind to.
portHeader If specified, the port number will be stored in the header of each event using the header name specified here. This allows for interceptors and channel selectors to customize routing logic based on the incoming port.

这里说明下需要注意的是这里ports设置已经取代tcp 的port,这个千万注意。还有portHeader这个可以与后面的interceptors 与 channel selectors自定义逻辑路由使用。

2.8.2.2 示例

官方配置

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.channels = c1
a1.sources.r1.host = 0.0.0.0
a1.sources.r1.ports = 10001 10002 10003
a1.sources.r1.portHeader = port

创建一个a5.conf文件

# Name thecomponents on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#Describe/configure the source
a1.sources.r1.type = multiport_syslogtcp
a1.sources.r1.ports = 50000 60000
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe thesink
a1.sinks.k1.type= logger
 a1.sinks.k1.channel = c1

# Use a channelwhich buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity= 1000
a1.channels.c1.transactionCapacity= 100

这里我们侦听 localhost 的2个端口50000与60000

启动 Flume

[[email protected] flume]# ./bin/flume-ng agent --conf conf --conf-file ./conf/a5.conf --name a1 -Dflume.root.logger=INFO,console

测试 Flume

重新打开一个终端,我们往监听端口发送数据:

[[email protected] ~]# echo "hello world 01" | nc localhost 50000
[[email protected] ~]# echo "hello world 02" | nc localhost 60000

原始的Flume终端将在日志消息中输出事件:

2018-11-03 05:56:34,588 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{flume,.syslog,status=Invalid} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world 01 }
2018-11-03 05:56:34,588 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{flume,.syslog,status=Invalid} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world 02 }

2个端口的数据已经发送过来了。

2.8.2 Syslog UDP Source

2.8.2.1 介绍

其实这个就是与TCP不同的协议而已。
官网配置与TCP一致,就不说了。

2.8.2.1 示例

官方配置

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 5140
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

创建一个a6.conf文件

# Name thecomponents on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

#Describe/configure the source
a1.sources.r1.type = syslogudp
a1.sources.r1.port = 50000
a1.sources.r1.host = localhost
a1.sources.r1.channels = c1

# Describe thesink
a1.sinks.k1.type= logger
 a1.sinks.k1.channel = c1

# Use a channelwhich buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity= 1000
a1.channels.c1.transactionCapacity= 100

这里我们侦听 localhost 的2个端口50000与60000

启动 Flume

[[email protected] flume]# ./bin/flume-ng agent --conf conf --conf-file ./conf/a6.conf --name a1 -Dflume.root.logger=INFO,console

测试 Flume

重新打开一个终端,我们往监听端口发送数据:

[[email protected] ~]# echo "hello world" | nc –u localhost 50000

原始的Flume终端将在日志消息中输出事件:

2018-11-03 06:10:34,768 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{Serverity=0, flume,.syslog,status=Invalid, Facility=0} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello world }

Ok,数据已经发送过来了。

2.9 HTTP Source

2.9.1 介绍

HTTP Source是HTTP POST和GET来发送事件数据的,官网说GET应只用于实验。Flume 事件使用一个可插拔的“handler”程序来实现转换,它必须实现的HTTPSourceHandler接口。此处理程序需要一个HttpServletRequest和返回一个flume 事件列表。

所有在一个POST请求发送的事件被认为是在一个事务里,一个批量插入flume 通道的行为。

下面是官网给出的source的配置,加粗的参数是必选。

Property Name Default Description
type The component type name, needs to be?http
port The port the source should bind to.
bind 0.0.0.0 The hostname or IP address to listen on
handler org.apache.flume.source.http.JSONHandler The FQCN of the handler class.

2.9.2 示例

官方配置

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.example.rest.RestHandler
a1.sources.r1.handler.nickname = random props

创建一个a7.conf文件

#Name the components on this agent
a1.sources= r1
a1.sinks= k1
a1.channels= c1

#Describe/configure the source
a1.sources.r1.type= http
a1.sources.r1.port= 50000
a1.sources.r1.channels= c1

#Describe the sink
a1.sinks.k1.type= logger
 a1.sinks.k1.channel = c1

#Use a channel which buffers events in memory
a1.channels.c1.type= memory
a1.channels.c1.capacity= 1000
a1.channels.c1.transactionCapacity= 100

启动 Flume

[[email protected] flume]# ./bin/flume-ng agent --conf conf --conf-file ./conf/a7.conf --name a1 -Dflume.root.logger=INFO,console

测试 Flume

重新打开一个终端,我们用生成JSON 格式的POSTrequest发数据:

[[email protected] ~]# echo "hello world" | nc –u localhost 50000
curl -X POST -d ‘[{"headers" :{"test1" : "test1 is header","test2" : "test2 is header"},"body" : "hello test3"}]‘ http://localhost:50000

原始的Flume终端将在日志消息中输出事件:

2018-11-03 06:20:56,678 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{test1=test1 is header, test2=test2 is header} body: 68 65 6C 6C 6F 20 77 6F 72 6C 64                hello test2 }

这里headers与body都正常输出。

2.10 自定义Source

一个自定义 Source 其实是对 Source 接口的实现。当我们开始flume代理的时候必须将自定义 Source 和相依赖的jar包放到代理的 classpath 下面。自定义 Source 的 type 就是我们实现 Source 接口对应的类全路径。

以上是关于Flume学习之路 Flume的Source类型的主要内容,如果未能解决你的问题,请参考以下文章

Flume学习之路 Flume的配置方式

flume学习---source

Flume学习之路 Flume的基础介绍

Flume NG 学习笔记Source配置

大数据面试题flume篇

flume学习