大数据之数据收集

Posted 2020-11-28 boiledwater

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了大数据之数据收集相关的知识，希望对你有一定的参考价值。

大数据之数据收集

数据收集是大数据的基础。散落在各处的数据，只有经过了数据收集，才会集中起来，提供了后续处理的可能。从大数据技术发展以来，出现了很多数据收集的技术框架，本文试图在若干流行的数据收集解决方案上加以叙述。

评估一个技术框架是否适合某个业务场景，通常需要考虑多个方面。

l 最基本的，考虑接口是否适配，收集socket数据了还是log数据，输出到哪里；

l 考虑技术框架的性能，是否满足业务的需求；

l 还需要考虑灵活性，如果需要做一些过滤或者自定义开发，是否容易；

l 考虑对性能的影响，数据收集不能影响了业务系统本身的运行，不能资源消耗太大；

l 考虑运维的难易程度，有的技术方案依赖很多，配置很复杂，就容易出错；

l 考虑技术框架是否高可靠，不会出现丢数据的情况。

相信通过上面几个方面的判断，应该可以找到合适的技术框架。

一、Flume

Flume是一种分布式，可靠且可用的服务，用于有效地收集，聚合和移动大量流事件数据。

source接口适配

Source Type	Comments
Avro Source	Listens on Avro port and receives events from external Avro client streams.
Thrift Source	Listens on Thrift port and receives events from external Thrift client streams.
Exec Source	Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true).
JMS Source	JMS Source reads messages from a JMS destination such as a queue or topic.
SSL and JMS Source	MS client implementations typically support to configure SSL/TLS via some Java system properties defined by JSSE (Java Secure Socket Extension).
Spooling Directory Source	This source lets you ingest data by placing files to be ingested into a “spooling” directory on disk.
Taildir Source	Watch the specified files, and tail them in nearly real-time once detected new lines appended to the each files.
Kafka Source	Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics.
NetCat TCP Source	A netcat-like source that listens on a given port and turns each line of text into an event.
NetCat UDP Source	As per the original Netcat (TCP) source, this source that listens on a given port and turns each line of text into an event and sent via the connected channel. Acts like nc -u -k -l [host] [port].
Sequence Generator Source	A simple sequence generator that continuously generates events with a counter that starts from 0, increments by 1 and stops at totalEvents.
Syslog Sources	Reads syslog data and generate Flume events.
Syslog TCP Source	The original, tried-and-true syslog TCP source.
Multiport Syslog TCP Source	his is a newer, faster, multi-port capable version of the Syslog TCP source.
Syslog UDP Source
HTTP Source	A source which accepts Flume Events by HTTP POST and GET.
Stress Source	StressSource is an internal load-generating source implementation which is very useful for stress tests.
Custom Source	A custom source is your own implementation of the Source interface.
Scribe Source	Scribe is another type of ingest system. To adopt existing Scribe ingest system, Flume should use ScribeSource based on Thrift with compatible transfering protocol.

sink接口适配

Sink Type	Comments
HDFS Sink	This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types.
Hive Sink	This sink streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions.
Logger Sink	Logs event at INFO level. Typically useful for testing/debugging purpose.
Avro Sink	This sink forms one half of Flume’s tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname / port pair.
Thrift Sink	This sink forms one half of Flume’s tiered collection support. Flume events sent to this sink are turned into Thrift events and sent to the configured hostname / port pair.
IRC Sink	The IRC sink takes messages from attached channel and relays those to configured IRC destinations.
File Roll Sink	Stores events on the local filesystem.
Null Sink	Discards all events it receives from the channel.
HBaseSink	This sink writes data to HBase. The Hbase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer which is specified by the configuration is used to convert the events into HBase puts and/or increments.
HBase2Sink	HBase2Sink is the equivalent of HBaseSink for HBase version 2.
AsyncHBaseSink	This sink writes data to HBase using an asynchronous model.
MorphlineSolrSink	This sink extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers, which in turn serve queries to end users or search applications.
ElasticSearchSink	This sink writes data to an elasticsearch cluster. By default, events will be written so that the Kibana graphical interface can display them - just as if logstash wrote them.
Kafka Sink	This is a Flume Sink implementation that can publish data to a Kafka topic.
HTTP Sink	Behaviour of this sink is that it will take events from the channel, and send those events to a remote service using an HTTP POST request. The event content is sent as the POST body.
Custom Sink	A custom sink is your own implementation of the Sink interface. A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent.

由上可见，Flume支持的接口比较丰富，最常用的基于文件的日志收集source以及同步到kafka的sink。

处理能力，处理能力和机器性能和数据都有关，在考虑的时候，既需要考虑每秒多少条数据，也需要考虑每秒多少兆数据。通常，在以file作为channel的时候，Flume可以支持每秒几十兆的数据处理，以memory作为channel的时候，可以支持每秒几百兆的数据处理。

灵活性，Flume的灵活性还是不错的，在数据处理的各个环节都预留有接口，方便进行个性化开发，再加上Flume本身也是java语言开发的，就更友好一些。

消耗，Flume本身的资源消耗还是比较多的，如果对资源消耗敏感，经过参数调优之后，使用的资源能够降低不少。

维护性，Flume的依赖不多，jar包下载既可用，有人觉得配置起来比较麻烦，不过灵活性带来的就是复杂性，个人感觉还好。

可靠性，Flume的可靠性在几个方面都有体现，首先需要选择合适的channel，来保证消息处理的可靠性，其次Flume自身还待遇LB的功能。

http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

https://blog.csdn.net/lijinqi1987/article/details/77506034

二、Logstash

Logstash本身是作为ELK的一员存在的，负责数据摄入，后来慢慢的也接入了更多的数据源和数据端。

Input接口

Input Plugin	Description
azure_event_hubs	Receives events from Azure Event Hubs
beats	Receives events from the Elastic Beats framework
cloudwatch	Pulls events from the Amazon Web Services CloudWatch API
couchdb_changes	Streams events from CouchDB’s _changes URI
dead_letter_queue	read events from Logstash’s dead letter queue
elasticsearch	Reads query results from an Elasticsearch cluster
exec	Captures the output of a shell command as an event
file	Streams events from files
ganglia	Reads Ganglia packets over UDP
gelf	Reads GELF-format messages from Graylog2 as events
generator	Generates random log events for test purposes
github	Reads events from a GitHub webhook
google_cloud_storage	Extract events from files in a Google Cloud Storage bucket
google_pubsub	Consume events from a Google Cloud PubSub service
graphite	Reads metrics from the graphite tool
heartbeat	Generates heartbeat events for testing
http	Receives events over HTTP or HTTPS
http_poller	Decodes the output of an HTTP API into events
imap	Reads mail from an IMAP server
irc	Reads events from an IRC server
java_generator	Generates synthetic log events
java_stdin	Reads events from standard input
jdbc	Creates events from JDBC data
jms	Reads events from a Jms Broker
jmx	Retrieves metrics from remote Java applications over JMX
kafka	Reads events from a Kafka topic
kinesis	Receives events through an AWS Kinesis stream
log4j	Reads events over a TCP socket from a Log4j SocketAppender object
lumberjack	Receives events using the Lumberjack protocl
meetup	Captures the output of command line tools as an event
pipe	Streams events from a long-running command pipe
puppet_facter	Receives facts from a Puppet server
rabbitmq	Pulls events from a RabbitMQ exchange
redis	Reads events from a Redis instance
relp	Receives RELP events over a TCP socket
rss	Captures the output of command line tools as an event
s3	Streams events from files in a S3 bucket
s3_sns_sqs	Reads logs from AWS S3 buckets using sqs
salesforce	Creates events based on a Salesforce SOQL query
snmp	Polls network devices using Simple Network Management Protocol (SNMP)
snmptrap	Creates events based on SNMP trap messages
sqlite	Creates events based on rows in an SQLite database
sqs	Pulls events from an Amazon Web Services Simple Queue Service queue
stdin	Reads events from standard input
stomp	Creates events received with the STOMP protocol
syslog	Reads syslog messages as events
tcp	Reads events from a TCP socket
twitter	Reads events from the Twitter Streaming API
udp	Reads events over UDP
unix	Reads events over a UNIX socket
varnishlog	Reads from the varnish cache shared memory log
websocket	Reads events from a websocket
wmi	Creates events based on the results of a WMI query
xmpp	Receives events over the XMPP/Jabber protocol

可以看到Logstash除了支持file，stdout，kafka等常规的input之外，还支持很多乱七八糟的input，这说明其作为一个ELK的数据摄入是合格的，但是否是我们需要的，则要仔细评估。

Output接口

Output Plugin	Description
boundary	Sends annotations to Boundary based on Logstash events
circonus	Sends annotations to Circonus based on Logstash events
cloudwatch	Aggregates and sends metric data to AWS CloudWatch
csv	Writes events to disk in a delimited format
datadog	Sends events to DataDogHQ based on Logstash events
datadog_metrics	Sends metrics to DataDogHQ based on Logstash events
elastic_app_search	Sends events to the Elastic App Search solution
elasticsearch	Stores logs in Elasticsearch
email	Sends email to a specified address when output is received
exec	Runs a command for a matching event
file	Writes events to files on disk
ganglia	Writes metrics to Ganglia’s gmond
gelf	Generates GELF formatted output for Graylog2
google_bigquery	Writes events to Google BigQuery
google_cloud_storage	Uploads log events to Google Cloud Storage
google_pubsub	Uploads log events to Google Cloud Pubsub
graphite	Writes metrics to Graphite
graphtastic	Sends metric data on Windows
http	Sends events to a generic HTTP or HTTPS endpoint
influxdb	Writes metrics to InfluxDB
irc	Writes events to IRC
java_sink	Discards any events received
java_stdout	Prints events to the STDOUT of the shell
juggernaut	Pushes messages to the Juggernaut websockets server
kafka	Writes events to a Kafka topic
librato	Sends metrics, annotations, and alerts to Librato based on Logstash events
loggly	Ships logs to Loggly
lumberjack	Sends events using the lumberjack protocol
metriccatcher	Writes metrics to MetricCatcher
mongodb	Writes events to MongoDB
nagios	Sends passive check results to Nagios
nagios_nsca	Sends passive check results to Nagios using the NSCA protocol
opentsdb	Writes metrics to OpenTSDB
pagerduty	Sends notifications based on preconfigured services and escalation policies
pipe	Pipes events to another program’s standard input
rabbitmq	Pushes events to a RabbitMQ exchange
redis	Sends events to a Redis queue using the RPUSH command
redmine	Creates tickets using the Redmine API
riak	Writes events to the Riak distributed key/value store
riemann	Sends metrics to Riemann
s3	Sends Logstash events to the Amazon Simple Storage Service
sns	Sends events to Amazon’s Simple Notification Service
solr_http	Stores and indexes logs in Solr
sqs	Pushes events to an Amazon Web Services Simple Queue Service queue
statsd	Sends metrics using the statsd network daemon
stdout	Prints events to the standard output
stomp	Writes events using the STOMP protocol
syslog	Sends events to a syslog server
tcp	Writes events over a TCP socket
timber	Sends events to the Timber.io logging service
udp	Sends events over UDP
webhdfs	Sends Logstash events to HDFS using the webhdfs REST API
websocket	Publishes messages to a websocket
xmpp	Posts events over XMPP
zabbix	Sends events to a Zabbix server

与Input类似，Output首先是一大批ES自己的东西，对于Hadoop系统的支持本身比较少，但在一些NoSQL数据库方面的支持，相对多一些，比如Redis、MongoDB等。

处理能力，Logstash处理能力在每秒几千条的规模上。

灵活性，Logstash提供了强大的数据过滤和预处理能力。

消耗，Logstash对资源的要求比较高，需要比较多的内存资源。

运维，Logstash本身以JRuby写成，依赖和配置的复杂度比较高。

可靠，Logstash是单机运行，极端情况下存在丢数据的可能。

https://www.elastic.co/guide/en/logstash/current/index.html

https://doc.yonyoucloud.com/doc/logstash-best-practice-cn/index.html

三、FileBeat

FileBeat也是ES推出的数据收集的技术框架，相对于Logstash而言，支持数据处理的能力要弱一些，不过这正是其目的——轻量化，资源消耗就很低。

Input接口

Input type	Comments
Log	read lines from log files.
Stdin	read events from standard in.
Container	read containers log files.
Kafka	read from topics in a Kafka cluster.
Redis	read entries from Redis slowlogs.
UDP	read events over UDP.
Docker	read logs from Docker containers.
TCP	read events over TCP.
Syslog	read events over TCP or UDP, this input will parse BSD (rfc3164) event and some variant.
s3	retrieve logs from S3 objects that are pointed by messages from specific SQS queues.
NetFlow	read NetFlow and IPFIX exported flows and options records over UDP.
Google Pub/Sub	read messages from a Google Cloud Pub/Sub topic subscription.
Azure eventhub	read messages from an azure eventhub.

可以看到FileBeat支持的Input种类比较少，但是常规的文件、标准输出、Kafka等都支持，另外对特定产品的支持还是有的。

Output接口

Output type	Comments
Elasticsearch	sends the transactions directly to Elasticsearch by using the Elasticsearch HTTP API.
Logstash	sends events directly to Logstash by using the lumberjack protocol, which runs over TCP. Logstash allows for additional processing and routing of generated events.
Kafka	sends the events to Apache Kafka.
Redis	inserts the events into a Redis list or a Redis channel.
File	dumps the transactions into a file where each transaction is in a JSON format.
Console	writes events in JSON format to stdout.

由以上组件接口可以看出，FileBeat在支持持久化方面还是比较弱的，由于本身孵化自ES，所以持久化的大部分为ES系组件，但对Kafka和Redis的支持，也能满足一定的需求。

处理能力，FileBeat的处理能力相对一般，满足基本的需求可以，类似每秒几千条数据，如果数据量再多，就需要特殊处理。

灵活性，FileBeat提供了一定的灵活性，不过GO语言本身就有门槛，不如Flume和Logstash灵活性和处理能力强。

消耗，由于以GO编写，所以正常情况下，资源消耗不多，但当遇到event的消息比较大时，在默认配置下容易出现OOM的情况。

运维，FileBeat本身是用GO写的，所以没有额外的依赖，但配置文件采用类YAML格式，可配的内容还是比较多的。

可靠，没看到FileBeat本身对数据传输过程本身防丢失采取的策略，所以极端情况下，存在丢数据的可能。

https://www.elastic.co/guide/en/beats/filebeat/current/index.html

以上是关于大数据之数据收集的主要内容，如果未能解决你的问题，请参考以下文章

大数据集群组件之Flume的配置及其使用

大数据技术之_18_大数据离线平台_02_Nginx+Mysql+数据收集+Web 工程 JS/JAVA SDK 讲解+Flume 故障后-如何手动上传 Nginx 日志文件至 HDFS 上(示例代码

如何编写一个网络数据收集器？

大数据分析-web图表展示-收集

收集各大互联网公司大数据平台架构

「大数据程序员开发工具」日志收集系统——Flume的功能与架构