Big Data: Data Collection

Posted boiledwater


 

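Data collection is the foundation of big data. Data scattered across many places only becomes usable for later processing once it has been gathered in one place. Many data collection frameworks have appeared as big data technology has developed; this article describes several of the popular ones.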

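Deciding whether a framework suits a particular business scenario usually means weighing several aspects:

- Most basically, whether the interfaces fit: is the data being collected socket data or log data, and where does the output go?

- Whether the framework's performance meets the needs of the business.

- Flexibility: if some filtering or custom development is needed, how easy is it?

- Impact on the host: data collection must not disturb the business system itself or consume too many resources.

- Ease of operation: some solutions have many dependencies and complicated configuration, which makes mistakes more likely.

- Reliability: whether the framework is highly reliable and does not lose data.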

Judging a candidate on these aspects should be enough to find a suitable framework.
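One way to make the comparison concrete is a small weighted decision matrix. The sketch below is only an illustration: the criterion names mirror the aspects listed above, while the weights and the sample scores are hypothetical and should be replaced with values that reflect your own scenario.

```python
from typing import Dict

# Criteria mirror the evaluation aspects above; the weights are hypothetical.
WEIGHTS: Dict[str, float] = {
    "interface_fit": 0.25,
    "throughput": 0.20,
    "flexibility": 0.15,
    "resource_cost": 0.15,
    "operability": 0.10,
    "reliability": 0.15,
}


def weighted_score(scores: Dict[str, int],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (1 = poor, 5 = good) into one number."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Hypothetical scores for an imaginary candidate framework.
candidate = {
    "interface_fit": 5,
    "throughput": 4,
    "flexibility": 4,
    "resource_cost": 2,
    "operability": 3,
    "reliability": 4,
}
print(round(weighted_score(candidate), 2))
```

The single number is less important than the exercise of scoring every criterion explicitly, which makes it harder to overlook one of them.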

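1. Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.

Source interfaces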

| Source Type | Comments |
| --- | --- |
| Avro Source | Listens on Avro port and receives events from external Avro client streams. |
| Thrift Source | Listens on Thrift port and receives events from external Thrift client streams. |
| Exec Source | Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless property logStdErr is set to true). |
| JMS Source | JMS Source reads messages from a JMS destination such as a queue or topic. |
| SSL and JMS Source | JMS client implementations typically support configuring SSL/TLS via some Java system properties defined by JSSE (Java Secure Socket Extension). |
| Spooling Directory Source | This source lets you ingest data by placing files to be ingested into a "spooling" directory on disk. |
| Taildir Source | Watches the specified files and tails them in near real time once new lines are detected appended to each file. |
| Kafka Source | Kafka Source is an Apache Kafka consumer that reads messages from Kafka topics. |
| NetCat TCP Source | A netcat-like source that listens on a given port and turns each line of text into an event. |
| NetCat UDP Source | As per the original Netcat (TCP) source, this source listens on a given port, turns each line of text into an event, and sends it via the connected channel. Acts like nc -u -k -l [host] [port]. |
| Sequence Generator Source | A simple sequence generator that continuously generates events with a counter that starts from 0, increments by 1 and stops at totalEvents. |
| Syslog Sources | Read syslog data and generate Flume events. |
| Syslog TCP Source | The original, tried-and-true syslog TCP source. |
| Multiport Syslog TCP Source | This is a newer, faster, multi-port capable version of the Syslog TCP source. |
| Syslog UDP Source | The UDP variant of the syslog source; it treats an entire message as a single event. |
| HTTP Source | A source which accepts Flume Events by HTTP POST and GET. |
| Stress Source | StressSource is an internal load-generating source implementation which is very useful for stress tests. |
| Custom Source | A custom source is your own implementation of the Source interface. |
| Scribe Source | Scribe is another type of ingest system. To adopt an existing Scribe ingest system, Flume should use ScribeSource based on Thrift with a compatible transferring protocol. |

 

 

 

Sink interfaces

| Sink Type | Comments |
| --- | --- |
| HDFS Sink | This sink writes events into the Hadoop Distributed File System (HDFS). It currently supports creating text and sequence files. It supports compression in both file types. |
| Hive Sink | This sink streams events containing delimited text or JSON data directly into a Hive table or partition. Events are written using Hive transactions. |
| Logger Sink | Logs event at INFO level. Typically useful for testing/debugging purpose. |
| Avro Sink | This sink forms one half of Flume’s tiered collection support. Flume events sent to this sink are turned into Avro events and sent to the configured hostname / port pair. |
| Thrift Sink | This sink forms one half of Flume’s tiered collection support. Flume events sent to this sink are turned into Thrift events and sent to the configured hostname / port pair. |
| IRC Sink | The IRC sink takes messages from attached channel and relays those to configured IRC destinations. |
| File Roll Sink | Stores events on the local filesystem. |
| Null Sink | Discards all events it receives from the channel. |
| HBaseSink | This sink writes data to HBase. The HBase configuration is picked up from the first hbase-site.xml encountered in the classpath. A class implementing HbaseEventSerializer which is specified by the configuration is used to convert the events into HBase puts and/or increments. |
| HBase2Sink | HBase2Sink is the equivalent of HBaseSink for HBase version 2. |
| AsyncHBaseSink | This sink writes data to HBase using an asynchronous model. |
| MorphlineSolrSink | This sink extracts data from Flume events, transforms it, and loads it in near-real-time into Apache Solr servers, which in turn serve queries to end users or search applications. |
| ElasticSearchSink | This sink writes data to an elasticsearch cluster. By default, events will be written so that the Kibana graphical interface can display them - just as if logstash wrote them. |
| Kafka Sink | This is a Flume Sink implementation that can publish data to a Kafka topic. |
| HTTP Sink | Behaviour of this sink is that it will take events from the channel, and send those events to a remote service using an HTTP POST request. The event content is sent as the POST body. |
| Custom Sink | A custom sink is your own implementation of the Sink interface. A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. |

As shown above, Flume supports a rich set of interfaces. The most commonly used are the file-based log collection sources and the Kafka sink.
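To illustrate that common combination, here is a minimal sketch that renders a Flume agent properties file wiring a Taildir source through a file channel into a Kafka sink. The property keys follow the Flume 1.9.0 user guide; the component names (r1, c1, k1), the directories under /var/lib/flume, and the sample pattern, topic, and broker address are all placeholder assumptions.

```python
def render_flume_agent(agent: str, log_pattern: str, topic: str, brokers: str) -> str:
    """Render a Taildir source -> file channel -> Kafka sink agent config."""
    props = {
        "sources": "r1",
        "channels": "c1",
        "sinks": "k1",
        # Taildir source: tails every file that matches the pattern.
        "sources.r1.type": "TAILDIR",
        "sources.r1.positionFile": "/var/lib/flume/taildir_position.json",
        "sources.r1.filegroups": "f1",
        "sources.r1.filegroups.f1": log_pattern,
        "sources.r1.channels": "c1",
        # File channel: buffers events on disk between source and sink.
        "channels.c1.type": "file",
        "channels.c1.checkpointDir": "/var/lib/flume/checkpoint",
        "channels.c1.dataDirs": "/var/lib/flume/data",
        # Kafka sink: publishes each event to the given topic.
        "sinks.k1.type": "org.apache.flume.sink.kafka.KafkaSink",
        "sinks.k1.kafka.bootstrap.servers": brokers,
        "sinks.k1.kafka.topic": topic,
        "sinks.k1.channel": "c1",
    }
    return "\n".join(f"{agent}.{key} = {value}" for key, value in props.items())


# Placeholder values; adjust paths, topic, and brokers to your environment.
print(render_flume_agent("a1", "/var/log/app/.*\\.log", "app-logs", "localhost:9092"))
```

The printed text can be saved as a properties file and passed to the agent with the flume-ng launcher's --conf-file and --name options.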

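Throughput: this depends on both the hardware and the data, so consider both events per second and megabytes per second. As a rough guide, Flume can handle tens of megabytes per second with a file channel and hundreds of megabytes per second with a memory channel.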
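The reason both numbers matter is that the same event rate can mean very different byte rates. The arithmetic can be sketched as follows; the workloads and the 50 MB/s budget are hypothetical figures chosen only to fall within the "tens of megabytes" range mentioned above, not measured results.

```python
def megabytes_per_second(events_per_second: float, avg_event_bytes: float) -> float:
    """Convert an event rate and average event size into MB/s (1 MB = 10**6 bytes)."""
    return events_per_second * avg_event_bytes / 1_000_000


def fits_budget(events_per_second: float, avg_event_bytes: float,
                budget_mb_per_second: float) -> bool:
    """Check whether a workload stays within an assumed channel budget."""
    return megabytes_per_second(events_per_second, avg_event_bytes) <= budget_mb_per_second


# Hypothetical budget for a file channel; measure your own deployment.
FILE_CHANNEL_BUDGET = 50.0

# Same event rate, very different byte rates.
small_events = megabytes_per_second(20_000, 500)      # 500-byte log lines
large_events = megabytes_per_second(20_000, 10_000)   # 10 KB JSON documents

print(small_events, fits_budget(20_000, 500, FILE_CHANNEL_BUDGET))
print(large_events, fits_budget(20_000, 10_000, FILE_CHANNEL_BUDGET))
```

With these made-up numbers the first workload fits comfortably, while the second would call for a faster channel or more agents even though the events-per-second figure is identical.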

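Flexibility: Flume is quite flexible. Every stage of the data flow exposes extension points for custom development, and because Flume itself is written in Java, it is that much friendlier to Java teams.

Resource consumption: Flume itself uses a fair amount of resources. If that matters, parameter tuning can reduce its usage considerably.

Maintainability: Flume has few dependencies and works as soon as the release package is downloaded. Some people find it tedious to configure, but complexity is the price of flexibility, and in my experience it is manageable.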

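Reliability: Flume addresses reliability in several ways. First, choosing a suitable channel determines how reliably events are handled; second, Flume has built-in load balancing (LB).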
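In Flume itself, load balancing is configured through sink groups (a sink processor of type load_balance). The idea behind it can be sketched with a toy selector: rotate through the sinks, skip any sink that fails, and leave the event buffered if every sink fails. This is a simplified illustration with hypothetical names, not Flume's actual implementation.

```python
from typing import Callable, List, Sequence


class RoundRobinSinks:
    """Toy round-robin selector that skips sinks whose delivery fails."""

    def __init__(self, sinks: Sequence[Callable[[str], bool]]) -> None:
        if not sinks:
            raise ValueError("at least one sink is required")
        self._sinks = list(sinks)
        self._next = 0

    def deliver(self, event: str) -> int:
        """Try each sink once, starting with the one whose turn it is.

        Returns the index of the sink that accepted the event, or raises
        RuntimeError so the caller can keep the event buffered and retry.
        """
        for offset in range(len(self._sinks)):
            index = (self._next + offset) % len(self._sinks)
            if self._sinks[index](event):
                self._next = (index + 1) % len(self._sinks)
                return index
        raise RuntimeError("all sinks failed; the event must stay in the channel")


received: List[str] = []


def healthy(event: str) -> bool:
    """Stand-in for a sink that accepts every event."""
    received.append(event)
    return True


def broken(event: str) -> bool:
    """Stand-in for a sink whose downstream is unavailable."""
    return False


group = RoundRobinSinks([healthy, broken, healthy])
picks = [group.deliver(f"e{i}") for i in range(4)]
print(picks, received)
```

The important property is the last line of deliver: when no sink accepts the event, nothing is dropped silently, which is why the choice of a durable channel and load balancing work together.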

http://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html

https://blog.csdn.net/lijinqi1987/article/details/77506034

 

2. Logstash

Logstash began as the data-ingestion member of the ELK stack, and over time it gained support for many more data sources and destinations.

Input interfaces

| Input Plugin | Description |
| --- | --- |
| azure_event_hubs | Receives events from Azure Event Hubs |
| beats | Receives events from the Elastic Beats framework |
| cloudwatch | Pulls events from the Amazon Web Services CloudWatch API |
| couchdb_changes | Streams events from CouchDB’s _changes URI |
| dead_letter_queue | Reads events from Logstash’s dead letter queue |
| elasticsearch | Reads query results from an Elasticsearch cluster |
| exec | Captures the output of a shell command as an event |
| file | Streams events from files |
| ganglia | Reads Ganglia packets over UDP |
| gelf | Reads GELF-format messages from Graylog2 as events |
| generator | Generates random log events for test purposes |
| github | Reads events from a GitHub webhook |
| google_cloud_storage | Extracts events from files in a Google Cloud Storage bucket |
| google_pubsub | Consumes events from a Google Cloud PubSub service |
| graphite | Reads metrics from the graphite tool |
| heartbeat | Generates heartbeat events for testing |
| http | Receives events over HTTP or HTTPS |
| http_poller | Decodes the output of an HTTP API into events |
| imap | Reads mail from an IMAP server |
| irc | Reads events from an IRC server |
| java_generator | Generates synthetic log events |
| java_stdin | Reads events from standard input |
| jdbc | Creates events from JDBC data |
| jms | Reads events from a JMS broker |
| jmx | Retrieves metrics from remote Java applications over JMX |
| kafka | Reads events from a Kafka topic |
| kinesis | Receives events through an AWS Kinesis stream |
| log4j | Reads events over a TCP socket from a Log4j SocketAppender object |
| lumberjack | Receives events using the Lumberjack protocol |
| meetup | Pulls event updates from the Meetup.com API |
| pipe | Streams events from a long-running command pipe |
| puppet_facter | Receives facts from a Puppet server |
| rabbitmq | Pulls events from a RabbitMQ exchange |
| redis | Reads events from a Redis instance |
| relp | Receives RELP events over a TCP socket |
| rss | Polls an RSS or Atom feed and creates events from its entries |
| s3 | Streams events from files in a S3 bucket |
| s3_sns_sqs | Reads logs from AWS S3 buckets using sqs |
| salesforce | Creates events based on a Salesforce SOQL query |
| snmp | Polls network devices using Simple Network Management Protocol (SNMP) |
| snmptrap | Creates events based on SNMP trap messages |
| sqlite | Creates events based on rows in an SQLite database |
| sqs | Pulls events from an Amazon Web Services Simple Queue Service queue |
| stdin | Reads events from standard input |
| stomp | Creates events received with the STOMP protocol |
| syslog | Reads syslog messages as events |
| tcp | Reads events from a TCP socket |
| twitter | Reads events from the Twitter Streaming API |
| udp | Reads events over UDP |
| unix | Reads events over a UNIX socket |
| varnishlog | Reads from the varnish cache shared memory log |
| websocket | Reads events from a websocket |
| wmi | Creates events based on the results of a WMI query |
| xmpp | Receives events over the XMPP/Jabber protocol |

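Besides conventional inputs such as file, stdin, and kafka, Logstash supports a wide assortment of more specialized inputs. This shows that it does its job as the ingestion layer of ELK, but whether it is what we need has to be evaluated carefully.

Output interfaces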

| Output Plugin | Description |
| --- | --- |
| boundary | Sends annotations to Boundary based on Logstash events |
| circonus | Sends annotations to Circonus based on Logstash events |
| cloudwatch | Aggregates and sends metric data to AWS CloudWatch |
| csv | Writes events to disk in a delimited format |
| datadog | Sends events to DataDogHQ based on Logstash events |
| datadog_metrics | Sends metrics to DataDogHQ based on Logstash events |
| elastic_app_search | Sends events to the Elastic App Search solution |
| elasticsearch | Stores logs in Elasticsearch |
| email | Sends email to a specified address when output is received |
| exec | Runs a command for a matching event |
| file | Writes events to files on disk |
| ganglia | Writes metrics to Ganglia’s gmond |
| gelf | Generates GELF formatted output for Graylog2 |
| google_bigquery | Writes events to Google BigQuery |
| google_cloud_storage | Uploads log events to Google Cloud Storage |
| google_pubsub | Uploads log events to Google Cloud Pubsub |
| graphite | Writes metrics to Graphite |
| graphtastic | Sends metric data on Windows |
| http | Sends events to a generic HTTP or HTTPS endpoint |
| influxdb | Writes metrics to InfluxDB |
| irc | Writes events to IRC |
| java_sink | Discards any events received |
| java_stdout | Prints events to the STDOUT of the shell |
| juggernaut | Pushes messages to the Juggernaut websockets server |
| kafka | Writes events to a Kafka topic |
| librato | Sends metrics, annotations, and alerts to Librato based on Logstash events |
| loggly | Ships logs to Loggly |
| lumberjack | Sends events using the lumberjack protocol |
| metriccatcher | Writes metrics to MetricCatcher |
| mongodb | Writes events to MongoDB |
| nagios | Sends passive check results to Nagios |
| nagios_nsca | Sends passive check results to Nagios using the NSCA protocol |
| opentsdb | Writes metrics to OpenTSDB |
| pagerduty | Sends notifications based on preconfigured services and escalation policies |
| pipe | Pipes events to another program’s standard input |
| rabbitmq | Pushes events to a RabbitMQ exchange |
| redis | Sends events to a Redis queue using the RPUSH command |
| redmine | Creates tickets using the Redmine API |
| riak | Writes events to the Riak distributed key/value store |
| riemann | Sends metrics to Riemann |
| s3 | Sends Logstash events to the Amazon Simple Storage Service |
| sns | Sends events to Amazon’s Simple Notification Service |
| solr_http | Stores and indexes logs in Solr |
| sqs | Pushes events to an Amazon Web Services Simple Queue Service queue |
| statsd | Sends metrics using the statsd network daemon |
| stdout | Prints events to the standard output |
| stomp | Writes events using the STOMP protocol |
| syslog | Sends events to a syslog server |
| tcp | Writes events over a TCP socket |
| timber | Sends events to the Timber.io logging service |
| udp | Sends events over UDP |
| webhdfs | Sends Logstash events to HDFS using the webhdfs REST API |
| websocket | Publishes messages to a websocket |
| xmpp | Posts events over XMPP |
| zabbix | Sends events to a Zabbix server |

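As with the inputs, the outputs start with a large set of Elastic's own components. Support for the Hadoop ecosystem is fairly thin, while support for NoSQL stores such as Redis and MongoDB is comparatively better.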

 

Throughput: Logstash handles on the order of a few thousand events per second.

Flexibility: Logstash provides powerful data filtering and preprocessing capabilities.
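To show what a filtering and preprocessing stage does conceptually, here is a small Python sketch that parses a raw log line into named fields and drops unwanted events. It is not Logstash code or its API; the line layout, the field names, and the "_parsefailure" tag are assumptions made for the illustration.

```python
import re
from typing import Dict, FrozenSet, Optional

# Hypothetical line layout: "<timestamp> <LEVEL> <message>".
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_and_filter(
    line: str, drop_levels: FrozenSet[str] = frozenset({"DEBUG"})
) -> Optional[Dict[str, str]]:
    """Parse one raw log line into fields, or return None to drop it."""
    text = line.strip()
    match = LINE_PATTERN.match(text)
    if match is None:
        # Keep unparseable lines, but tag them so they can be inspected later.
        return {"message": text, "tags": "_parsefailure"}
    event = match.groupdict()
    if event["level"] in drop_levels:
        return None  # filtered out before it reaches any output
    return event


print(parse_and_filter("2020-01-01T00:00:00Z ERROR disk full"))
print(parse_and_filter("2020-01-01T00:00:01Z DEBUG cache hit"))
print(parse_and_filter("garbage"))
```

In Logstash the equivalent work is declared in the filter section of the pipeline configuration rather than written as code, which is what makes the preprocessing convenient to change.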

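Resource consumption: Logstash is demanding and needs a relatively large amount of memory.

Operations: Logstash is written in JRuby, and its dependencies and configuration are fairly complex.

Reliability: Logstash runs as a standalone process with no built-in clustering, and its default queue is held in memory, so in extreme cases, such as an abnormal shutdown, data can be lost.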

https://www.elastic.co/guide/en/logstash/current/index.html

https://doc.yonyoucloud.com/doc/logstash-best-practice-cn/index.html

 

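3. Filebeat

Filebeat is another data collection framework from Elastic. Its data processing capabilities are weaker than Logstash's, but that is by design: it is meant to be lightweight, so its resource consumption is very low.

Input interfaces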

| Input type | Comments |
| --- | --- |
| Log | Reads lines from log files. |
| Stdin | Reads events from standard input. |
| Container | Reads container log files. |
| Kafka | Reads from topics in a Kafka cluster. |
| Redis | Reads entries from Redis slowlogs. |
| UDP | Reads events over UDP. |
| Docker | Reads logs from Docker containers. |
| TCP | Reads events over TCP. |
| Syslog | Reads events over TCP or UDP; this input parses BSD (RFC 3164) events and some variants. |
| s3 | Retrieves logs from S3 objects that are pointed to by messages from specific SQS queues. |
| NetFlow | Reads NetFlow and IPFIX exported flows and options records over UDP. |
| Google Pub/Sub | Reads messages from a Google Cloud Pub/Sub topic subscription. |
| Azure eventhub | Reads messages from an Azure Event Hub. |

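Filebeat supports relatively few input types, but the usual ones (files, standard input, Kafka, and so on) are covered, and there is also support for some specific products.

Output interfaces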

| Output type | Comments |
| --- | --- |
| Elasticsearch | Sends the transactions directly to Elasticsearch by using the Elasticsearch HTTP API. |
| Logstash | Sends events directly to Logstash by using the lumberjack protocol, which runs over TCP. Logstash allows for additional processing and routing of generated events. |
| Kafka | Sends the events to Apache Kafka. |
| Redis | Inserts the events into a Redis list or a Redis channel. |
| File | Dumps the transactions into a file where each transaction is in a JSON format. |
| Console | Writes events in JSON format to stdout. |

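These output interfaces show that Filebeat is fairly weak on persistence. Because it originated at Elastic, most of its destinations are components of the Elastic stack, but support for Kafka and Redis still covers a fair range of needs.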

 

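Throughput: Filebeat's throughput is moderate. It is enough for basic needs, on the order of a few thousand events per second; larger volumes need special handling.

Flexibility: Filebeat offers some flexibility, but the Go language is itself a barrier to entry, and Filebeat is not as flexible or as capable at processing as Flume or Logstash.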

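Resource consumption: because Filebeat is written in Go, it normally uses few resources, but when individual events are large it can easily run out of memory (OOM) under the default configuration.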
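A back-of-the-envelope estimate shows why large events are a problem: the memory held by a full in-memory queue grows with the number of queued events multiplied by the event size. The queue length of 4096 used below is an assumption for illustration (the relevant setting is queue.mem.events, and its default differs between versions), and the result is only a lower bound that ignores per-event overhead.

```python
def queue_memory_mib(queue_events: int, avg_event_bytes: int) -> float:
    """Rough lower bound on memory held by a full in-memory event queue, in MiB."""
    return queue_events * avg_event_bytes / (1024 * 1024)


# Assumption for illustration; check queue.mem.events for your version.
ASSUMED_QUEUE_EVENTS = 4096

typical_lines = queue_memory_mib(ASSUMED_QUEUE_EVENTS, 1024)          # ~1 KiB lines
huge_events = queue_memory_mib(ASSUMED_QUEUE_EVENTS, 1024 * 1024)     # ~1 MiB events

print(f"1 KiB events: {typical_lines:.0f} MiB")
print(f"1 MiB events: {huge_events:.0f} MiB")
```

Under these assumptions, ordinary log lines need only a few MiB, while events of about 1 MiB each would need several GiB, so reducing the queue length or capping the event size is the usual way out.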

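Operations: Filebeat is written in Go, so it has no extra dependencies, but its configuration file uses a YAML-like format and there is a lot that can be configured.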
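For a sense of what the configuration looks like, the sketch below renders a minimal filebeat.yml with one log input and a Kafka output. The keys shown (filebeat.inputs, type, paths, output.kafka, hosts, topic) are standard Filebeat settings; the path, broker address, and topic name are placeholder assumptions, and a real deployment would normally set many more options.

```python
from typing import List


def render_filebeat_config(paths: List[str], kafka_hosts: List[str], topic: str) -> str:
    """Render a minimal filebeat.yml: one log input and one Kafka output."""
    lines = ["filebeat.inputs:", "- type: log", "  paths:"]
    # One list entry per file glob to watch.
    lines += [f"    - {path}" for path in paths]
    hosts = ", ".join(f'"{host}"' for host in kafka_hosts)
    lines += ["output.kafka:", f"  hosts: [{hosts}]", f'  topic: "{topic}"']
    return "\n".join(lines) + "\n"


# Placeholder values; adjust to your environment.
print(render_filebeat_config(["/var/log/app/*.log"], ["localhost:9092"], "app-logs"))
```

Only one output can be enabled at a time, so switching from Kafka to another destination means replacing the output section rather than adding a second one.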

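Reliability: I did not find a loss-prevention strategy in Filebeat for the data transfer process itself. Its documentation does describe at-least-once delivery backed by a registry file that records how far each file has been read, but in extreme cases, such as a log file being rotated away or deleted before it has been fully shipped, data can still be lost.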

https://www.elastic.co/guide/en/beats/filebeat/current/index.html
