Interpreting Mainstream Big Data Architectures
Posted Data+Science+Insight
The article "A Comparative Reading of the Data Analysis Capabilities of Five Mainstream Big Data Architectures" gives a detailed summary of how each class of data architecture is applied and how it works. As someone who has spent many years working on data warehouses, I found some of its technical details quite interesting, so I followed the author's train of thought and wrote down my own understanding of the mainstream data architectures (unless otherwise noted, that article is referred to below as "Mainstream Big Data Architectures").
1. The traditional BI architecture as described in the article leans heavily on the Cube implementation.
I find this rather one-sided: the diagram the author uses is a far cry from the data warehousing bible, Dimensional Modeling (the Chinese edition 《维度建模》). Putting data loading and data transformation in the same layer also sits awkwardly with ETL. The ETL steps are Extract, Transform, Load, so presenting them as a single layer fails to convey their ordering and can easily confuse readers who are just beginning a data warehouse build. Transform first, then load, and perhaps a second or third round of transformation afterwards; in practice these steps should be treated as interleaved, iterative passes.
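As a small aside, here is a minimal plain-Python sketch of that ordering, with hypothetical stage names and in-memory lists standing in for tables: extract first, then one or more transform passes, then load.

```python
# Minimal ETL ordering sketch: hypothetical stages, not tied to any particular warehouse.
# The point is the sequence: extract -> transform (possibly several passes) -> load.

def extract(source_rows):
    """Pull raw rows from a source system (here just an in-memory list)."""
    return list(source_rows)

def transform_clean(rows):
    """First transform pass: basic cleansing, e.g. dropping rows without an amount."""
    return [{**r, "amount": float(r["amount"])} for r in rows if r.get("amount") is not None]

def transform_conform(rows):
    """Second transform pass: conform values to warehouse conventions (illustrative rule)."""
    return [{**r, "currency": r.get("currency", "CNY").upper()} for r in rows]

def load(rows, target):
    """Load the transformed rows into the target layer (a list standing in for a table)."""
    target.extend(rows)
    return len(rows)

if __name__ == "__main__":
    source = [{"order_id": 1, "amount": "19.9"}, {"order_id": 2, "amount": None}]
    staging = []
    loaded = load(transform_conform(transform_clean(extract(source))), staging)
    print(f"loaded {loaded} rows into staging: {staging}")
```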
Data warehousing has two broad architectures: the data marts championed by Kimball and the centralized data warehouse advocated by Inmon. Data marts split the data by subject area and flow it back to the individual business departments for information retrieval. A centralized data warehouse instead brings all subject areas together to support broader, cross-subject analysis; before that, an operational data store layer (ODS, Operational Data Store) has already loaded data from the various business systems into a buffer layer using a snowflake model, where the business systems can pick up aggregated information.
The author's original diagram does not really bring out the characteristics of the Kimball and Inmon approaches, so I found another diagram online that is easier to follow. A Cube, in fact, is just one MOLAP (Multidimensional OLAP) implementation and only one part of a data warehouse.
2. Replacing the traditional data warehouse platform with a Hadoop-led big data platform
Briefly, the advantages of a distributed computing platform over a data warehouse built on a commercial database platform:
2.1 Distributed computing: computation is dispatched to the storage nodes closest to the data, making parallel processing possible.
2.2 Distributed storage: a large dataset is split into smaller pieces and spread across different storage nodes, which is the precondition for distributed computing.
2.3 Data routing: after distributed-storage operations such as partitioning and sharding across databases and tables, the structural metadata is recorded and managed for high availability, and a routing function is exposed to applications so that incoming queries are dispatched to the appropriate data nodes for computation (see the sketch below).
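As a rough illustration of that routing idea (the node names and the hash-modulo scheme are my own, not from the original article), a record key is mapped deterministically to the node that holds its data:

```python
# Hash-based routing sketch: map a record key to one of N storage nodes.
# Real systems add replication, rebalancing and metadata services on top of this idea.
import hashlib

NODES = ["node-01", "node-02", "node-03", "node-04"]

def route(key: str, nodes=NODES) -> str:
    """Pick the node responsible for this key (simple hash % N routing)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

if __name__ == "__main__":
    for order_id in ["order-1001", "order-1002", "order-1003"]:
        print(order_id, "->", route(order_id))
```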
None of this is easy to stand up quickly on Oracle, SQL Server, MySQL or PostgreSQL. A cluster of 5-10 nodes may still be manageable, but beyond 100 nodes the management difficulty and cost rise sharply.
I do not think a data warehouse already built on a commercial database platform needs to be torn down and redone on Hadoop; on this point I disagree with the author.
e) Model and compute the remaining related subject areas as needed, and build them into the new distributed system.
Borrowing the author's diagram: we can combine the data warehouse with a Hadoop distributed system to provide result storage plus a search engine, with Sqoop as the transfer channel between the two. The distributed computing power flows its results back, while presentation and analysis can keep connecting to the data warehouse; ad-hoc queries (Ad Hoc Query) that need heavy computation can connect directly to the Hadoop distributed system.
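Sqoop itself is a command-line tool; purely to illustrate the same round trip (pull a warehouse table out to the cluster, push the aggregated result back), here is a hedged PySpark JDBC sketch. The JDBC URL, table names and credentials are placeholders, and it assumes the warehouse's JDBC driver is available to Spark.

```python
# Sketch of the warehouse <-> distributed-system round trip using Spark's JDBC source.
# Connection details are hypothetical; the warehouse JDBC driver must be on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("warehouse-offload").getOrCreate()

# Pull a (hypothetical) fact table out of the commercial warehouse.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://dw-host:5432/dw")
          .option("dbtable", "fact_orders")
          .option("user", "etl_user")
          .option("password", "etl_password")
          .load())

# Heavy aggregation runs on the cluster...
daily = orders.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

# ...and the compact result flows back to the warehouse for BI tools to query.
(daily.write.format("jdbc")
 .option("url", "jdbc:postgresql://dw-host:5432/dw")
 .option("dbtable", "agg_daily_orders")
 .option("user", "etl_user")
 .option("password", "etl_password")
 .mode("overwrite")
 .save())
```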
Building on the traditional big data architecture, the streaming architecture is very aggressive: it simply removes batch processing and handles data as a stream from end to end, so there is no ETL at the ingestion side, only a data channel in its place. Data processed by the stream engine is pushed to consumers directly as messages. There is a storage component, but it stores data mostly in the form of windows, so this storage does not live in the data lake but in peripheral systems.
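To make the "storage as a window" point concrete, here is a small Spark Structured Streaming sketch of my own (the topic, broker and columns are placeholders, and it assumes the Spark Kafka connector package is available): events are read from a stream, aggregated over five-minute windows, and pushed straight out, with no batch layer in sight.

```python
# Windowed stream aggregation sketch: read events from a hypothetical Kafka topic,
# count them per 5-minute window, and emit updates downstream as they arrive.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-window-sketch").getOrCreate()

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())

# The Kafka source exposes key/value as bytes plus a timestamp column.
parsed = events.selectExpr("CAST(value AS STRING) AS value", "timestamp")

counts = (parsed
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("update")
         .format("console")   # stand-in for the real push to downstream consumers
         .start())
query.awaitTermination()
```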
A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
Excerpted from Wikipedia: https://en.wikipedia.org/wiki/Data_lake
In other words, a system that stores data in its raw format, such as blob objects or files, is called a data lake. It holds all of an enterprise's data, not just raw copies of the source system data but also the data transformed for reporting, visual analytics and machine learning. It therefore covers every kind of format: structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
The original author never gives a convincing account of where the messages are stored, and treats messages as something separate from the data lake, which I find questionable. By the Wikipedia definition, all archived data ends up in the data lake, and messages, at the very least, should count as part of it.
The figure above is hard to follow and does not show the three elements of the Lambda Architecture well, so it is worth spelling out Lambda's elements and attaching the generally accepted Lambda architecture diagram.
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce.
Lambda architecture depends on a data model with an append-only, immutable data source that serves as a system of record. It is intended for ingesting and processing timestamped events that are appended to existing events rather than overwriting them. State is determined from the natural time-based ordering of the data.
Lambda is a data-processing architecture that plays to the respective strengths of batch processing and stream processing. It balances latency, throughput and fault tolerance: batch processing produces accurate, deeply aggregated views of the data, while stream processing provides views of the online data. The two outputs can be merged before presentation. The rise of Lambda is closely tied to big data, real-time analytics and the drive to mitigate the latency of map-reduce.
Lambda relies on an append-only, immutable data source. In this model, historical data is stable and unchanging; what changes is always the newest data coming in, and it never overwrites history. The state and attributes of any event or object are derived from the natural time ordering of the data.
So in this architecture we can see both the most recent data and the aggregated historical data.
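A toy sketch of what "state is derived from the time ordering of an append-only log" means in practice (the event fields are hypothetical):

```python
# Append-only, immutable event source: state is never updated in place,
# it is reconstructed by folding over the time-ordered events.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    ts: int        # event timestamp
    account: str
    delta: int     # change in balance; events are only ever appended

def current_state(events):
    """Replay the events in time order to reconstruct the latest balances."""
    balances = {}
    for e in sorted(events, key=lambda e: e.ts):
        balances[e.account] = balances.get(e.account, 0) + e.delta
    return balances

log = [Event(1, "A", 100), Event(2, "A", -30), Event(3, "B", 50)]
print(current_state(log))   # {'A': 70, 'B': 50}
```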
Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. The processing layers ingest from an immutable master copy of the entire data set.
Lambda consists of three layers: the batch layer, the speed layer (also called the real-time layer), and the serving layer that answers queries.
One distinctive point of "Mainstream Big Data Architectures" is that it describes the data sources very broadly, so a Lambda architecture could in principle be built even on traditional commercial databases: the batch side with commercial ETL tools such as SSIS or Informatica, and the stream-processing side with a message broker such as RabbitMQ or ActiveMQ.
Wikipedia, however, sees it differently: only distributed systems of the Hadoop kind really qualify as components of a Lambda architecture, as its definitions of the three elements make clear:
The batch layer precomputes results using a distributed processing system that can handle very large quantities of data. The batch layer aims at perfect accuracy by being able to process all available data when generating views. This means it can fix any errors by recomputing based on the complete data set, then updating existing views. Output is typically stored in a read-only database, with updates completely replacing existing precomputed views.
Apache Hadoop is the de facto standard batch-processing system used in most high-throughput architectures
The batch layer uses the processing power of a distributed system to deliver reliable, accurate views of the data. Once bad data is discovered, the entire dataset can simply be recomputed to obtain fresh results, which completely replace the existing views. Hadoop is regarded as the de facto standard for this kind of high-throughput architecture.
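A minimal PySpark sketch of that behavior, recompute the view from the complete master dataset and overwrite the old one, with the paths and columns being placeholders:

```python
# Batch-layer sketch: rebuild a precomputed view from the immutable master dataset
# and replace the previous view wholesale instead of patching it.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-view-rebuild").getOrCreate()

# Immutable master dataset on the distributed file system (path is illustrative).
events = spark.read.parquet("hdfs:///datalake/events/")

page_views = (events
              .filter(F.col("event_type") == "page_view")
              .groupBy("page_id")
              .count())

# Completely replace the existing precomputed view with the recomputed one.
page_views.write.mode("overwrite").parquet("hdfs:///views/page_view_counts/")
```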
The speed layer processes data streams in real time and without the requirements of fix-ups or completeness. This layer sacrifices throughput as it aims to minimize latency by providing real-time views into the most recent data. Essentially, the speed layer is responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. This layer's views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received, and can be replaced when the batch layer's views for the same data become available.
Stream-processing technologies typically used in this layer include Apache Storm, SQLstream and Apache Spark. Output is typically stored on fast NoSQL databases
The speed layer integrates data into the store by means of real-time computation. Its output is not as complete or as accurate as what the batch layer eventually produces, but it makes up for the batch layer's weakness in timeliness.
The usual tools are Apache Storm, SQLstream and Apache Spark, and the output generally goes to a NoSQL database.
Output from the batch and speed layers are stored in the serving layer, which responds to ad-hoc queries by returning precomputed views or building views from the processed data.
Examples of technologies used in the serving layer include Druid, which provides a single cluster to handle output from both layers. Dedicated stores used in the serving layer include Apache Cassandra, Apache HBase, MongoDB, VoltDB or Elasticsearch for speed-layer output, and Elephant DB, Apache Impala or Apache Hive for batch-layer output.
This layer mainly serves end data consumers' ad-hoc queries and analysis: once the final computed results land here, they can be served to the outside. Results from the speed layer can go to Apache Cassandra, HBase, MongoDB or Elasticsearch; results from the batch layer to Apache Impala or Hive.
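To show what serving means here, a toy sketch of answering a query by merging the batch view with the speed layer's delta; plain dicts stand in for the HBase/Cassandra-style stores mentioned above:

```python
# Serving-layer sketch: merge the precomputed batch view with the speed layer's
# delta for whatever the last batch run has not yet covered.

batch_view = {"page-1": 10_000, "page-2": 4_200}   # as of the last batch run
speed_view = {"page-1": 37, "page-3": 5}           # counts arriving since that run

def query(page_id: str) -> int:
    """Answer an ad-hoc query by combining both views at read time."""
    return batch_view.get(page_id, 0) + speed_view.get(page_id, 0)

for page in ["page-1", "page-2", "page-3"]:
    print(page, query(page))
```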
Elasticsearch provides full-text indexing, and Impala is the sort of project that plays the Cube's analytical role, so the claim in "Mainstream Big Data Architectures" that the Cube is the center of the warehouse is not quite right. As I see it, this should rather be a broad combination of derived data systems: the traditional data warehouse (Dimension/Fact), but also full-text indexing, OLAP applications and so on.
Metamarkets, which provides analytics for companies in the programmatic advertising space, employs a version of the lambda architecture that uses Druid for storing and serving both the streamed and batch-processed data.
For running analytics on its advertising data warehouse, Yahoo has taken a similar approach, also using Apache Storm, Apache Hadoop, and Druid.
The Netflix Suro project has separate processing paths for data, but does not strictly follow lambda architecture since the paths may be intended to serve different purposes and not necessarily to provide the same type of views. Nevertheless, the overall idea is to make selected real-time event data available to queries with very low latency, while the entire data set is also processed via a batch pipeline. The latter is intended for applications that are less sensitive to latency and require a map-reduce type of processing.
Metamarkets, a company providing analytics for programmatic advertising, built a version of the Lambda architecture on the Druid columnar store, which serves both the streamed and the batch-processed data.
Yahoo took a similar approach with the same stack: Storm, Hadoop and Druid.
The Netflix Suro project separates the processing paths for offline and online data; data that is less latency-sensitive still goes through a map-reduce style of processing. It does not strictly follow Lambda, but in essence the idea is the same.
I found a diagram online much like the author's; in my view neither of them draws the Stream and Serving layers, and neither calls out the data flow from messaging to Raw Data Reserved, which is quite puzzling.
All data takes the real-time path; everything is a stream, and the data lake is the final destination of storage. In essence it is still based on Lambda, only with the batch layer removed, leaving the Streaming and Serving layers.
The following is quoted from https://towardsdatascience.com/a-brief-introduction-to-two-data-processing-architectures-lambda-and-kappa-for-big-data-4f35c28005bb
Some variants of social network applications, devices connected to a cloud based monitoring system, Internet of things (IoT) use an optimized version of Lambda architecture which mainly uses the services of speed layer combined with streaming layer to process the data over the data lake.
Kappa architecture can be deployed for those data processing enterprise models where:
Multiple data events or queries are logged in a queue to be catered against a distributed file system storage or history.
The order of the events and queries is not predetermined. Stream processing platforms can interact with database at any time.
It is resilient and highly available as handling Terabytes of storage is required for each node of the system to support replication.
The above mentioned data scenarios are handled by exhausting Apache Kafka which is extremely fast, fault tolerant and horizontally scalable. It allows a better mechanism for governing the data-streams. A balanced control on the stream processors and databases makes it possible for the applications to perform as per expectations. Kafka retains the ordered data for longer durations and caters the analogous queries by linking them to the appropriate position of the retained log. LinkedIn and some other applications use this flavor of big data processing and reap the benefit of retaining large amount of data to cater those queries that are mere replica of each other.
The Internet of Things (IoT) finds this architecture irresistible: for something like running a red light, analysis or an alert issued after the fact is of little use.
Kafka plays the role of distributing data in real time here, and its speed, fault tolerance and horizontal scalability are all excellent.
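A sketch of what reprocessing looks like in this style, assuming the kafka-python client (the text does not name a client library): rebuilding a view simply means pointing a fresh consumer group at the retained log and replaying it from the earliest offset. The topic, broker address and message format below are placeholders.

```python
# Kappa-style reprocessing sketch with kafka-python: replay the retained log
# from the beginning under a new consumer group to rebuild a derived view.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="broker:9092",
    group_id="rebuild-order-totals-v2",   # fresh group => no committed offsets
    auto_offset_reset="earliest",         # start from the oldest retained message
    enable_auto_commit=False,
    consumer_timeout_ms=10_000,           # stop iterating once the log is drained
)

totals = {}
for msg in consumer:
    # Assume each message value is "customer_id,amount" purely for illustration.
    customer_id, amount = msg.value.decode("utf-8").split(",")
    totals[customer_id] = totals.get(customer_id, 0.0) + float(amount)

print(totals)
```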
I have to say, the author of "Mainstream Big Data Architectures" even spells unified as "unifield", which makes one wonder whether he actually knows what unified means.
Lambda comes with a great deal of inherent complexity. To overcome it, architects have gone looking for all kinds of alternatives, but they always come down to these three:
1) Adopt a purely streaming approach and use a flexible framework such as Apache Samza to provide a form of batch processing. Apache Samza relies on Kafka, so its streams are replayable, ordered partitions, which makes batching possible;
2) Go to the opposite extreme and use micro batches to achieve near-real-time processing; Spark works this way, with batch intervals that can go down to about a second.
3) Summingbird, open-sourced by Twitter back in 2013, is a framework that supports batch and real-time processing at the same time: a Scala API wraps all the Batch and Speed Layer operations, so that the Batch Layer runs on Hadoop and the Speed Layer on Storm, all behind the same abstraction.
Lambdoop works on the same principle, a single API wrapping both real-time and batch processing. Unfortunately that project was shut down in September 2017.
The downside of λ is its inherent complexity. Keeping in sync two already complex distributed systems is quite an implementation and maintenance challenge. People have started to look for simpler alternatives that would bring just about the same benefits and handle the full problem set. There are basically three approaches:
1) Adopt a pure streaming approach, and use a flexible framework such as Apache Samza to provide some type of batch processing. Although its distributed streaming layer is pluggable, Samza typically relies on Apache Kafka. Samza’s streams are replayable, ordered partitions. Samza can be configured for batching, i.e. consume several messages from the same stream partition in sequence.
2) Take the opposite approach, and choose a flexible Batch framework that would also allow micro-batches, small enough to be close to real-time, with Apache Spark/Spark Streaming or Storm’s Trident. Spark streaming is essentially a sequence of small batch processes that can reach latency as low as one second. Trident is a high-level abstraction on top of Storm that can process streams as small batches as well as do batch aggregation.
3) Use a technology stack already combining batch and real-time, such as Spring “XD”, Summingbird or Lambdoop. Summingbird (“Streaming MapReduce”) is a hybrid system where both batch/real-time workflows can be run at the same time and the results merged automatically. The Speed layer runs on Storm and the Batch layer on Hadoop. Lambdoop (Lambda-Hadoop, with HBase, Storm and Redis) also combines batch/real-time by offering a single API for both processing paradigms:
The integrated approach (unified λ) seeks to handle Big Data’s Volume and Velocity by featuring a hybrid computation model, where both batch and real-time data processing are combined transparently. And with a unified framework, there would be only one system to learn, and one system to maintain.
To sum up: the Lambda architecture combines a batch layer and a speed layer (real-time processing); the Kappa architecture uses the speed layer (real-time processing) alone to handle both real-time and historical data; and the Unified architecture uses a single API framework that spans the batch layer and the speed layer, so that the results of both layers are also worked on through that same API.
Worth mentioning is the butterfly architecture, which builds on the Unified Architecture idea; see the introduction on the http://ampool.io company website:
https://www.ampool.io/emerging-data-architectures-lambda-kappa-and-butterfly
We note that the primary difficulty in implementing the speed, serving, and batch layers in the same unified architecture is due to the deficiencies of the distributed file system in the Hadoop ecosystem. If a storage component could replace or augment the HDFS to serve the speed and serving layers, while keeping data consistent with HDFS for batch processing, it could truly provide a unified data processing platform. This observation leads to the butterfly architecture.
In the Hadoop ecosystem, the distributed file system lacks consistent support for the speed, batch and serving layers at once, which is why unified storage management on top of Hadoop is difficult.
The main differentiating characteristics of the butterfly architecture is the flexibility in computational paradigms on top of each of the above data abstractions. Thus a multitude of computational engines, such as MPP SQL-engines (Apache Impala, Apache Drill, or Apache HAWQ), MapReduce, Apache Spark, Apache Flink, Apache Hive, or Apache Tez can process various data abstractions, such as datasets, dataframes, and event streams. These computation steps can be strung together to form data pipelines, which are orchestrated by an external scheduler. A resource manager, associated with pluggable resource schedulers that are data aware, are a must for implementing the butterfly architecture. Both Apache YARN, and Apache Mesos, along with orchestration frameworks, such as Kubernetes, or hybrid resource management frameworks, such as Apache Myriad, have emerged in the last few years to fulfill this role.
The biggest highlight of the butterfly architecture is the flexibility of the computational paradigms it allows on top of each of these data abstractions, so that reading and writing go through the same framework. A whole range of compute engines, such as MPP SQL engines (Impala, Drill, HAWQ), MapReduce, Spark, Flink, Hive and Tez, can all process these abstractions: datasets, dataframes and event streams.
Datasets: partitioned collections, possibly distributed over multiple storage backends
DataFrames: structured datasets, similar to tables in RDBMS or NoSQL. Immutable dataframes are suited for analytical workloads, while mutable dataframes are suited for transactional CRUD workloads.
Event Streams: are unbounded dataframes, such as time series or sequences.
Datasets are distributed collections of data; DataFrames are structured datasets, similar to relational tables or NoSQL documents, with mutable DataFrames supporting transactional (OLTP) workloads and immutable DataFrames supporting decision-support workloads; event streams are where datasets originate, sitting downstream of the publisher and of stream processing.
The above draws on the following article; I have only translated it and have not implemented such an architecture myself:
https://hpi.de/fileadmin/user_upload/fachgebiete/plattner/teaching/Ringvorlesung/Master_Nils_Strelow.pdf
The key question is: given that we already have the Kappa architecture, why would we still need a Unified Lambda architecture?
1. Kappa uses real-time processing technology to satisfy both high-speed real-time needs and batch scenarios. In that case, what are Kappa's weaknesses?
1.1 Is Kappa's batch-processing capability inferior to MapReduce under a Lambda architecture?
Under Lambda, MapReduce's strengths are that storage and compute nodes scale out easily, offline processing has a high success rate, every Map/Reduce step has reliable fault tolerance, and recovery after a failure is fast. Does that mean real-time engines such as Storm, Flink or Spark are necessarily slower or less reliable? Not necessarily; it largely comes down to how the cluster is configured and managed.
1.2 Kappa's batch processing may require more expensive hardware than a purely batch-oriented architecture
That is indeed possible: real-time processing based on in-memory computation will certainly consume more resources than the Hadoop Map/Reduce model.
2. The Unified Lambda architecture provides a single, unified API to drive both the Batch and Speed layer operations.
It therefore keeps Kappa's advantage, a single code base, while overcoming Kappa's drawbacks, namely the error-proneness and the higher cost of stream processing.
Unified Lambda (a hybrid architecture) may look like it covers every base, but I still favor Kappa. Just look at the two diagrams Microsoft has published, one for Lambda and one for Kappa: the simplest thing is often the most effective!
Since the Batch Layer can already be subsumed by the Speed Layer, Kappa already has the simplest and most practical core; why go to the trouble of all that extra work?
Reference: An Anatomy of Several Commonly Used Big Data Architectures