Hive, HBase, Hadoop, RDBMS – Who wins?

Posted by 银基富力


Hive and HBase are designed for completely different use cases. So there is no battle!

The beauty of today's complex systems is that there is space for each and every technology. Let me try to explain it with an example:

When you open LinkedIn, you see hundreds of different things: your profile attributes, your friend list, your skills, recommendations of groups for you, friend suggestions, recommendations of companies for you, who viewed your profile, etc.

Now LinkedIn has hundreds of millions of users, and the page loads at lightning speed. Can you think of one technology doing all this at the backend?

Can RDBMS do it all? NO. Fetching so much information from the total dataset of so many users is not something an RDBMS can even come close to. You would be fetching data from so many tables and joining them, and it would take ages for the page to load.

Can Hadoop do it all? NO. Hadoop is known for analytics, and supplying data at this speed is just not possible for Hadoop: it has high latency. Even a simple MapReduce job takes seconds to start before it begins processing data.

Can Hive do it all? NO. Hive is nothing but a lens between data lying in HDFS and the client's eyes, and this lens makes the data lying in HDFS look like RDBMS tables. Behind the scenes, it runs MapReduce jobs. In fact, it can be slower than running MapReduce jobs directly, because Hive first converts every query into a MapReduce job and then launches that job. So Hive has the same high-latency limitation as MapReduce.

Can HBase do it all? NO. HBase doesn't have analytical capabilities, so HBase can't compute the recommendations for you.

If all these popular technologies can't do it, then how is it all working? Well, the answer is that these technologies decided not to fight each other and to work as a team. All of them work together to give users that awesome experience we all enjoy!

To be precise, for the given example: recommendations don't change every minute or every second, so you can precompute the recommendations for all users. You need high throughput while calculating recommendations, but latency is fine there; you only need low latency while serving those precomputed recommendations to users. So the recommendation engine can be either Hive or plain MapReduce. Your profile data keeps changing, so it needs a proper database, but something faster than an RDBMS; HBase plays the role of that database. Every analytical use case can be handled with Hive/MapReduce, and the results of that analysis, along with other information (such as profiles), can be stored in HBase to provide fast random access. Even the RDBMS is not useless: payment gateways still use RDBMSs for their high consistency and availability.
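The precompute-then-serve pattern described above can be sketched in plain Python. This is a toy illustration, not real LinkedIn or Hadoop code: the friends-of-friends logic, the sample graph, and all names are hypothetical.

```python
# Illustrative sketch: a batch job (think Hive/MapReduce) precomputes
# recommendations; a keyed store (think HBase) serves them with low latency.

def batch_compute_recommendations(follow_graph):
    """Batch job: recommend friends-of-friends.
    Throughput matters here; latency does not."""
    recs = {}
    for user, friends in follow_graph.items():
        candidates = set()
        for friend in friends:
            candidates.update(follow_graph.get(friend, set()))
        candidates -= set(friends)   # already friends
        candidates.discard(user)     # don't recommend yourself
        recs[user] = sorted(candidates)
    return recs

# The serving side is a fast keyed lookup, not a multi-table join at page load.
graph = {"alice": {"bob"}, "bob": {"carol", "dave"}, "carol": {"alice"}}
serving_store = batch_compute_recommendations(graph)
print(serving_store["alice"])  # ['carol', 'dave']
```

The point of the split: the expensive graph traversal runs offline on a schedule, while each page load is a single dictionary-style read.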

So technology also teaches us that teamwork wins: "None of us is as smart as all of us."

 

Some Concepts Explained

 

Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

 

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The "Map" function divides a query into multiple parts and processes data at the node level. The "Reduce" function aggregates the results of the "Map" function to determine the "answer" to the query.
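The Map/Reduce split can be mimicked in a few lines of plain Python (no Hadoop involved); this toy word count is purely illustrative.

```python
# Toy word count showing the Map/Reduce division of labor in plain Python.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # "Map": each node emits (word, 1) for every word in its slice of the data.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # "Reduce": group the mapped pairs by key and aggregate to get the answer.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["hive hbase hadoop", "hive hadoop"])))
print(counts)  # {'hadoop': 2, 'hbase': 1, 'hive': 2}
```

In real Hadoop the map tasks run in parallel on many nodes and the framework sorts and shuffles the pairs between phases; the sort-then-group above stands in for that shuffle.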

 

Hive: Hive is a Hadoop-based data-warehousing framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
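As an illustration of what that conversion means, the hypothetical HiveQL query in the comment below could be answered by a MapReduce-style group-and-count like the one that follows; the table and column names are made up.

```python
# A made-up HiveQL query:
#
#   SELECT country, COUNT(*) FROM users GROUP BY country;
#
# Hive would translate this into a MapReduce job; conceptually it is a
# group-and-count, sketched here over an in-memory "table" of rows.
from collections import Counter

users = [{"name": "a", "country": "IN"},
         {"name": "b", "country": "US"},
         {"name": "c", "country": "IN"}]

# map: emit the grouping key per row; reduce: count occurrences per key
by_country = Counter(row["country"] for row in users)
print(dict(by_country))  # {'IN': 2, 'US': 1}
```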

 

Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).
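A pipeline of the kind Pig Latin expresses (LOAD, FILTER, GROUP, aggregate) can be sketched as a chain of plain Python steps; the records and field names below are hypothetical.

```python
# Illustrative Python analogue of a Pig-style pipeline over (name, score) rows.
from collections import defaultdict

records = [("alice", 3), ("bob", 0), ("alice", 5), ("carol", 2)]  # LOAD

filtered = (r for r in records if r[1] > 0)   # FILTER out zero scores
grouped = defaultdict(list)
for name, score in filtered:                  # GROUP BY name
    grouped[name].append(score)
totals = {name: sum(scores) for name, scores in grouped.items()}  # SUM per group
print(totals)  # {'alice': 8, 'carol': 2}
```

Each step feeds the next, which is exactly the shape of a long Pig script: a named relation per transformation rather than one monolithic SQL statement.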

 

HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
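HBase's data model can be sketched as nested maps: row key, then column family, then qualifier, then value. This is a simplified, hypothetical illustration; real HBase also versions cells by timestamp and distributes rows across region servers.

```python
# Simplified sketch of HBase's data model: row key -> {family: {qualifier: value}}.
table = {}

def put(row, family, qualifier, value):
    # Insert or update a single cell.
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    # Low-latency random read: a direct keyed lookup, no scan or join.
    return table[row][family][qualifier]

put("user#42", "profile", "name", "Ada")
put("user#42", "profile", "skill", "Hadoop")
print(get("user#42", "profile", "name"))  # Ada
```

The row-key lookup is why HBase suits the serving side of the LinkedIn example above: reads go straight to one row rather than joining tables.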

 

Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.

 

Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages – such as MapReduce, Pig and Hive – and then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the specified previous jobs on which it relies for data are completed.
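The dependency idea behind Oozie can be sketched with a topological sort: a job runs only after the jobs it depends on have completed. The job names and graph below are made up, and real Oozie workflows are defined in XML, not Python.

```python
# Hypothetical sketch of dependency-ordered job scheduling (requires Python 3.9+).
from graphlib import TopologicalSorter

# job -> set of jobs it depends on
workflow = {
    "clean_logs": set(),
    "pig_sessionize": {"clean_logs"},
    "hive_report": {"pig_sessionize"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['clean_logs', 'pig_sessionize', 'hive_report']
```

A real scheduler would also launch independent jobs in parallel and handle failures; the ordering constraint is the part Oozie's workflow definition captures.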

 


 

Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. Its development is led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.

 

Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the MapReduce model.

 

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

 

HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.

 

Kylin: Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets. It was originally contributed by eBay Inc.

 

Zeppelin: Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more. Zeppelin's interpreter concept allows any language or data-processing backend to be plugged into Zeppelin.

 

Presto: Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across the entire organization.

 

Neo4j: Neo4j is a graph database management system developed by Neo Technology, Inc. An ACID-compliant transactional database with native graph storage and processing, Neo4j is the most popular graph database. Neo4j is implemented in Java and accessible from software written in other languages using the Cypher Query Language through a transactional HTTP endpoint.



