NoSQL, a Big Data Storage Engine: A Minimal Tutorial (An Introduction to Big Data: NoSQL)

Posted 禅与计算机程序设计艺术


Here’s the roadmap for this fourth post, on NoSQL databases:

  • Introduction to NoSQL

  • Document Databases

  • Key-Value Databases

  • Graph Databases

1. Introduction to NoSQL

Both the SQL and NoSQL databases have their use cases across different applications and scenarios, so you may have to choose multiple databases to support your applications.

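NoSQL is an approach to database design that accommodates a variety of data models, including key-value, document, columnar, and graph formats. NoSQL stands for “not only SQL”; it is an alternative to traditional relational databases, in which data is placed in tables and the schema is carefully designed before the database is built. NoSQL databases are especially useful for working with large sets of distributed data.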

Brief History of NoSQL Databases

  • 1998- Carlo Strozzi uses the term NoSQL for his lightweight, open-source relational database

  • 2000- Graph database Neo4j is launched

  • 2004- Google BigTable is launched

  • 2005- CouchDB is launched

  • 2007- The research paper on Amazon Dynamo is released

  • 2008- Facebook open-sources the Cassandra project

  • 2009- The term NoSQL was reintroduced

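The term NoSQL can be applied to some databases that predate relational database management systems (RDBMS), but it more commonly refers to the databases built in the early 2000s for large-scale database clusters in cloud and web applications. In these applications, requirements for performance and scalability outweighed the need for the immediate, rigid data consistency that an RDBMS provides to transactional enterprise applications.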

Before summing up, here is a picture to differentiate very briefly between Relational and Non-Relational databases :

NoSQL helps deal with the volume, variety, and velocity requirements of big data:

  • Volume: Maintaining the ACID properties (Atomicity, Consistency, Isolation, Durability) is expensive and not always necessary. Sometimes, we can deal with minor inconsistencies in our results. We thus want to be able to partition our data across multiple sites.

  • Variety: One single fixed data model makes it harder to incorporate varying data. Sometimes, when we pull from external sources, we don’t know the schema! Furthermore, changing a schema in a relational database can be expensive.

  • Velocity: Storing everything durably on disk all the time can be prohibitively expensive. Sometimes it’s okay if we have a low probability of losing data. Memory is much cheaper now, and much faster than always going to disk.
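The idea of partitioning data across multiple sites, mentioned under Volume, can be sketched with a stable hash. This is only a minimal illustration using the Python standard library; the helper names `shard_for` and `partition` are made up for this example and do not come from any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is the same on every run.
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(records: dict, num_shards: int) -> list:
    """Split one dict of records into num_shards smaller dicts, one per site."""
    shards = [{} for _ in range(num_shards)]
    for key, value in records.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


records = {f"user:{i}": {"visits": i} for i in range(100)}
shards = partition(records, 4)
```

A lookup only needs to contact the shard returned by `shard_for(key, 4)`. Real systems usually refine this with consistent hashing or range partitioning so that adding a site does not move most of the keys.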

There is no single accepted definition of NoSQL, but its main characteristics are as follows:

  • It has quite a flexible schema, unlike the relational model. Different rows may have different attributes or structure. The database often has no understanding of the schema. It is up to the applications to maintain consistency in the schema, including any denormalization.

  • It is often better at handling really big data tasks. This is because NoSQL databases follow the BASE (Basically Available, Soft state, Eventual consistency) approach instead of ACID.

  • In NoSQL, consistency is only guaranteed after writes have stopped for some period of time. This means that queries may not see the latest data. This is commonly implemented by storing data in memory and then lazily sending it to other machines.

  • Finally, there is the CAP theorem: pick 2 out of 3 things, namely Consistency, Availability, and Partition tolerance. ACID databases are usually CP systems, while BASE databases are usually AP. This distinction is blurry, and systems can often be reconfigured to change these tradeoffs.

We’ll discuss different categories of NoSQL, including document databases, key-value databases, and graph databases.

Key-value storage structure design diagram

What is NoSQL?

A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with very large data storage needs. NoSQL is used for big data and real-time web apps. For example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.

NoSQL stands for “Not Only SQL” or “Not SQL.” Though a better term would be “NoREL”, NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL database system, by contrast, encompasses a wide range of database technologies that can store structured, semi-structured, unstructured, and polymorphic data.

Why NoSQL?

The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc. who deal with huge volumes of data. The system response time becomes slow when you use RDBMS for massive volumes of data.

To resolve this problem, we could “scale up” our systems by upgrading our existing hardware. This process is expensive.

The alternative is to distribute the database load across multiple hosts whenever the load increases. This method is known as “scaling out.” Because NoSQL databases are non-relational and were designed with web applications in mind, they scale out better than relational databases.

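2. Document Databases

There are many different document database systems, such as MongoDB, FoundationDB, RethinkDB, MarkLogic, ArangoDB, and others. There is no standard system; however, they all have to deal with a data type called JSON. It is taken from JavaScript and holds objects, strings, numbers, arrays, booleans, and nulls in nested dictionaries.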
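As a small illustration of the JSON value types listed above, the following sketch parses one nested document with Python's standard `json` module and records which Python type each field becomes. The document contents are invented for the example.

```python
import json

# A made-up document that uses every JSON value type once.
text = """
{
  "name": "Ada",
  "age": 36,
  "active": true,
  "nickname": null,
  "tags": ["admin", "dev"],
  "address": {"city": "London", "zip": "N1"}
}
"""

doc = json.loads(text)

# Map each top-level field to the name of the Python type it was parsed into.
types = {field: type(value).__name__ for field, value in doc.items()}
```

Objects become dictionaries and arrays become lists, so nested fields are reached with ordinary indexing, for example `doc["address"]["city"]`.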

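The most popular document database is MongoDB, which is open source and stores data in flexible, JSON-like documents, meaning that fields can vary from document to document and the data structure can change over time. The MongoDB hierarchy starts with the database, then the collection, then the document.

The db.createCollection() method creates a new collection or view. Because MongoDB creates a collection implicitly when it is first referenced in a command, this method is used mainly to create new collections that use specific options. For example, you use db.createCollection() to create a capped collection or a new collection that uses document validation.

MongoDB also uses a document format called BSON, a binary serialization format used to store documents and make remote procedure calls.

Each BSON type has a number and a string alias. You can use these values with the $type operator to query documents by their BSON type. The $type aggregation operator returns the type of an operator expression as one of the listed BSON type strings.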
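To make the idea of querying by BSON type concrete, here is a rough pure-Python approximation. It does not use a MongoDB driver: `bson_alias` and `find_by_type` are hypothetical helpers, only a handful of type aliases are covered, and the rule that integers fitting in 32 bits count as "int" (otherwise "long") is an assumption about how a driver would encode them.

```python
def bson_alias(value) -> str:
    """Return the BSON string alias a Python value would roughly map to."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int" if -2**31 <= value < 2**31 else "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    raise TypeError(f"no alias defined in this sketch for {type(value).__name__}")


def find_by_type(docs, field, alias):
    """Keep the documents whose top-level field has the given type alias."""
    return [d for d in docs if field in d and bson_alias(d[field]) == alias]


docs = [
    {"_id": 1, "zip": "12345"},
    {"_id": 2, "zip": 12345},
    {"_id": 3},
]
string_zips = find_by_type(docs, "zip", "string")
```

Unlike this sketch, the real operator also looks inside array fields and supports many more types (dates, ObjectIds, binary data, and so on).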

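The $unwind aggregation stage deconstructs an array field of the input documents and outputs one document per element. Each output document is the input document with the value of the array field replaced by the element.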
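The "one output document per array element" behaviour can be imitated in a few lines of plain Python. This is a simplified sketch, not driver code: `unwind` is a hypothetical helper that handles only top-level field names, and it assumes the default behaviour in which documents whose field is missing, null, or an empty array produce no output.

```python
def unwind(doc: dict, field: str) -> list:
    """Return one copy of doc per element of doc[field]."""
    if field not in doc or doc[field] is None:
        return []  # missing or null field: no output documents
    values = doc[field]
    if not isinstance(values, list):
        values = [values]  # a non-array value is treated as a one-element array
    # Copy the document and replace the array with a single element each time.
    return [{**doc, field: element} for element in values]


order = {"_id": 1, "item": "ABC", "sizes": ["S", "M", "L"]}
unwound = unwind(order, "sizes")
```

Here `unwound` holds three documents that share `_id` and `item` and differ only in `sizes`.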

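3. Key-Value Databases

A key-value database, or key-value store, is a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash table. Dictionaries contain a collection of objects, or records, which in turn have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record and is used to find the data quickly within the database. Some popular key-value databases in recent use are Redis, Amazon DynamoDB, Aerospike, Riak KV, ArangoDB, and others.

In a key-value database, an update to the value of a single key is usually atomic. In addition, many key-value databases allow transactions that use multiple keys. The values, however, have limited structure.

The advantages of key-value databases are:

  • They are usually easier to run in a distributed fashion.

  • Queries and updates are usually very fast.

  • Any type of data in any structure can be stored as a value.

However, the disadvantages of key-value databases are:

  • Queries are very simple (usually just getting the value for a given key, and sometimes a range).

  • There is no referential integrity.

  • Transactional capabilities are limited.

  • There is no schema for understanding the data.

Let’s take a brief look at the most popular key-value database: Redis. Redis is basically a huge distributed hash table with little structure to its values. All values are identified by a key, which is a simple string. If we want more structure in our keys, it has to be defined by our application (for example, user 3 could have the key “user:3”).

In a typical key-value store, values are just arbitrary blobs of data. Redis (and some other key-value stores) allows some structure: lists, sets, and hashes.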
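The following toy store shows how string keys such as "user:3" can carry application-defined structure, and how a value can be a plain string, a list, a set, or a hash. It is an in-memory sketch only: the `TinyKV` class is hypothetical, and its method names merely echo the corresponding Redis command names rather than reproducing a client library.

```python
class TinyKV:
    """A toy in-memory key-value store with a few structured value types."""

    def __init__(self):
        self._data = {}  # every value is looked up by a plain string key

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def rpush(self, key, *items):
        """Append items to the list stored at key; return the new length."""
        lst = self._data.setdefault(key, [])
        lst.extend(items)
        return len(lst)

    def sadd(self, key, *members):
        """Add members to the set stored at key; return how many were new."""
        members_set = self._data.setdefault(key, set())
        before = len(members_set)
        members_set.update(members)
        return len(members_set) - before

    def hset(self, key, field, value):
        """Set one field of the hash stored at key."""
        self._data.setdefault(key, {})[field] = value

    def hget(self, key, field):
        return self._data.get(key, {}).get(field)


kv = TinyKV()
kv.set("user:3", "alice")                        # plain string value
kv.hset("user:3:profile", "city", "Paris")       # hash
kv.rpush("user:3:visits", "home", "cart")        # list
new_tags = kv.sadd("user:3:tags", "vip", "beta", "vip")  # set, duplicates ignored
```

The store itself knows nothing about users; the "user:3:..." naming convention lives entirely in the application, as described above.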

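So when should you use a key-value database?

  • When you need something really fast.

  • When your data does not have much structure or many relationships.

  • For simple caching of data pulled from another source.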
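The caching use case in the last bullet can be sketched as a small cache with a time-to-live in front of a slower source. This is an illustrative in-process version only; `TTLCache`, `get_or_load`, and `slow_lookup` are hypothetical names, and a real deployment would keep the entries in an external key-value store rather than in a Python dict.

```python
import time


class TTLCache:
    """Cache values by key and reload them once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # cache hit: the slow source is not contacted
        value = loader(key)      # cache miss or expired entry: go to the source
        self._entries[key] = (now + self._ttl, value)
        return value


def slow_lookup(key):
    """Stand-in for a query against another, slower data source."""
    return key.upper()


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_load("profile:3", slow_lookup)
second = cache.get_or_load("profile:3", slow_lookup)  # served from the cache
```

Choosing the time-to-live is a tradeoff: a longer value saves more trips to the source but lets readers see stale data for longer.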

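4. Graph Databases

Finally, let’s talk about graph databases. They rely on unbounded queries, that is, queries whose search criteria are not particularly specific and which may therefore return very large result sets. A query without a WHERE clause certainly falls into this category, but there are other possibilities to consider as well. Neo4j is one of the world’s leading graph databases.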

Features of NoSQL

Non-relational

  • NoSQL databases never follow the relational model

  • Never provide tables with flat fixed-column records

  • Work with self-contained aggregates or BLOBs

  • Doesn’t require object-relational mapping and data normalization

  • No complex features such as query languages, query planners, referential integrity, joins, or ACID

Schema-free

  • NoSQL databases are either schema-free or have relaxed schemas

  • Do not require any sort of definition of the schema of the data

  • Offers heterogeneous structures of data in the same domain

NoSQL is Schema-Free

Simple API

  • Offers easy-to-use interfaces for storing and querying data

  • APIs allow low-level data manipulation and selection methods

  • Text-based protocols are mostly used, typically HTTP REST with JSON

  • Mostly no standards-based NoSQL query language is used

  • Web-enabled databases running as internet-facing services

Distributed

  • Many NoSQL databases can be run in a distributed fashion

  • Offers auto-scaling and fail-over capabilities

  • The ACID concept is often sacrificed for scalability and throughput

  • Mostly no synchronous replication between distributed nodes; instead, asynchronous multi-master replication, peer-to-peer replication, or HDFS replication

  • Provides only eventual consistency

  • Shared Nothing Architecture. This enables less coordination and higher distribution.

NoSQL is Shared Nothing

Types of NoSQL Databases

NoSQL databases are mainly categorized into four types: key-value pair, column-oriented, graph-based, and document-oriented. Every category has its unique attributes and limitations. No single type is best for solving every problem. Users should select the database based on their product needs.

Types of NoSQL Databases:

  1. Key-value Pair Based

  2. Column-oriented

  3. Graph-based

  4. Document-oriented


Key Value Pair Based

Data is stored in key/value pairs. The design is intended to handle large amounts of data and heavy load.

Key-value pair storage databases store data as a hash table where each key is unique, and the value can be JSON, a BLOB (Binary Large Object), a string, etc.

For example, a key-value pair may contain a key like “Website” associated with a value like “Guru99”.

It is one of the most basic types of NoSQL database. This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help the developer to store schema-less data. They work well for shopping cart contents.

Redis, Dynamo, and Riak are some examples of key-value stores. Riak’s design is based on Amazon’s Dynamo paper.

Column-based

Column-oriented databases work on columns and are based on the BigTable paper by Google. Every column is treated separately. The values of a single column are stored contiguously.

Column based NoSQL database

They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN etc. as the data is readily available in a column.
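A rough way to see why aggregations suit a columnar layout is to pivot a few row records into per-column lists: an aggregate then reads one contiguous list instead of touching every field of every row. The data and the `to_columns` helper below are invented for illustration, and the sketch assumes every row has the same fields so that positions line up across columns.

```python
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 25.5},
    {"id": 3, "region": "EU", "amount": 4.5},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Aggregates only need to scan the single column they refer to.
total = sum(columns["amount"])
smallest = min(columns["amount"])
count = len(columns["amount"])
```

In a real column store the per-column values also live together on disk, which is where most of the speedup comes from.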

Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, and library card catalogs.

HBase, Cassandra, and Hypertable are examples of column-based databases.

Document-Oriented:

A document-oriented NoSQL database stores and retrieves data as a key-value pair, but the value part is stored as a document. The document is stored in JSON or XML format. The value is understood by the database and can be queried.

Relational Vs. Document

In this diagram, on the left you can see rows and columns, and on the right a document database whose structure is similar to JSON. For the relational database, you have to know in advance what columns you have. For a document database, however, data is stored as JSON-like objects. You are not required to define the structure up front, which makes it flexible.

The document type is mostly used for CMS systems, blogging platforms, real-time analytics, and e-commerce applications. It should not be used for complex transactions that require multiple operations, or for queries against varying aggregate structures.

Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular document-oriented DBMS systems.

Graph-Based

A graph database stores entities as well as the relations among those entities. Each entity is stored as a node, with the relationships as edges. An edge gives a relationship between nodes. Every node and edge has a unique identifier.

Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast because they are already captured in the database and there is no need to calculate them.
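The node-and-edge model, and the point that relationships are stored rather than computed, can be illustrated with an adjacency map and a breadth-first traversal. The names, the edge list, and the `within_hops` helper are made up for this sketch, and edges are treated as undirected for simplicity.

```python
from collections import deque

# Each pair is one stored relationship (edge) between two nodes.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "erin")]


def build_adjacency(edges):
    """Store each node's neighbours directly, so traversal needs no joins."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def within_hops(adjacency, start, max_hops):
    """Breadth-first search: every node reachable in at most max_hops edges."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


adjacency = build_adjacency(edges)
nearby = within_hops(adjacency, "alice", 2)
```

`nearby` maps each person within two hops of alice to their distance, which is the shape of a typical "friends of friends" query.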

Graph databases are mostly used for social networks, logistics, and spatial data.

Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based databases.

Query Mechanism tools for NoSQL

The most common data retrieval mechanism is the REST-based retrieval of a value based on its key/ID with a GET on the resource.

Document store databases offer more complex queries because they understand the value in a key-value pair. For example, CouchDB allows defining views with MapReduce.
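To give a feel for a MapReduce-style view, here is a simplified Python analogue: a map function emits (key, value) pairs for each document, and a reduce function combines the values collected under each key. This is not CouchDB's actual API; `map_doc`, `reduce_values`, and `build_view` are hypothetical names, and details such as incremental updates and re-reduction are left out.

```python
from collections import defaultdict


def map_doc(doc):
    """Emit one (tag, 1) pair for every tag on the document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(values):
    """Combine all values emitted under one key."""
    return sum(values)


def build_view(docs, map_fn, reduce_fn):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Sort by key so the view has a stable, key-ordered layout.
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


docs = [
    {"_id": "a", "tags": ["nosql", "db"]},
    {"_id": "b", "tags": ["nosql"]},
    {"_id": "c"},
]
tag_counts = build_view(docs, map_doc, reduce_values)
```

Because the map step can read any field inside the document, this kind of query is only possible when the database understands the value, which is the difference from a plain key-value store noted above.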

What is the CAP Theorem?

The CAP theorem is also called Brewer’s theorem. It states that it is impossible for a distributed data store to offer more than two out of the following three guarantees:

  1. Consistency

  2. Availability

  3. Partition Tolerance

Consistency:

The data should remain consistent even after the execution of an operation. This means once data is written, any future read request should contain that data. For example, after updating the order status, all the clients should be able to see the same data.

Availability:

The database should always be available and responsive. It should not have any downtime.

Partition Tolerance:

Partition Tolerance means that the system should continue to function even if the communication among the servers is not stable. For example, the servers can be partitioned into multiple groups which may not communicate with each other. Here, if part of the database is unavailable, other parts are always unaffected.

Eventual Consistency

The term “eventual consistency” refers to keeping copies of data on multiple machines to get high availability and scalability. Thus, changes made to any data item on one machine have to be propagated to the other replicas.

Data replication may not be instantaneous: some copies are updated immediately, while others are updated in due course. These copies may be mutually inconsistent for a while, but in due course they become consistent. Hence the name eventual consistency.
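The delay between a write and its arrival at every replica can be simulated in a few lines. This is a toy model under simple assumptions: the `Replica` and `Cluster` classes are hypothetical, all writes go to the first replica, and "the network" is just a queue that is drained when `propagate` is called.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    """One replica takes writes; the others receive them later."""

    def __init__(self, names):
        self.replicas = [Replica(name) for name in names]
        self.pending = []  # queued (replica, key, value) updates not yet applied

    def write(self, key, value):
        primary, *others = self.replicas
        primary.data[key] = value            # updated immediately
        for replica in others:
            self.pending.append((replica, key, value))  # updated in due course

    def read(self, index, key):
        return self.replicas[index].data.get(key)

    def propagate(self):
        """Deliver all queued updates in order, bringing replicas in sync."""
        for replica, key, value in self.pending:
            replica.data[key] = value
        self.pending.clear()


cluster = Cluster(["r0", "r1", "r2"])
cluster.write("order:1", "shipped")
stale = cluster.read(1, "order:1")    # replica r1 has not seen the write yet
cluster.propagate()
fresh = cluster.read(1, "order:1")    # now every replica agrees
```

Between `write` and `propagate`, a reader that happens to hit a lagging replica gets the old state, which is exactly the window of inconsistency the text describes.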

BASE: Basically Available, Soft state, Eventual consistency

  • Basically available means the database is available all the time, in the sense of the CAP theorem

  • Soft state means that the system state may change even without input

  • Eventual consistency means that the system will become consistent over time

Advantages of NoSQL

  • Can be used as Primary or Analytic Data Source

  • Big Data Capability

  • No Single Point of Failure

  • Easy Replication

  • No Need for Separate Caching Layer

  • It provides fast performance and horizontal scalability.

  • Can handle structured, semi-structured, and unstructured data with equal effect

  • Object-oriented programming which is easy to use and flexible

  • NoSQL databases don’t need a dedicated high-performance server

  • Support Key Developer Languages and Platforms

  • Simpler to implement than an RDBMS

  • It can serve as the primary data source for online applications.

  • Handles big data which manages data velocity, variety, volume, and complexity

  • Excels at distributed database and multi-data center operations

  • Eliminates the need for a specific caching layer to store data

  • Offers a flexible schema design which can easily be altered without downtime or service disruption

Disadvantages of NoSQL

  • No standardization rules

  • Limited query capabilities

  • RDBMS databases and tools are comparatively mature

  • It does not offer any traditional database capabilities, like consistency when multiple transactions are performed simultaneously.

  • As the volume of data increases, it becomes difficult to maintain unique keys

  • Doesn’t work as well with relational data

  • The learning curve is steep for new developers

  • Many options are open source, which makes them less popular with enterprises

Summary

  • NoSQL is a non-relational DMS, that does not require a fixed schema, avoids joins, and is easy to scale

  • The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc. who deal with huge volumes of data

  • In 1998, Carlo Strozzi used the term NoSQL for his lightweight, open-source relational database

  • NoSQL databases never follow the relational model; they are either schema-free or have relaxed schemas

  • The four types of NoSQL database are 1) key-value pair based, 2) column-oriented, 3) graph-based, and 4) document-oriented

  • NoSQL can handle structured, semi-structured, and unstructured data with equal effect

  • CAP theorem consists of three words Consistency, Availability, and Partition Tolerance

  • BASE stands for Basically Available, Soft state, Eventual consistency

  • The term “eventual consistency” means to have copies of data on multiple machines to get high availability and scalability

  • NoSQL offers limited query capabilities

NoSQL Categories

NoSQL databases offer many benefits over relational databases. NoSQL databases have flexible data models, scale horizontally, have incredibly fast queries, and are easy for developers to work with.

Over time, four major types of NoSQL databases emerged: document databases, key-value databases, wide-column stores, and graph databases. Let’s examine each type.

  • Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains pairs of fields and values. The values can typically be a variety of types, including strings, numbers, Booleans, arrays, or objects. Their structures typically align with the objects developers are working with in code. Because of their variety of field value types and powerful query languages, document databases are great for a wide variety of use cases and can be used as a general-purpose database. They can scale out horizontally to accommodate large data volumes. MongoDB and Cloud Firestore are popular document databases.

  • Key-value databases are a simpler type of database where each item contains keys and values. A value can typically only be retrieved by referencing its key, so learning how to query for a specific key-value pair is typically simple. Key-value databases are great for use cases where you need to store substantial amounts of data, but you don’t need to perform complex queries to retrieve it. Common use cases include storing user preferences or caching. Redis and DynamoDB are popular key-value databases.

  • Wide-column stores store data in tables, rows, and dynamic columns. They provide a lot of flexibility over relational databases because each row is not required to have the same columns. Many consider wide-column stores to be 2-D key-value databases. Wide-column stores are great when you need to store enormous amounts of data and you can predict what your query patterns will be. They are commonly used for storing Internet of Things data and user profile data. Cassandra and HBase are two of the most popular wide-column stores.

  • Graph databases store data in nodes and edges. Nodes typically store information about people, places, and things, while edges store information about the relationships between the nodes. They work well in scenarios where you need to find patterns in data, such as social networks, fraud detection, and recommendation engines. Neo4j and JanusGraph are examples of graph databases.


9 NoSQL Databases in 2022

NoSQL databases have different database models compared to their RDBMS (Relational Database Management Systems) counterparts. These systems can be divided into four distinct groups.

  1. Key/value based: These databases work by matching keys with specific values, similar to a map or dictionary. They are efficient, extremely performant, and easily scalable.

  2. Column based: These databases work by creating collections of one or more key/value pairs that match a specific record. They are also referred to as extensible record stores, wide columnar stores, or column oriented stores.

  3. Document based: Key value pairs are encapsulated in JSON or JSON like documents. The keys within each document have to be unique. Unlike key based, the values are not opaque to the system and can be queried.

  4. Graph based: These databases are specialized in efficient management of heavily linked data.

Column-Family

Apache HBase

Known for running on top of HDFS (Hadoop Distributed File System), Apache HBase is secure, scalable, and distributed, and offers high availability. HBase is capable of handling large data tables containing millions of columns and billions of rows while utilizing CPU, memory, and storage resources across multiple servers within a cluster. Hadoop’s MapReduce structure is ideal for complex computational jobs or queries that are farmed out to every node.

Cassandra

The Apache Cassandra project emerged out of Facebook in 2008 and has now become a fully mature database tool used for most large data stores. It offers high availability, fault-tolerance, and scalability on cloud infrastructure, virtual systems, or hardware. Cassandra’s mechanism provides a hybrid mixture of a key/value store with a column-oriented database. With log-structured updates, column indexing, materialized and denormalized views, and built in caching, Cassandra has become the ideal tool for large scale organizations that need to store data too large to fit on a server.

Hypertable

Hypertable is modeled after Google’s Bigtable; it uses block and key-prefix data compression and has a flattened-out table structure. Aside from the fact that data is represented in tables of columns and rows, Hypertable has little resemblance to a traditional RDBMS. Notable features include “realtime” scaling, cell versioning, namespaces, and column qualifiers. Hypertable can be used as an alternative to HBase or Accumulo.

Other NoSQL databases in the column family include: Accumulo, Amazon SimpleDB, Clouddata, Cloudera, HPCC, Apache Flink (formerly referred to as Stratosphere), and Splice Machine.

Document Store

CouchDB

CouchDB is built specifically for web application database needs; it completely lacks a pre-defined schema or data structure. Data arrives in JavaScript’s JSON format, its queries are written in JavaScript, and the data goes back out as JSON. CouchDB supports both mobile and web applications (CouchDB can be used offline in the background of mobile apps). Using JavaScript for description, CouchDB aggregates, joins, and reports on database documents without affecting the underlying structure of the documents. It is ideal for accumulating, occasionally changing data on which pre-defined queries are to be run.

MongoDB

MongoDB is an open source document database written in C++. It has all the traditional features that define NoSQL: JavaScript formatting, value/key storage, and flexible replication for sharding. All data is written based on a philosophy MongoDB refers to as multi-version concurrency control; this is a structure where older versions of the data are kept around to help maintain consistency in complex transactions. A major advantage of MongoDB is the embedded arrays and documents, which reduce the need for expensive joins. On top of that, its dynamic schema supports articulate polymorphism and documents correspond to native data types in most programming languages.

Other NoSQL databases in the Document Store category include: Elasticsearch, Couchbase Server, RethinkDB, RavenDB, MarkLogic Server, NeDB, Terrastore, JasDB, RaptorDB, djondb, EJDB, Amisa Server, densodb, SisoDB, and ThruDB.

Graph Based

Neo4j

Unlike other NoSQL databases, which store flexible bundles of keys and values, Neo4j stores the relationships between objects, a structure commonly referred to as a “graph” by mathematicians. Neo4j includes several algorithms for analyzing and searching the relationships, enabling users to search efficiently based on different relationships. The use of “graph traversal” algorithms eliminates the trouble of chasing pointers. Neo4j is ideally used for interconnected, rich, or complex graph-style data.

Other graph-based NoSQL databases include: OrientDB, FlockDB, Infinite Graph, DEX, TITAN, InfoGrid, HyperGraphDB, GraphBase, and Trinity.

Key-Value Based

Redis

Redis is an in-memory, networked, key-value data store NoSQL database written in ANSI C. Its key features include: improved performance through in-memory storage, master-slave replication, and dictionary data model key-mapped to values. Redis also provides alpha stage clustering in PaaS and IaaS platforms. It can also be used as a managed service without launching the VM instance of the database.

Riak

Riak can be viewed as both a distributed database and a cloud storage solution. It is geared towards offering cloud storage at any scale in both public and private clouds, providing eventual consistency for data stored on a collection of nodes that can grow whenever demand rises. Map/reduce queries in Riak can be written in either Erlang or JavaScript. Data stored in Riak is private by default; however, data visibility can be refined further using Access Control Lists.

Other key-value based NoSQL databases include: DynamoDB, LevelDB, Aerospike, FoundationDB, Berkeley DB, Oracle NoSQL Database, GenieDB, BangDB, and Scalaris.

Conclusion

NoSQL databases are progressively becoming a key component of the database landscape, especially as more organizations realize that operating at scale is better achieved on clusters of standard, commodity servers, and that a schema-less data model is better suited to the type and variety of data captured and processed today. When used well, NoSQL databases can provide several benefits; however, enterprises should make sure they are fully aware of the legitimate issues and limitations associated with NoSQL databases before adopting them.
