SQL学习指南:SQL和大数据

Posted 人邮异步社区

tags:

篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了SQL学习指南:SQL和大数据相关的知识,希望对你有一定的参考价值。

由于SQL被数百万的人使用并且被集成到数千的应用程序中,因此利用SQL来处理这些数据是有意义的。在过去的几年里,涌现出了一批新的工具以支持SQL访问结构化、半结构化和非结构化的数据,这些工具包括Presto、Apache Drill和Toad Data Point等。本章探讨了其中之一的Apache Drill,用于演示如何将存储在不同服务器上格式各异的数据汇集起来用于报告和分析。

18.1 Apache Drill简介

目前已经开发出了各种工具和接口以供SQL访问存储在Hadoop、NoSQL、Spark以及基于云端的分布式文件系统中的数据。比如Hive和Spark SQL,前者是允许用户查询存储在Hadoop中数据的初期尝试之一,而后者是一个库,用于查询以各种格式存储在Spark中的数据。Apache Drill是一款相对较新的开源工具,初现于2015年,具有下列引人瞩目的特性:

  • 促进跨多种数据格式的查询,包括分隔数据、JSON、Parquet和日志文件;
  • 连接到关系型数据库、Hadoop、NoSQL、HBase、Kafka,以及PCAP、BlockChain等专用数据格式;
  • 允许创建自定义插件,以连接到大多数其他数据存储;
  • 不需要前期的模式定义;
  • 支持SQL:2003标准;
  • 可与Tableau、Apache Superset等流行的商业智能(business intelligence,BI)工具配合使用。

通过使用Drill,可以连接到任意数量的数据源进行查询,无须先设置元数据仓库。

18.2 使用Drill查询文件

我们从使用Drill来查询文件中的数据开始。Drill知道如何读取多种不同的文件格式,包括抓包(PCAP)文件,这是一种包含网络中包信息的二进制格式文件。如果想查询PCAP文件,所要做的就是配置Drill的分布式文件系统插件dfs,以纳入包含PCAP文件目录的路径,接下来就可以编写查询了。

要做的第一件事是找出待查询文件中包含哪些列。Drill对information_schema(参见第15章)提供了部分支持,可以在工作区中通过其查看数据文件相关的高层级信息:

apache drill> SELECT file_name, is_directory, is_file, permission
. . . . . . > FROM information_schema.`files`
. . . . . . > WHERE schema_name = 'dfs.data';
+-------------------+--------------+---------+------------+
|         file_name | is_directory | is_file | permission |
+-------------------+--------------+---------+------------+
| attack-trace.pcap | false        | true     | rwxrwx--- |
+-------------------+--------------+---------+------------+
1 row selected (0.238 seconds)

结果显示,在数据工作区中有一个名为attack-trace.pcap的文件,该信息很有用,但无法查询 information_schema.columns列,以找出文件中有哪些列。但是,对该文件执行不返回任何结果的查询会显示出可用列[1]

apache drill> SELECT * FROM dfs.data.`attack-trace.pcap`
. . . . . . > WHERE 1=2;
+------+---------+-----------+-----------------+--------+--------+
| type | network | timestamp | timestamp_micro | src_ip | dst_ip |
+------+---------+-----------+-----------------+--------+--------+
    ----------+----------+-----------------+-----------------+-------------+
     src_port | dst_port | src_mac_address | dst_mac_address | tcp_session |
    ----------+----------+-----------------+-----------------+-------------+
    ---------+-----------+--------------+---------------+----------------+
     tcp_ack | tcp_flags | tcp_flags_ns | tcp_flags_cwr |  tcp_flags_ece |
    ---------+-----------+--------------+---------------+----------------+
    ---------------------------+--------------------------------------+
     tcp_flags_ece_ecn_capable | tcp_flags_ece_congestion_experienced |
    ---------------------------+--------------------------------------+
    ---------------+---------------+---------------+---------------+
     tcp_flags_urg | tcp_flags_ack | tcp_flags_psh | tcp_flags_rst |
    ---------------+---------------+---------------+---------------+
    ---------------+---------------+------------------+---------------+
     tcp_flags_syn | tcp_flags_fin | tcp_parsed_flags | packet_length |
    ---------------+---------------+------------------+---------------+
    ------------+------+
     is_corrupt | data |
    ------------+------+

No rows selected (0.285 seconds)

现在已经知道了PCAP文件中的列名,可以编写查询了。下列查询统计从每个IP地址向各个目标端口发送的包数量:

apache drill> SELECT src_ip, dst_port,
. . . . . . > count(*) AS packet_count
. . . . . . > FROM dfs.data.`attack-trace.pcap`
. . . . . . > GROUP BY src_ip, dst_port;
+----------------+----------+--------------+
|     src_ip     | dst_port | packet_count |
+----------------+----------+--------------+
| 98.114.205.102 | 445      | 18           |
| 192.150.11.111 | 1821     | 3            |
| 192.150.11.111 | 1828     | 17           |
| 98.114.205.102 | 1957     | 6            |
| 192.150.11.111 | 1924     | 6            |
| 192.150.11.111 | 8884     | 15           |
| 98.114.205.102 | 36296    | 12           |
| 98.114.205.102 | 1080     | 159          |
| 192.150.11.111 | 2152     | 112          |
+----------------+----------+--------------+
9 rows selected (0.254 seconds)

下列查询聚合了每秒包信息:

apache drill> SELECT trunc(extract(second from `timestamp`)) as packet_time,
. . . . . . > count(*) AS num_packets,
. . . . . . > sum(packet_length) AS tot_volume
. . . . . . > FROM dfs.data.`attack-trace.pcap`
. . . . . . > GROUP BY trunc(extract(second from `timestamp`));
+-------------+-------------+------------+
| packet_time | num_packets | tot_volume |
+-------------+-------------+------------+
| 28.0        | 15          | 1260       |
| 29.0        | 12          | 1809       |
| 30.0        | 13          | 4292       |
| 31.0        | 3           | 286        |
| 32.0        | 2           | 118        |
| 33.0        | 15          | 1054       |
| 34.0        | 35          | 14446      |
| 35.0        | 29          | 16926      |
| 36.0        | 25          | 16710      |
| 37.0        | 25          | 16710      |
| 38.0        | 26          | 17788      |
| 39.0        | 23          | 15578      |
| 40.0        | 25          | 16710      |
| 41.0        | 23          | 15578      |
| 42.0        | 30          | 20052      |
| 43.0        | 25          | 16710      |
| 44.0        | 22          | 7484       |
+-------------+-------------+------------+
17 rows selected (0.422 seconds)

在该查询中,需要在保留字timestamp两边加上反引号(`)。

无论文件存储在本地、网络、分布式文件系统或云端,都可以进行查询。Drill内建了多种文件类型支持,还可以自行编写插件,以允许Drill查询其他类型的文件。接下来将探讨查询存储在数据库中的数据。

18.3 使用Drill查询mysql

Drill可以通过JDBC驱动程序连接到任何关系型数据库,因此下一步就是展示Drill如何查询用于本书示例的Sakila样本数据库。你要做的就是为MySQL加载JDBC驱动程序并配置Drill连接到 MySQL数据库。 

这里你可能会好奇,“我为什么要用Drill来查询MySQL呢?” 一个原因是(正如你将在本章结尾看到的)可以使用Drill编写查询,组合不同来源的数据,例如,你可能会编写一个查询,将来自MySQL、Hadoop和以逗号分隔的文件的数据连接起来。

第一步是选择一个数据库:

apache drill (information_schema)> use mysql.sakila;
+------+------------------------------------------+
| ok   |                 summary                  |
+------+------------------------------------------+
| true | Default schema changed to [mysql.sakila] |
+------+------------------------------------------+
1 row selected (0.062 seconds)

选择好数据库之后,可以使用show tables命令查看在所选模式中所有可用的数据表:

apache drill (mysql.sakila)> show tables;
+--------------+----------------------------+
| TABLE_SCHEMA |         TABLE_NAME         |
+--------------+----------------------------+
| mysql.sakila | actor                      |
| mysql.sakila | address                    |
| mysql.sakila | category                   |
| mysql.sakila | city                       |
| mysql.sakila | country                    |
| mysql.sakila | customer                   |
| mysql.sakila | film                       |
| mysql.sakila | film_actor                 |
| mysql.sakila | film_category              |
| mysql.sakila | film_text                  |
| mysql.sakila | inventory                  |
| mysql.sakila | language                   |
| mysql.sakila | payment                    |
| mysql.sakila | rental                     |
| mysql.sakila | sales                      |
| mysql.sakila | staff                      |
| mysql.sakila | store                      |
| mysql.sakila | actor_info                 |
| mysql.sakila | customer_list              |
| mysql.sakila | film_list                  |
| mysql.sakila | nicer_but_slower_film_list |
| mysql.sakila | sales_by_film_category     |
| mysql.sakila | sales_by_store             |
| mysql.sakila | staff_list                 |
+--------------+----------------------------+
24 rows selected (0.147 seconds)

先来执行几个之前章节演示过的查询。下面这个简单的双表连接来自第5章:

apache drill (mysql.sakila)> SELECT a.address_id, a.address, ct.city
. . . . . . . . . . . . . )> FROM address a
. . . . . . . . . . . . . )>   INNER JOIN city ct
. . . . . . . . . . . . . )>   ON a.city_id = ct.city_id
. . . . . . . . . . . . . )> WHERE a.district = 'California';
+------------+------------------------+----------------+
| address_id |        address         |      city      |
+------------+------------------------+----------------+
| 6          | 1121 Loja Avenue       | San Bernardino |
| 18         | 770 Bydgoszcz Avenue   | Citrus Heights |
| 55         | 1135 Izumisano Parkway | Fontana        |
| 116        | 793 Cam Ranh Avenue    | Lancaster      |
| 186        | 533 al-Ayn Boulevard   | Compton        |
| 218        | 226 Brest Manor        | Sunnyvale      |
| 274        | 920 Kumbakonam Loop    | Salinas        |
| 425        | 1866 al-Qatif Avenue   | El Monte       |
| 599        | 1895 Zhezqazghan Drive | Garden Grove   |
+------------+------------------------+----------------+
9 rows selected (3.523 seconds)

下一个查询来自第8章,包含group by子句和having子句:

apache drill (mysql.sakila)> SELECT fa.actor_id, f.rating,
. . . . . . . . . . . . . )>   count(*) num_films
. . . . . . . . . . . . . )> FROM film_actor fa
. . . . . . . . . . . . . )>   INNER JOIN film f
. . . . . . . . . . . . . )>   ON fa.film_id = f.film_id
. . . . . . . . . . . . . )> WHERE f.rating IN ('G','PG')
. . . . . . . . . . . . . )> GROUP BY fa.actor_id, f.rating
. . . . . . . . . . . . . )> HAVING count(*) > 9;
+----------+--------+-----------+
| actor_id | rating | num_films |
+----------+--------+-----------+
| 137      | PG     | 10        |
| 37       | PG     | 12        |
| 180      | PG     | 12        |
| 7        | G      | 10        |
| 83       | G      | 14        |
| 129      | G      | 12        |
| 111      | PG     | 15        |
| 44       | PG     | 12        |
| 26       | PG     | 11        |
| 92       | PG     | 12        |
| 17       | G      | 12        |
| 158      | PG     | 10        |
| 147      | PG     | 10        |
| 14       | G      | 10        |
| 102      | PG     | 11        |
| 133      | PG     | 10        |
+----------+--------+-----------+
16 rows selected (0.277 seconds)

最后这个查询来自第16章,包含3个不同的排名函数:

apache drill (mysql.sakila)> SELECT customer_id, count(*) num_rentals,
. . . . . . . . . . . . . )>   row_number()
. . . . . . . . . . . . . )>     over (order by count(*) desc)
. . . . . . . . . . . . . )>       row_number_rnk,
. . . . . . . . . . . . . )>   rank()
. . . . . . . . . . . . . )>     over (order by count(*) desc) rank_rnk,
. . . . . . . . . . . . . )>   dense_rank()
. . . . . . . . . . . . . )>     over (order by count(*) desc)
. . . . . . . . . . . . . )>       dense_rank_rnk
. . . . . . . . . . . . . )> FROM rental
. . . . . . . . . . . . . )> GROUP BY customer_id
. . . . . . . . . . . . . )> ORDER BY 2 desc;
+-------------+-------------+----------------+----------+----------------+
| customer_id | num_rentals | row_number_rnk | rank_rnk | dense_rank_rnk |
+-------------+-------------+----------------+----------+----------------+
| 148         | 46          | 1              | 1        | 1              |
| 526         | 45          | 2              | 2        | 2              |
| 144         | 42          | 3              | 3        | 3              |
| 236         | 42          | 4              | 3        | 3              |
| 75          | 41          | 5              | 5        | 4              |
| 197         | 40          | 6              | 6        | 5              |
...
| 248         | 15          | 595            | 594      | 30             |
| 61          | 14          | 596            | 596      | 31             |
| 110         | 14          | 597            | 596      | 31             |
| 281         | 14          | 598            | 596      | 31             |
| 318         | 12          | 599            | 599      | 32             |
+-------------+-------------+----------------+----------+----------------+
599 rows selected (1.827 seconds)

这些示例演示了Drill执行复杂的MySQL查询的能力,但别忘了,Drill还能处理MySQL之外的很多关系型数据库,所以有些语言特性可能会有所差异(比如数据转换函数)。

18.4 使用Drill查询MongoDB

在使用Drill查询MySQL中的Sakila样本数据之后,下一步是将Sakila数据转换成另一种常用的格式并存储在非关系型数据库中,然后使用Drill查询数据。我决定将数据转换为JSON格式并存储在MongoDB中,MongoDB是用于文档存储的一种比较流行的NoSQL平台。Drill包含MongoDB插件,也知道如何读取JSON文档,所以将JSON文件加载到Mongo并编写查询还是比较容易的。

在编写查询之前,先来看看JSON文件的结构,毕竟JSON文件不是规范化形式。films.json是两个JSON文件中的第一个:

"_id":1,
 "Actors":[
   "First name":"PENELOPE","Last name":"GUINESS","actorId":1,
   "First name":"CHRISTIAN","Last name":"GABLE","actorId":10,
   "First name":"LUCILLE","Last name":"TRACY","actorId":20,
   "First name":"SANDRA","Last name":"PECK","actorId":30,
   "First name":"JOHNNY","Last name":"CAGE","actorId":40,
   "First name":"MENA","Last name":"TEMPLE","actorId":53,
   "First name":"WARREN","Last name":"NOLTE","actorId":108,
   "First name":"OPRAH","Last name":"KILMER","actorId":162,
   "First name":"ROCK","Last name":"DUKAKIS","actorId":188,
   "First name":"MARY","Last name":"KEITEL","actorId":198],
 "Category":"Documentary",
 "Description":"A Epic Drama of a Feminist And a Mad Scientist
    who must Battle a Teacher in The Canadian Rockies",
 "Length":"86",
 "Rating":"PG",
 "Rental Duration":"6",
 "Replacement Cost":"20.99",
 "Special Features":"Deleted Scenes,Behind the Scenes",
 "Title":"ACADEMY DINOSAUR",
"_id":2,
 "Actors":[
   "First name":"BOB","Last name":"FAWCETT","actorId":19,
   "First name":"MINNIE","Last name":"ZELLWEGER","actorId":85,
   "First name":"SEAN","Last name":"GUINESS","actorId":90,
   "First name":"CHRIS","Last name":"DEPP","actorId":160],
 "Category":"Horror",
 "Description":"A Astounding Epistle of a Database Administrator
    And a Explorer who must Find a Car in Ancient China",
 "Length":"48",
 "Rating":"G",
 "Rental Duration":"3",
 "Replacement Cost":"12.99",
 "Special Features":"Trailers,Deleted Scenes",
 "Title":"ACE GOLDFINGER",
...
"_id":999,
 "Actors":[
   "First name":"CARMEN","Last name":"HUNT","actorId":52,
   "First name":"MARY","Last name":"TANDY","actorId":66,
   "First name":"PENELOPE","Last name":"CRONYN","actorId":104,
   "First name":"WHOOPI","Last name":"HURT","actorId":140,
   "First name":"JADA","Last name":"RYDER","actorId":142],
 "Category":"Children",
 "Description":"A Fateful Reflection of a Waitress And a Boat
    who must Discover a Sumo Wrestler in Ancient China",
 "Length":"101",
 "Rating":"R",
 "Rental Duration":"5",
 "Replacement Cost":"28.99",
 "Special Features":"Trailers,Deleted Scenes",
 "Title":"ZOOLANDER FICTION"
"_id":1000,
 "Actors":[
   "First name":"IAN","Last name":"TANDY","actorId":155,
   "First name":"NICK","Last name":"DEGENERES","actorId":166,
   "First name":"LISA","Last name":"MONROE","actorId":178],
 "Category":"Comedy",
 "Description":"A Intrepid Panorama of a Mad Scientist And a Boy
    who must Redeem a Boy in A Monastery",
 "Length":"50",
 "Rating":"NC-17",
 "Rental Duration":"3",
 "Replacement Cost":"18.99",
 "Special Features":
 "Trailers,Commentaries,Behind the Scenes",
 "Title":"ZORRO ARK"

该集合中有1,000个文档,每个文档中包含若干标量属性(Title、Rating、_id),另外还有一个名为Actors的列表,其中包含1到N个元素,由电影中所有参演演员的演员ID、名字和姓氏组成。因此,该文件包含了MySQL Sakila数据库的actor、film和film_actor数据表中的所有数据。

第二个文件是customer.json,其中的数据来自MySQL Sakila数据库的数据表customer、address、city、country和payment:

"_id":1,
 "Address":"1913 Hanoi Way",
 "City":"Sasebo",
 "Country":"Japan",
 "District":"Nagasaki",
 "First Name":"MARY",
 "Last Name":"SMITH",
 "Phone":"28303384290",
 "Rentals":[
   "rentalId":1185,
    "filmId":611,
    "staffId":2,
    "Film Title":"MUSKETEERS WAIT",
    "Payments":[
      "Payment Id":3,"Amount":5.99,"Payment Date":"2005-06-15 00:54:12"],
    "Rental Date":"2005-06-15 00:54:12.0",
    "Return Date":"2005-06-23 02:42:12.0",
   "rentalId":1476,
    "filmId":308,
    "staffId":1,
    "Film Title":"FERRIS MOTHER",
    "Payments":[
      "Payment Id":5,"Amount":9.99,"Payment Date":"2005-06-15 21:08:46"],
    "Rental Date":"2005-06-15 21:08:46.0",
    "Return Date":"2005-06-25 02:26:46.0",
...
   "rentalId":14825,
    "filmId":317,
    "staffId":2,
    "Film Title":"FIREBALL PHILADELPHIA",
    "Payments":[
      "Payment Id":30,"Amount":1.99,"Payment Date":"2005-08-22 01:27:57"],
    "Rental Date":"2005-08-22 01:27:57.0",
    "Return Date":"2005-08-27 07:01:57.0"
   ]

该文件包含599个条目(这里只显示了其中的一个),它们作为customers集合中的599个文档被加载到MongoDB。每个文档包含单个客户的信息,以及该客户所有的租借和相关付款信息。此外,文档中还有嵌套列表,因为Rentals列表的租借信息中也包含一个Payments列表。

JSON文件被加载之后,MongoDB包含两个集合(films和customers),这些集合中的数据来自MySQL Sakila数据库的9个不同的数据表。这是一个相当典型的场景,因为应用程序的程序员通常与集合打交道,一般不喜欢将他们的数据解构以存储到规范化的关系数据表中。从SQL的角度来看,挑战在于决定如何将这些数据扁平化,使其表现得就像存储在多个数据表中一样。

作为演示,我们来编写针对films集合的查询:找出参演过10部或以上的G级或PG级电影的所有演员。原始数据如下所示:

apache drill (mongo.sakila)> SELECT Rating, Actors
. . . . . . . . . . . . . )> FROM films
. . . . . . . . . . . . . )> WHERE Rating IN ('G','PG');
+--------+----------------------------------------------------------------+
| Rating |                                      Actors                    |
+--------+----------------------------------------------------------------+
| PG     |["First name":"PENELOPE","Last name":"GUINESS","actorId":"1",
           "First name":"FRANCES","Last name":"DAY-LEWIS","actorId":"48",
           "First name":"ANNE","Last name":"CRONYN","actorId":"49",
           "First name":"RAY","Last name":"JOHANSSON","actorId":"64",
           "First name":"PENELOPE","Last name":"CRONYN","actorId":"104",
           "First name":"HARRISON","Last name":"BALE","actorId":"115",
           "First name":"JEFF","Last name":"SILVERSTONE","actorId":"180",
           "First name":"ROCK","Last name":"DUKAKIS","actorId":"188"]   |
| PG     |["First name":"UMA","Last name":"WOOD","actorId":"13",
           "First name":"HELEN","Last name":"VOIGHT","actorId":"17",
           "First name":"CAMERON","Last name":"STREEP","actorId":"24",
           "First name":"CARMEN","Last name":"HUNT","actorId":"52",
           "First name":"JANE","Last name":"JACKMAN","actorId":"131",
           "First name":"BELA","Last name":"WALKEN","actorId":"196"]    |
...
| G      |["First name":"ED","Last name":"CHASE","actorId":"3",
           "First name":"JULIA","Last name":"MCQUEEN","actorId":"27",
           "First name":"JAMES","Last name":"PITT","actorId":"84",
           "First name":"CHRISTOPHER","Last name":"WEST","actorId":"163",
           "First name":"MENA","Last name":"HOPPER","actorId":"170"]    |
+--------+----------------------------------------------------------------+
372 rows selected (0.432 seconds)

Actors字段是一个列表,包含一个或多个演员文档。为了像与数据表交互一样与这些数据进行交互,可以使用flatten命令将列表转换为包含3个字段的嵌套数据表:

apache drill (mongo.sakila)> SELECT f.Rating, flatten(Actors) actor_list
. . . . . . . . . . . . . )>   FROM films f
. . . . . . . . . . . . . )>   WHERE f.Rating IN ('G','PG');
+--------+----------------------------------------------------------------+
| Rating |                             actor_list                         |
+--------+----------------------------------------------------------------+
| PG     | "First name":"PENELOPE","Last name":"GUINESS","actorId":"1"  |
| PG     | "First name":"FRANCES","Last name":"DAY-LEWIS","actorId":"48"|
| PG     | "First name":"ANNE","Last name":"CRONYN","actorId":"49"      |
| PG     | "First name":"RAY","Last name":"JOHANSSON","actorId":"64"    |
| PG     | "First name":"PENELOPE","Last name":"CRONYN","actorId":"104" |
| PG     | "First name":"HARRISON","Last name":"BALE","actorId":"115"   |
| PG     | "First name":"JEFF","Last name":"SILVERSTONE","actorId":"180"|
| PG     | "First name":"ROCK","Last name":"DUKAKIS","actorId":"188"    |
| PG     | "First name":"UMA","Last name":"WOOD","actorId":"13"         |
| PG     | "First name":"HELEN","Last name":"VOIGHT","actorId":"17"     |
| PG     | "First name":"CAMERON","Last name":"STREEP","actorId":"24"   |
| PG     | "First name":"CARMEN","Last name":"HUNT","actorId":"52"      |
| PG     | "First name":"JANE","Last name":"JACKMAN","actorId":"131"    |
| PG     | "First name":"BELA","Last name":"WALKEN","actorId":"196"     |
...
| G      | "First name":"ED","Last name":"CHASE","actorId":"3"          |
| G      | "First name":"JULIA","Last name":"MCQUEEN","actorId":"27"    |
| G      | "First name":"JAMES","Last name":"PITT","actorId":"84"       |
| G      | "First name":"CHRISTOPHER","Last name":"WEST","actorId":"163"|
| G      | "First name":"MENA","Last name":"HOPPER","actorId":"170"     |
+--------+----------------------------------------------------------------+
2,119 rows selected (0.718 seconds)      |  

该查询返回2,119行,而不是上个查询返回的372行,这表明每部G级或PG级电影平均有5.7位演员参演。可以将这个查询放入子查询中,用于按照分级或演员对数据进行分组:

apache drill (mongo.sakila)> SELECT g_pg_films.Rating,
. . . . . . . . . . . . . )>   g_pg_films.actor_list.`First name` first_name
. . . . . . . . . . . . . )>   g_pg_films.actor_list.`Last name` last_name,
. . . . . . . . . . . . . )>   count(*) num_films
. . . . . . . . . . . . . )> FROM
. . . . . . . . . . . . . )>  (SELECT f.Rating, flatten(Actors) actor_list
. . . . . . . . . . . . . )>   FROM films f
. . . . . . . . . . . . . )>   WHERE f.Rating IN ('G','PG')
. . . . . . . . . . . . . )> ) g_pg_films
. . . . . . . . . . . . . )> GROUP BY g_pg_films.Rating,
. . . . . . . . . . . . . )>   g_pg_films.actor_list.`First name`,
. . . . . . . . . . . . . )>   g_pg_films.actor_list.`Last name`
. . . . . . . . . . . . . )> HAVING count(*) > 9;
+--------+------------+-------------+-----------+
| Rating | first_name |  last_name  | num_films |
+--------+------------+-------------+-----------+
| PG     | JEFF       | SILVERSTONE | 12        |
| G      | GRACE      | MOSTEL      | 10        |
| PG     | WALTER     | TORN        | 11        |
| PG     | SUSAN      | DAVIS       | 10        |
| PG     | CAMERON    | ZELLWEGER   | 15        |
| PG     | RIP        | CRAWFORD    | 11        |
| PG     | RICHARD    | PENN        | 10        |
| G      | SUSAN      | DAVIS       | 13        |
| PG     | VAL        | BOLGER      | 12        |
| PG     | KIRSTEN    | AKROYD      | 12        |
| G      | VIVIEN     | BERGEN      | 10        |
| G      | BEN        | WILLIS      | 14        |
| G      | HELEN      | VOIGHT      | 12        |
| PG     | VIVIEN     | BASINGER    | 10        |
| PG     | NICK       | STALLONE    | 12        |
| G      | DARYL      | CRAWFORD    | 12        |
| PG     | MORGAN     | WILLIAMS    | 10        |
| PG     | FAY        | WINSLET     | 10        |
+--------+------------+-------------+-----------+
18 rows selected (0.466 seconds)

内查询使用flatten命令为每位参演过G级或PG级电影的演员创建一行,外查询则简单地对该数据集进行分组。

接下来,针对MongoDB中的customers集合编写查询。这有点难度,因为每个文档都含有一个电影租借列表,每个租借列表都含有一个付款列表。为了再增加点乐趣,还可以连接films集合,看看Drill如何处理连接。该查询应该返回租借过G级或PG级电影且租借费超过80美元的所有客户,如下所示:

apache drill (mongo.sakila)> SELECT first_name, last_name,
. . . . . . . . . . . . . )>   sum(cast(cust_payments.payment_data.Amount
. . . . . . . . . . . . . )>         as decimal(4,2))) tot_payments
. . . . . . . . . . . . . )> FROM
. . . . . . . . . . . . . )>  (SELECT cust_data.first_name,
. . . . . . . . . . . . . )>     cust_data.last_name,
. . . . . . . . . . . . . )>     f.Rating,
. . . . . . . . . . . . . )>     flatten(cust_data.rental_data.Payments)
. . . . . . . . . . . . . )>       payment_data
. . . . . . . . . . . . . )>   FROM films f
. . . . . . . . . . . . . )>     INNER JOIN
. . . . . . . . . . . . . )>    (SELECT c.`First Name` first_name,
. . . . . . . . . . . . . )>       c.`Last Name` last_name,
. . . . . . . . . . . . . )>       flatten(c.Rentals) rental_data
. . . . . . . . . . . . . )>     FROM customers c
. . . . . . . . . . . . . )>    ) cust_data
. . . . . . . . . . . . . )>     ON f._id = cust_data.rental_data.filmID
. . . . . . . . . . . . . )>   WHERE f.Rating IN ('G','PG')
. . . . . . . . . . . . . )>  ) cust_payments
. . . . . . . . . . . . . )> GROUP BY first_name, last_name
. . . . . . . . . . . . . )> HAVING
. . . . . . . . . . . . . )>   sum(cast(cust_payments.payment_data.Amount
. . . . . . . . . . . . . )>         as decimal(4,2))) > 80;
+------------+-----------+--------------+
| first_name | last_name | tot_payments |
+------------+-----------+--------------+
| ELEANOR    | HUNT      | 85.80        |
| GORDON     | ALLARD    | 85.86        |
| CLARA      | SHAW      | 86.83        |
| JACQUELINE | LONG      | 86.82        |
| KARL       | SEAL      | 89.83        |
| PRISCILLA  | LOWE      | 95.80        |
| MONICA     | HICKS     | 85.82        |
| LOUIS      | LEONE     | 95.82        |
| JUNE       | CARROLL   | 88.83        |
| ALICE      | STEWART   | 81.82        |
+------------+-----------+--------------+
10 rows selected (1.658 seconds)

最内层的查询,我将其命名为cust_data,将Rentals列表扁平化,使得cust_payments查询能够连接到films集合,同时扁平化Payments列表。最外层的查询按照客户名对数据进行分组,并应用having子句筛选出在G级或PG级电影上花费80美元或更少的客户。

18.5 使用Drill处理多个数据源

到目前为止,使用Drill连接了存储在同一数据库中的多个数据表,但如果数据存储在不同的数据库中呢?例如,假设客户/租借/付款数据存储在MongoDB中,但电影/演员的分类数据存储在MySQL中。只要配置好Drill连接到这两个数据库,剩下的只需要描述到哪里去查找数据。下面是18.4节中的查询,但不是连接到存储在MongoDB中的films集合,而是指定了存储在MySQL中的film数据表:

apache drill (mongo.sakila)> SELECT first_name, last_name,
. . . . . . . . . . . . . )>   sum(cast(cust_payments.payment_data.Amount
. . . . . . . . . . . . . )>         as decimal(4,2))) tot_payments
. . . . . . . . . . . . . )> FROM
. . . . . . . . . . . . . )>  (SELECT cust_data.first_name,
. . . . . . . . . . . . . )>     cust_data.last_name,
. . . . . . . . . . . . . )>     f.Rating,
. . . . . . . . . . . . . )>     flatten(cust_data.rental_data.Payments)
. . . . . . . . . . . . . )>       payment_data
. . . . . . . . . . . . . )>   FROM mysql.sakila.film f
. . . . . . . . . . . . . )>     INNER JOIN
. . . . . . . . . . . . . )>    (SELECT c.`First Name` first_name,
. . . . . . . . . . . . . )>       c.`Last Name` last_name,
. . . . . . . . . . . . . )>       flatten(c.Rentals) rental_data
. . . . . . . . . . . . . )>     FROM mongo.sakila.customers c
. . . . . . . . . . . . . )>    ) cust_data
. . . . . . . . . . . . . )>     ON f.film_id =
. . . . . . . . . . . . . )>       cast(cust_data.rental_data.filmID as integer)
. . . . . . . . . . . . . )>   WHERE f.rating IN ('G','PG')
. . . . . . . . . . . . . )>  ) cust_payments
. . . . . . . . . . . . . )> GROUP BY first_name, last_name
. . . . . . . . . . . . . )> HAVING
. . . . . . . . . . . . . )>   sum(cast(cust_payments.payment_data.Amount
. . . . . . . . . . . . . )>         as decimal(4,2))) > 80;
+------------+-----------+--------------+
| first_name | last_name | tot_payments |
+------------+-----------+--------------+
| LOUIS      | LEONE     | 95.82        |
| JACQUELINE | LONG      | 86.82        |
| CLARA      | SHAW      | 86.83        |
| ELEANOR    | HUNT      | 85.80        |
| JUNE       | CARROLL   | 88.83        |
| PRISCILLA  | LOWE      | 95.80        |
| ALICE      | STEWART   | 81.82        |
| MONICA     | HICKS     | 85.82        |
| GORDON     | ALLARD    | 85.86        |
| KARL       | SEAL      | 89.83        |
+------------+-----------+--------------+
10 rows selected (1.874 seconds)

因为在同一个查询中使用了多个数据库,所以指定了每个数据表/集合的完整路径,以便清楚数据的来源。这正是Drill真正的优势所在,因为我可以在同一个查询中组合多个来源的数据,而无须将一个来源的数据转换并加载到另一个来源。

18.6 SQL的未来

关系型数据库的未来有些不明朗。过去十年的大数据技术可能会继续发展成熟并获得市场份额。也有可能出现一批超越Hadoop和NoSQL的新技术,并从关系型数据库中夺取更多的市场份额。但是,大多数公司仍在使用关系型数据库运行其核心业务功能,这种现状应该需要很长时间才会改变。

不过,SQL的未来似乎更清晰一些。虽然SQL语言最初是作为一种与关系型数据库中的数据进行交互的机制,而像Apache Drill这种工具更像是一个抽象层,促进了跨各种数据库平台的数据分析。从我的观点来看,这种趋势还将继续,SQL在未来多年内仍然是数据分析和报表生成的关键工具。

本文摘自《SQL学习指南(第3版)》

本书介绍了SQL语言的基础知识以及高级特性,包括SQL基本查询、过滤、多数据表查询、集合、数据操作、分组和聚合、子查询、连接、条件逻辑、事务、索引和约束、视图等内容。同时,为了适应近年来数据库领域的发展变化,本书针对大数据、SQL跨平台数据库服务和数据分析等领域的需求,增加了处理大型数据库的实现策略和扩展技术,以及报表和分析工具等内容。

本书内容循序渐进,每章的主题相对独立,并提供了丰富、可扩展的示例,同时还配备精选练习,有利于读者有效学习和快速掌握SQL语言。本书适合作为数据库应用开发者和数据库管理员的必备入门书,也可供SQL相关从业者查阅和参考。

以上是关于SQL学习指南:SQL和大数据的主要内容,如果未能解决你的问题,请参考以下文章

SQL学习笔记——导论

Spark SQL访问PostgreSQL

传递查询如何在Access中使用SQL Server数据库引擎进行查询

传递查询如何在Access中使用SQL Server数据库引擎进行查询

MySQL 性能调优——SQL 优化

如何编写更好的SQL查询:终极指南-第三部分