Accessing data in Hadoop using dplyr and SQL

Posted by payton数据之旅


If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase, and Spark.
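To see the translation mechanism at work, here is a minimal sketch using an in-memory SQLite database as a stand-in backend and a made-up table; the same dplyr-to-SQL translation applies when the connection points at Hive, Impala, or Spark. The table and column names below are purely illustrative.

library(DBI)
library(dplyr)
library(dbplyr)

# Stand-in backend for illustration; a Hadoop connection works the same way.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(carrier = "AA", dep_delay = 5))

tbl(con, "flights") %>%
  filter(dep_delay > 0) %>%
  count(carrier) %>%
  show_query()   # prints the SQL that dplyr generates, without pulling any data

dbDisconnect(con)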

There are two methods for accessing data in Hadoop using dplyr and SQL.

ODBC

You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a data-source-specific driver (e.g., Hive, Impala, HBase) installed on your desktop or your server. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:

library(DBI)     # dbConnect(), dbGetQuery(), dbDisconnect()
library(dplyr)   # tbl()
library(odbc)

con <- dbConnect(odbc::odbc(),
                 driver   = "<driver>",    # e.g., the Hive or Impala ODBC driver
                 host     = "<host>",
                 dbname   = "<dbname>",
                 user     = "<user>",
                 password = "<password>",
                 port     = 10000)

tbl(con, "mytable")                        # dplyr
dbGetQuery(con, "SELECT * FROM mytable")   # SQL

dbDisconnect(con)
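Once the connection is open, dplyr verbs applied to the remote table are translated to SQL and executed inside Hadoop; only the final result comes back to R when you call collect(). A short sketch, assuming hypothetical year, category, and amount columns in mytable:

mytable <- tbl(con, "mytable")

result <- mytable %>%
  filter(year == 2017) %>%                          # becomes a SQL WHERE clause
  group_by(category) %>%                            # becomes GROUP BY
  summarise(total = sum(amount, na.rm = TRUE)) %>%  # becomes SUM(amount)
  collect()                                         # runs the query, returns a data frame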

Spark

If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection to Spark and run queries:

library(DBI)       # dbGetQuery()
library(dplyr)     # tbl()
library(sparklyr)

con <- spark_connect(master = "yarn-client")

tbl(con, "mytable")                        # dplyr
dbGetQuery(con, "SELECT * FROM mytable")   # SQL

spark_disconnect(con)
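sparklyr can also load files from HDFS into Spark and let you summarise them with dplyr before pulling the (much smaller) result into R. A minimal sketch, assuming a hypothetical CSV file in HDFS with carrier and dep_delay columns:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")

# Hypothetical path; spark_read_csv() registers the file as a Spark table.
flights <- spark_read_csv(sc, name = "flights",
                          path = "hdfs:///data/flights.csv")

flights %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%  # runs in Spark
  collect()                                                 # returns a data frame to R

spark_disconnect(sc)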


Reposted from: https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL
