Accessing data in Hadoop using dplyr and SQL
Posted by payton数据之旅
If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase and Spark.
There are two methods for accessing data in Hadoop using dplyr and SQL.
ODBC
You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a driver specific to your data source (e.g., Hive, Impala, HBase) installed on your desktop or your server. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package relies on the dbplyr package to translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:
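library(DBI)    # dbConnect(), dbGetQuery(), dbDisconnect()
library(dplyr)  # tbl()
library(odbc)

# Replace each <placeholder> with the value for your environment
con <- dbConnect(odbc::odbc(),
                 driver = <driver>,
                 host = <host>,
                 dbname = <dbname>,
                 user = <user>,
                 password = <password>,
                 port = 10000)
tbl(con, "mytable")                       # dplyr
dbGetQuery(con, "SELECT * FROM mytable")  # SQL
dbDisconnect(con)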
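The translation from dplyr verbs to SQL can be previewed without a cluster or driver. The following is a minimal sketch that assumes the dbplyr package is installed; it uses dbplyr's simulated Hive connection, and the column names carrier and dep_delay are hypothetical, chosen only for illustration. The exact SQL text may differ between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated Hive connection: nothing is executed,
# and no ODBC driver or Hadoop cluster is required.
# The columns `carrier` and `dep_delay` are hypothetical.
flights <- lazy_frame(carrier = "AA", dep_delay = 1.5, con = simulate_hive())

# Ordinary dplyr verbs; dbplyr records them instead of running them.
delays <- flights %>%
  filter(dep_delay > 15) %>%
  group_by(carrier) %>%
  summarise(n = n())

# Render the SQL that would be sent to the data source.
sql_text <- as.character(sql_render(delays))
cat(sql_text, "\n")
```

Against a real connection, calling show_query() on a table created with tbl(con, "mytable") prints the generated SQL in the same way, which can help when checking what will run on the Hadoop side.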
Spark
If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection with Spark and run queries:
library(DBI)    # dbGetQuery()
library(dplyr)  # tbl()
library(sparklyr)

# Open the connection before running any query
con <- spark_connect(master = "yarn-client")
tbl(con, "mytable")                       # dplyr
dbGetQuery(con, "SELECT * FROM mytable")  # SQL
spark_disconnect(con)
Source: https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL