Read HBase table with where clause using Spark

Posted 2023-02-16


【Title】: Read HBase table with where clause using Spark 【Posted】: 2016-10-17 08:56:25 【Question】:

I am trying to read an HBase table using the Spark Scala API.

Sample code:

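import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val conf = HBaseConfiguration.create()
conf.set("hbase.master", "localhost:60000")
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + hBaseRDD.count())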

If I use newAPIHadoopRDD, how can I add a where clause?

Or do we need to use a Spark HBase connector to achieve this?

I came across the Spark HBase connector below, but I don't see any sample code with a where clause.

https://github.com/nerdammer/spark-hbase-connector

【Question comments】:

【Answer 1】:

You can use Hortonworks' SHC connector for this.

https://github.com/hortonworks-spark/shc

Here is a code example for Spark 2.

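    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

    // SHC catalog: maps DataFrame columns to the HBase row key and column families
    val catalog =
        s"""{
            |"table":{"namespace":"default", "name":"my_table"},
            |"rowkey":"id",
            |"columns":{
            |"id":{"cf":"rowkey", "col":"id", "type":"string"},
            |"name":{"cf":"info", "col":"name", "type":"string"},
            |"age":{"cf":"info", "col":"age", "type":"string"}
            |}
            |}""".stripMargin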

    val spark = SparkSession
        .builder()
        .appName("hbase spark")
        .getOrCreate()

    val df = spark
        .read
        .options(
            Map(
                HBaseTableCatalog.tableCatalog -> catalog
            )
        )
        .format("org.apache.spark.sql.execution.datasources.hbase")
        .load()

    df.show()

You can then use any DataFrame method on it. For example:

df.where(df("age") === 20)
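
For the plain newAPIHadoopRDD route from the question, a filter can also be attached as an HBase Scan and passed through the job configuration, so that only matching rows reach Spark. The following is a minimal sketch, not a tested recipe. It assumes the HBase 2.x client API (`CompareOperator`; HBase 1.x uses `CompareFilter.CompareOp` instead), a table named `my_table`, and an `info:age` column stored as string bytes, as in the catalog above. The object name `HBaseWhereSketch` and the helper `filteredConf` are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{CompareOperator, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseWhereSketch {

  // Builds a job configuration whose scan keeps only rows where info:age == "20".
  def filteredConf(tableName: String): Configuration = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost") // assumption: local ZooKeeper
    conf.set(TableInputFormat.INPUT_TABLE, tableName)

    // Equivalent of "WHERE age = '20'", evaluated on the region servers.
    val ageIs20 = new SingleColumnValueFilter(
      Bytes.toBytes("info"),
      Bytes.toBytes("age"),
      CompareOperator.EQUAL,
      Bytes.toBytes("20"))
    // Skip rows that have no info:age cell instead of letting them through.
    ageIs20.setFilterIfMissing(true)

    val scan = new Scan().setFilter(ageIs20)

    // TableInputFormat reads the serialized scan back from this key.
    conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
    conf
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase where sketch"))
    val hBaseRDD = sc.newAPIHadoopRDD(
      filteredConf("my_table"),
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])
    println("Number of matching records: " + hBaseRDD.count())
    sc.stop()
  }
}
```

Row-key ranges can be restricted in a similar way with `TableInputFormat.SCAN_ROW_START` and `TableInputFormat.SCAN_ROW_STOP`, which avoids a full table scan when the condition is on the row key.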

【Comments】:

I tried this. The other fields are output correctly, but cf:rowkey is not written to the target table. Is this a property of SHC?
