Get all non-null columns of a Spark DataFrame in one column
Posted: 2020-06-16 11:26:22

Question: I need to select all the non-null columns from a Hive table and insert them into HBase. For example, consider the table below:
Name     | Place       | Department | Experience
================================================
Ram      | Ramgarh     | Sales      | 14
Lakshman | Lakshmanpur | Operations |
Sita     | Sitapur     |            | 14
Ravan    |             |            | 25
I have to write all the non-null columns from the table above to HBase, so I wrote logic to collect each row's non-null column names into a single column of the DataFrame, as shown below. The Name column is mandatory.
Name     | Place       | Department | Experience | Not_null_columns
===================================================================
Ram      | Ramgarh     | Sales      | 14         | Name, Place, Department, Experience
Lakshman | Lakshmanpur | Operations |            | Name, Place, Department
Sita     | Sitapur     |            | 14         | Name, Place, Experience
Ravan    |             |            | 25         | Name, Experience
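The per-row logic described above can be sketched in plain Python (illustrative only; the asker's actual code runs on Spark, and the row data here is copied from the example table):

```python
# Rows from the example table; None stands in for a null column.
rows = [
    {"Name": "Ram", "Place": "Ramgarh", "Department": "Sales", "Experience": 14},
    {"Name": "Lakshman", "Place": "Lakshmanpur", "Department": "Operations", "Experience": None},
    {"Name": "Sita", "Place": "Sitapur", "Department": None, "Experience": 14},
    {"Name": "Ravan", "Place": None, "Department": None, "Experience": 25},
]

def not_null_columns(row):
    # Keep the names of columns whose value is not null, in column order.
    return ", ".join(k for k, v in row.items() if v is not None)

for r in rows:
    print(not_null_columns(r))
# The last row prints: Name, Experience
```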
Now my requirement is to create a column in the DataFrame that holds the values of all the non-null columns in a single column, as shown below.
Name     | Place       | Department | Experience | Not_null_columns_values
==========================================================================
Ram      | Ramgarh     | Sales      | 14         | Name: Ram, Place: Ramgarh, Department: Sales, Experience: 14
Lakshman | Lakshmanpur | Operations |            | Name: Lakshman, Place: Lakshmanpur, Department: Operations
Sita     | Sitapur     |            | 14         | Name: Sita, Place: Sitapur, Experience: 14
Ravan    |             |            | 25         | Name: Ravan, Experience: 25
Once I have this DataFrame, I will write it to HBase, with Name as the key and the last column as the value.
Please let me know if there is a better way to do this.
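The desired key/value shape can likewise be sketched in plain Python (a hypothetical illustration, not the Spark or HBase API; row data copied from the example table):

```python
rows = [
    {"Name": "Ram", "Place": "Ramgarh", "Department": "Sales", "Experience": 14},
    {"Name": "Lakshman", "Place": "Lakshmanpur", "Department": "Operations", "Experience": None},
    {"Name": "Sita", "Place": "Sitapur", "Department": None, "Experience": 14},
    {"Name": "Ravan", "Place": None, "Department": None, "Experience": 25},
]

def to_kv(row):
    # Key: the mandatory Name column.
    # Value: "column: value" for every non-null column, comma-separated.
    value = ", ".join(f"{k}: {v}" for k, v in row.items() if v is not None)
    return row["Name"], value

for key, value in (to_kv(r) for r in rows):
    print(key, "->", value)
# The last row prints: Ravan -> Name: Ravan, Experience: 25
```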
Answer 1: Try this -
Load the provided test data:
import org.apache.spark.sql.functions._
import spark.implicits._   // assumes a SparkSession named `spark` is in scope

val data =
  """
    |Name | Place | Department | Experience
    |Ram | Ramgarh | Sales | 14
    |Lakshman | Lakshmanpur |Operations |
    |Sita | Sitapur | | 14
    |Ravan | | | 25
  """.stripMargin
// Normalize each line to trimmed, comma-separated fields and wrap it in a Dataset[String]
val stringDS = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()
val df = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  // .option("nullValue", "null")
  .csv(stringDS)
df.show(false)
df.printSchema()
df.show(false)
df.printSchema()
/**
* +--------+-----------+----------+----------+
* |Name |Place |Department|Experience|
* +--------+-----------+----------+----------+
* |Ram |Ramgarh |Sales |14 |
* |Lakshman|Lakshmanpur|Operations|null |
* |Sita |Sitapur |null |14 |
* |Ravan |null |null |25 |
* +--------+-----------+----------+----------+
*
* root
* |-- Name: string (nullable = true)
* |-- Place: string (nullable = true)
* |-- Department: string (nullable = true)
* |-- Experience: integer (nullable = true)
*/
Convert the columns to a struct first, then convert the struct to JSON:
val x = df.withColumn("Not_null_columns_values",
to_json(struct(df.columns.map(col): _*)))
x.show(false)
x.printSchema()
/**
 * +--------+-----------+----------+----------+----------------------------------------------------------------------+
 * |Name    |Place      |Department|Experience|Not_null_columns_values                                               |
 * +--------+-----------+----------+----------+----------------------------------------------------------------------+
 * |Ram     |Ramgarh    |Sales     |14        |{"Name":"Ram","Place":"Ramgarh","Department":"Sales","Experience":14} |
 * |Lakshman|Lakshmanpur|Operations|null      |{"Name":"Lakshman","Place":"Lakshmanpur","Department":"Operations"}  |
 * |Sita    |Sitapur    |null      |14        |{"Name":"Sita","Place":"Sitapur","Experience":14}                     |
 * |Ravan   |null       |null      |25        |{"Name":"Ravan","Experience":25}                                      |
 * +--------+-----------+----------+----------+----------------------------------------------------------------------+
 */
*/
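The property this answer relies on is that Spark's to_json drops struct fields that are null, which is exactly the "non-null columns only" behavior the question asks for. The same semantics can be shown in plain Python (an illustration of the behavior, not the Spark API; note Spark's output uses no spaces after separators, unlike json.dumps defaults):

```python
import json

def row_to_json(row):
    # Emulate to_json(struct(...)): fields that are null (None) are
    # omitted from the serialized object entirely.
    return json.dumps({k: v for k, v in row.items() if v is not None})

print(row_to_json({"Name": "Ravan", "Place": None, "Department": None, "Experience": 25}))
# → {"Name": "Ravan", "Experience": 25}
```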
Comments:
Thanks, this worked like a charm. Still trying to understand the logic behind it, as I'm new to Spark.