Get all non-null columns of a Spark DataFrame in one column


Asked: 2020-06-16 11:26:22

I need to select all non-null columns from a Hive table and insert them into HBase. For example, consider the following table:

Name      Place         Department  Experience
==============================================
Ram      | Ramgarh      |  Sales      |  14
Lakshman | Lakshmanpur  |Operations   | 
Sita     | Sitapur      |             |  14
Ravan    |              |             |  25

I have to write all non-null columns from the table above to HBase, so I wrote logic to collect the non-null column names into a single column of the DataFrame, as shown below. The Name column is mandatory.

Name        Place       Department  Experience      Not_null_columns
================================================================================
Ram         Ramgarh     Sales        14            Name, Place, Department, Experience
Lakshman    Lakshmanpur Operations                 Name, Place, Department
Sita        Sitapur                  14            Name, Place, Experience
Ravan                                25            Name, Experience

Now my requirement is to create a column in the DataFrame that holds the names and values of all non-null columns, like this:

Name      Place        Department   Experience    Not_null_columns_values
Ram       Ramgarh      Sales        14            Name: Ram, Place: Ramgarh, Department: Sales, Experience: 14
Lakshman  Lakshmanpur  Operations                 Name: Lakshman, Place: Lakshmanpur, Department: Operations
Sita      Sitapur                   14            Name: Sita, Place: Sitapur, Experience: 14
Ravan                               25            Name: Ravan, Experience: 25

Once I have this DataFrame, I will write it to HBase with Name as the key and the last column as the value.
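The per-row transformation I am after can be sketched in plain Scala, outside Spark. The `Option`-encoded row and the `render` helper below are illustrative assumptions, not part of my actual pipeline:

```scala
// Sketch: for one row, keep only the non-null fields and join them
// as "name: value" pairs. Rows are modeled as (columnName, Option[value]).
object NonNullRow {
  def render(row: Seq[(String, Option[String])]): String =
    row.collect { case (name, Some(value)) => s"$name: $value" }
      .mkString(", ")

  def main(args: Array[String]): Unit = {
    val sita = Seq(
      "Name"       -> Some("Sita"),
      "Place"      -> Some("Sitapur"),
      "Department" -> None,
      "Experience" -> Some("14"))
    println(render(sita)) // Name: Sita, Place: Sitapur, Experience: 14
  }
}
```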

If there is a better way to do this, please let me know.

Comments:

Answer 1:

Try this:

Load the provided test data:

    // Implicits for .toDS() and the struct/to_json/col functions used below
    import org.apache.spark.sql.functions.{col, struct, to_json}
    import spark.implicits._

    val data =
      """
        |Name    |  Place    |     Department | Experience
        |
        |Ram      | Ramgarh      |  Sales      |  14
        |
        |Lakshman | Lakshmanpur  |Operations   |
        |
        |Sita     | Sitapur      |             |  14
        |
        |Ravan   |              |              |  25
      """.stripMargin

    val stringDS = data.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
      .toSeq.toDS()
    val df = spark.read
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
//      .option("nullValue", "null")
      .csv(stringDS)

    df.show(false)
    df.printSchema()
    /**
      * +--------+-----------+----------+----------+
      * |Name    |Place      |Department|Experience|
      * +--------+-----------+----------+----------+
      * |Ram     |Ramgarh    |Sales     |14        |
      * |Lakshman|Lakshmanpur|Operations|null      |
      * |Sita    |Sitapur    |null      |14        |
      * |Ravan   |null       |null      |25        |
      * +--------+-----------+----------+----------+
      *
      * root
      * |-- Name: string (nullable = true)
      * |-- Place: string (nullable = true)
      * |-- Department: string (nullable = true)
      * |-- Experience: integer (nullable = true)
      */

Wrap all columns in a struct, then convert it to JSON. `to_json` omits struct fields whose value is null, so only the non-null columns end up in the resulting string:

    val x = df.withColumn("Not_null_columns_values",
      to_json(struct(df.columns.map(col): _*)))
    x.show(false)
    x.printSchema()

    /**
      * +--------+-----------+----------+----------+----------------------------------------------------------------------+
      * |Name    |Place      |Department|Experience|Not_null_columns_values                                               |
      * +--------+-----------+----------+----------+----------------------------------------------------------------------+
      * |Ram     |Ramgarh    |Sales     |14        |{"Name":"Ram","Place":"Ramgarh","Department":"Sales","Experience":14} |
      * |Lakshman|Lakshmanpur|Operations|null      |{"Name":"Lakshman","Place":"Lakshmanpur","Department":"Operations"}  |
      * |Sita    |Sitapur    |null      |14        |{"Name":"Sita","Place":"Sitapur","Experience":14}                    |
      * |Ravan   |null       |null      |25        |{"Name":"Ravan","Experience":25}                                     |
      * +--------+-----------+----------+----------+----------------------------------------------------------------------+
      */
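The whole trick is that Spark's `to_json` skips struct fields that are null. A plain-Scala imitation of that behaviour may make the logic easier to follow; this is an illustrative sketch, not Spark's actual serializer, and it only handles string and integer values:

```scala
// Sketch: build a JSON object string, dropping entries whose value is None,
// mirroring how to_json drops null struct fields.
object JsonSketch {
  def toJsonSkippingNulls(fields: Seq[(String, Option[Any])]): String =
    fields.collect {
      case (k, Some(v: Int)) => s""""$k":$v"""     // numbers unquoted
      case (k, Some(v))      => s""""$k":"$v""""   // everything else as a string
    }.mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    val ravan = Seq(
      "Name"       -> Some("Ravan"),
      "Place"      -> None,
      "Department" -> None,
      "Experience" -> Some(25))
    println(toJsonSkippingNulls(ravan)) // {"Name":"Ravan","Experience":25}
  }
}
```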

Comments:

Thanks, this worked like a charm. Still trying to understand the logic, as I am new to Spark.
