How to merge an array of lists into a single column and fit it to an existing dataframe?

Posted: 2017-11-20 23:41:35

Question:

I am new to Spark and Scala. Please help me with this.

I have the output below, and I need to generate a new dataframe in which all the features are merged into a single column instead of remaining as separate lists. I also need to append this dataframe to another, already existing dataframe. How can I do this in Scala?

// inter is an Array[DataFrame]; an empty groupBy().sum() collapses
// each DataFrame into a single row holding its column sums
val tab = inter.map(_.groupBy().sum())
tab.map(_.show())

tab: Array[org.apache.spark.sql.DataFrame] = Array([sum(vec_0): double, sum(vec_1): double ... 2 more fields], [sum(vec_0): double, sum(vec_1): double ... 2 more fields])
+------------------+------------------+------------------+------------------+
|        sum(vec_0)|        sum(vec_1)|        sum(vec_2)|        sum(vec_3)|
+------------------+------------------+------------------+------------------+
|2.5046410000000003|2.1487149999999997|1.0884870000000002|3.5877090000000003|
+------------------+------------------+------------------+------------------+
+------------------+------------------+----------+------------------+
|        sum(vec_0)|        sum(vec_1)|sum(vec_2)|        sum(vec_3)|
+------------------+------------------+----------+------------------+
|0.9558040000000001|0.9843780000000002|  0.545025|0.9979860000000002|
+------------------+------------------+----------+------------------+
res325: Array[Unit] = Array((), ())
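For context, here is a minimal sketch of how an input like `inter` could be built; the SparkSession setup and the sample values are assumptions, chosen only to reproduce the shape of the output above:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // hypothetical sample data: each DataFrame has four double columns,
    // so groupBy().sum() collapses it to one row of sum(vec_0)..sum(vec_3)
    val inter: Array[DataFrame] = Array(
      Seq((1.1, 1.0, 0.5, 1.7), (1.4, 1.1, 0.6, 1.9)).toDF("vec_0", "vec_1", "vec_2", "vec_3"),
      Seq((0.4, 0.5, 0.3, 0.5), (0.5, 0.5, 0.2, 0.5)).toDF("vec_0", "vec_1", "vec_2", "vec_3")
    )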

    // alias each single-row DataFrame, then pack all of its columns
    // into one array<double> column named "List"
    val temp = tab.map(_.alias("t").select(array("t.*") as "List"))
    temp.map(_.toDF().show(false))

    temp: Array[org.apache.spark.sql.DataFrame] = Array([List: array<double>], [List: array<double>])
    +--------------------------------------------------------------------------------+
    |List                                                                            |
    +--------------------------------------------------------------------------------+
    |[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
    +--------------------------------------------------------------------------------+
    +----------------------------------------------------------------------+
    |List                                                                  |
    +----------------------------------------------------------------------+
    |[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]|
    +----------------------------------------------------------------------+
    res443: Array[Unit] = Array((), ())
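As an aside, `array("t.*")` is what packs every column of the aliased DataFrame into a single array column. A standalone illustration (the sample values are made up; assumes `spark.implicits._` is in scope):

    import org.apache.spark.sql.functions.array

    // one row with two double columns...
    val df = Seq((1.0, 2.0)).toDF("a", "b")
    // ...collapsed into a single array<double> column
    df.alias("t").select(array("t.*") as "List").show(false)
    // +----------+
    // |List      |
    // +----------+
    // |[1.0, 2.0]|
    // +----------+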
val newtable = temp.map(_.toDF("features"))  // rename the array column to "features"
newtable.map(_.show(false))

     newtable: Array[org.apache.spark.sql.DataFrame] = Array([features: array<double>], [features: array<double>])
    +--------------------------------------------------------------------------------+
    |features                                                                        |
    +--------------------------------------------------------------------------------+
    |[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
    +--------------------------------------------------------------------------------+
    +----------------------------------------------------------------------+
    |features                                                              |
    +----------------------------------------------------------------------+
    |[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]|
    +----------------------------------------------------------------------+
    res328: Array[Unit] = Array((), ())

Expected output:

+--------------------------------------------------------------------------------+
|features                                                                        |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]          |
+--------------------------------------------------------------------------------+

Comments:

Try flatMap instead of map; it should flatten the array. For example: val newtable = temp.flatMap(_.toDF("features"))

If I try flatMap, I get the following error: found: org.apache.spark.sql.DataFrame (which expands to org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]); required: scala.collection.GenTraversableOnce[?], at val newtable = temp.flatMap(_.toDF)

What does your input (the temp variable) look like? Could you add it to the question?

Yes, I agree with Shaido; a sample of temp should help you get to a solution quickly.

temp looks like newtable... I created a list by merging the columns and converted it to a list. Now I am trying to put all the lists under one column so that it can be appended to a dataframe.

Answer 1:

This solves the problem:

val fList = newtable.reduce(_.union(_))
fList.show(false)



 fList: org.apache.spark.sql.DataFrame = [features: array<double>]
+--------------------------------------------------------------------------------+
|features                                                                        |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]          |
+--------------------------------------------------------------------------------+
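Putting the steps together, here is a self-contained sketch of the whole pipeline; `other` is a hypothetical, pre-existing dataframe with a matching `features` column, used only to show the final append:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.array

    // sum each DataFrame, pack the sums into one array column,
    // then union the single-row DataFrames into a single DataFrame
    val fList: DataFrame = inter
      .map(_.groupBy().sum())
      .map(_.alias("t").select(array("t.*") as "features"))
      .reduce(_.union(_))

    // appending to an existing dataframe of the same schema is another union;
    // `other` is assumed to hold a single array<double> column named "features"
    // val combined = other.union(fList)

Note that `reduce(_.union(_))` chains the unions one after another, which builds a long lineage when the array holds many DataFrames; unioning pairwise or going through `SparkContext.union` on the underlying RDDs can be preferable in that case.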
