How to merge an array of lists into a single column and fit it to an existing dataframe?
Posted: 2017-11-20 23:41:35

I am new to Spark and Scala. Please help me with this.
I have the output below, and I need to generate a new dataframe in which all the features are merged into a single column instead of being separate lists. I also need to append this dataframe to another dataframe. How can I do this in Scala?
val tab = inter.map(_.groupBy().sum())
tab.map(_.show())
tab: Array[org.apache.spark.sql.DataFrame] = Array([sum(vec_0): double, sum(vec_1): double ... 2 more fields], [sum(vec_0): double, sum(vec_1): double ... 2 more fields])
+------------------+------------------+------------------+------------------+
| sum(vec_0)| sum(vec_1)| sum(vec_2)| sum(vec_3)|
+------------------+------------------+------------------+------------------+
|2.5046410000000003|2.1487149999999997|1.0884870000000002|3.5877090000000003|
+------------------+------------------+------------------+------------------+
+------------------+------------------+----------+------------------+
| sum(vec_0)| sum(vec_1)|sum(vec_2)| sum(vec_3)|
+------------------+------------------+----------+------------------+
|0.9558040000000001|0.9843780000000002| 0.545025|0.9979860000000002|
+------------------+------------------+----------+------------------+
res325: Array[Unit] = Array((), ())
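For intuition, `groupBy().sum()` with no grouping columns collapses a whole DataFrame into a single row of per-column sums. A minimal plain-Scala sketch of the same idea (illustrative values, not the question's data, since running Spark itself is not needed to see the shape of the operation):

```scala
// Stand-in for df.groupBy().sum(): sum each column across all rows
val rows = List(List(1.0, 2.0), List(1.5, 3.0)) // two rows, two columns
val colSums = rows.transpose.map(_.sum)         // one sum per column
// colSums: List(2.5, 5.0) -- a single "row" of column totals
```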
val temp = tab.map(_.alias("t").select(array("t.*") as "List"))
temp.map(_.toDF().show(false))
temp: Array[org.apache.spark.sql.DataFrame] = Array([List: array<double>], [List: array<double>])
+--------------------------------------------------------------------------------+
|List |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
+--------------------------------------------------------------------------------+
+----------------------------------------------------------------------+
|List |
+----------------------------------------------------------------------+
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]|
+----------------------------------------------------------------------+
res443: Array[Unit] = Array((), ())
val newtable = temp.map(_.toDF("features"))
newtable.map(_.show(false))
newtable: Array[org.apache.spark.sql.DataFrame] = Array([features: array<double>], [features: array<double>])
+--------------------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
+--------------------------------------------------------------------------------+
+----------------------------------------------------------------------+
|features |
+----------------------------------------------------------------------+
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]|
+----------------------------------------------------------------------+
res328: Array[Unit] = Array((), ())
Expected output:
+--------------------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002]          |
+--------------------------------------------------------------------------------+
Comments:

- Try flatMap instead of map; it should flatten the array, e.g. `val newtable = temp.flatMap(_.toDF("features"))`.
- If I try flatMap, I get the following error: found: `org.apache.spark.sql.DataFrame` (which expands to `org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]`); required: `scala.collection.GenTraversableOnce[?]` for `val newtable = temp.flatMap(_.toDF)`.
- What does your input (the temp variable) look like? Can you add it to the question?
- Yes, I agree with Shaido; a sample of temp should help you get a solution quickly.
- temp is similar to newtable... I created a list by merging the columns and converted it to a list. Now I am trying to put all the lists under one column so that it can be appended to a dataframe.

Answer 1:

This solved the problem:
val fList = newtable.reduce(_.union(_))
fList.show(false)
fList: org.apache.spark.sql.DataFrame = [features: array<double>]
+--------------------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------------------+
|[2.5046410000000003, 2.1487149999999997, 1.0884870000000002, 3.5877090000000003]|
|[0.9558040000000001, 0.9843780000000002, 0.545025, 0.9979860000000002] |
+--------------------------------------------------------------------------------+
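The fix works because `reduce(_.union(_))` pairwise-unions every DataFrame in the array into one. The same fold on plain Scala collections (each inner list standing in for one single-row DataFrame) shows the shape of the operation:

```scala
// Each element of parts stands in for one single-row DataFrame of features
val parts: Array[List[List[Double]]] = Array(
  List(List(2.504641, 2.148715, 1.088487, 3.587709)),
  List(List(0.955804, 0.984378, 0.545025, 0.997986))
)
// Analogous to newtable.reduce(_.union(_)): concatenate all the "rows"
val merged = parts.reduce(_ ++ _)
// merged now holds both feature rows in one collection
```

Note that `union` matches columns by position, not by name; if the schemas of the unioned DataFrames could differ in column order, `unionByName` (available since Spark 2.3) is the safer choice.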