Spark SQL DataFrame: union and reduce(_ union _)
Posted by nefu-ljw
The union function
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val join12 = df1.union(df2)
join12.show(false)
// join12 result
+----+----+----+
|col0|col1|col2|
+----+----+----+
|1 |2 |3 |
|1 |2 |3 |
+----+----+----+
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
val df3 = Seq((1, 2, 3)).toDF("col1", "col2", "col0")
val join12 = df1.union(df2)
val join123 = join12.union(df3)
// join123 result
+----+----+----+
|col0|col1|col2|
+----+----+----+
|1 |2 |3 |
|4 |5 |6 |
|1 |2 |3 |
+----+----+----+
union returns a new Dataset containing the union of rows in this Dataset and another Dataset.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.
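The UNION ALL vs. set-union distinction can be modeled with plain Scala collections (a sketch, not the Spark API; ++ plays the role of union here):

```scala
val a = Seq((1, 2, 3))
val b = Seq((1, 2, 3))

// union has UNION ALL semantics: duplicates are kept, like ++ on Seq
val unionAll = a ++ b

// union followed by distinct gives a SQL-style set union: duplicates removed
val setUnion = (a ++ b).distinct
```

In Spark itself this corresponds to df1.union(df2) versus df1.union(df2).distinct.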
As is also standard in SQL, this function resolves columns by position (not by name):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show
// Output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   4|   5|   6|
// +----+----+----+
Note that the column positions in the schema do not necessarily match the fields of the strongly typed objects in the Dataset. This function resolves columns by their position in the schema, not by the fields of the strongly typed objects. Use unionByName to resolve columns by the field names of the typed objects.
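The difference between by-position and by-name resolution can be sketched with plain Scala maps (a model of the behavior, not the Spark API; the column names follow the df1/df2 example above):

```scala
// df1's column order, df2's column order, and df2's row keyed by its own column names
val cols1 = Seq("col0", "col1", "col2")
val cols2 = Seq("col1", "col2", "col0")
val row2  = Map("col1" -> 4, "col2" -> 5, "col0" -> 6)

// By-position resolution (what union does): df2's values are read in
// df2's column order and placed under df1's column names
val byPosition = cols1.zip(cols2.map(row2)).toMap

// By-name resolution (what unionByName does): each value stays with its name
val byName = cols1.map(c => c -> row2(c)).toMap
```

byPosition yields col0 -> 4, matching the (4, 5, 6) row in the output above, while byName yields col0 -> 6.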
The reduce function
reduce calls the reduceLeft function. From the Scala documentation for reduceLeft:
Applies a binary operator to all elements of this traversable or iterator, going left to right. Note: will not terminate for infinite-sized collections. Note: might return different results for different runs, unless the underlying collection type is ordered or the operator is associative and commutative.
Parameters:
op – the binary operator.
Type parameters:
B – the result type of the binary operator.
Returns:
the result of inserting op between consecutive elements of this traversable or iterator, going left to right:
op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)
where x_1, ..., x_n are the elements of this traversable or iterator.
Throws:
UnsupportedOperationException – if this traversable or iterator is empty.
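The left-to-right grouping and the empty-collection behavior can be checked with plain Scala (no Spark needed):

```scala
val xs = Seq(1, 2, 3, 4)

// Left-associated: op(op(op(1, 2), 3), 4); with subtraction the grouping matters
val diff = xs.reduceLeft(_ - _)  // ((1 - 2) - 3) - 4 = -8

// reduce on an empty collection throws UnsupportedOperationException
val threw =
  try { Seq.empty[Int].reduce(_ + _); false }
  catch { case _: UnsupportedOperationException => true }
```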
Example of reduce(_ union _)
https://stackoverflow.com/a/37612978/17434375
val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")
val dfs = Seq(df1, df2, df3)
val result_df = dfs.reduce(_ union _) // reduce(x,y) => (x union y)
result_df.show(false)
+---+----+
|id |x |
+---+----+
|1 |10 |
|2 |20 |
|3 |30 |
|4 |40 |
|1 |100 |
|2 |200 |
|3 |300 |
|4 |400 |
|1 |1000|
|2 |2000|
|3 |3000|
|4 |4000|
+---+----+
So dfs.reduce(_ union _) left-associatively unions the already-merged DataFrames 1..N-1 with the next DataFrame N; for three DataFrames this is union(union(df1, df2), df3). Note that the result takes df1's column names ("id", "x"), since union resolves columns by position.
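The left-associated folding can be modeled with row lists and ++ standing in for union (a sketch, not Spark code):

```scala
// Three "DataFrames" as row lists with the same shape
val d1 = Seq((1, 10), (2, 20))
val d2 = Seq((1, 100), (2, 200))
val d3 = Seq((1, 1000), (2, 2000))

// reduce(_ ++ _) left-associates: ((d1 ++ d2) ++ d3)
val result = Seq(d1, d2, d3).reduce(_ ++ _)
```

In real Spark code, dfs.reduce(_ union _) throws UnsupportedOperationException when dfs is empty; dfs.reduceOption(_ union _) returns an Option and avoids that.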