Spark SQL DataFrame: union and reduce(_ union _)

Posted by nefu-ljw



The union function

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val join12 = df1.union(df2)
join12.show(false)
// join12 output:
+----+----+----+
|col0|col1|col2|
+----+----+----+
|1   |2   |3   |
|1   |2   |3   |
+----+----+----+

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
val df3 = Seq((1, 2, 3)).toDF("col1", "col2", "col0")
val join12 = df1.union(df2)
val join123 = join12.union(df3)
join123.show(false)
// join123 output:
+----+----+----+
|col0|col1|col2|
+----+----+----+
|1   |2   |3   |
|4   |5   |6   |
|1   |2   |3   |
+----+----+----+
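Note that the column names of df2 and df3 are ignored here; union only requires that both sides have the same number of columns. A minimal sketch of the failure mode (the exact error message varies by Spark version):

val df4 = Seq((7, 8)).toDF("col0", "col1")
// df1.union(df4) // AnalysisException: union requires the same number of
//                // columns on both sides (df1 has 3, df4 has 2)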

union returns a new Dataset containing the union of the rows in this Dataset and the rows in another Dataset. This is equivalent to UNION ALL in SQL. To do a SQL-style set union (which deduplicates elements), use this function followed by a distinct.
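As a quick check of the dedup point, here is a minimal sketch reusing the identical df1 and df2 from the first example above:

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")

df1.union(df2).count()            // 2: UNION ALL keeps the duplicate row
df1.union(df2).distinct().count() // 1: union followed by distinct ≈ SQL UNION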

Also, as is standard in SQL, this function resolves columns by position (not by name):

val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show

// Output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   4|   5|   6|
// +----+----+----+
Note that the column positions in the schema do not necessarily match the fields of the strongly typed objects inside the Dataset. This function resolves columns by their position in the schema, not by field name in the strongly typed objects. Use unionByName to resolve columns by field name instead.
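A minimal sketch of the difference, reusing df1 and df2 from the snippet just above (their column orders differ):

// union matches purely by position, so df2's row stays as (4, 5, 6)
df1.union(df2).show()

// unionByName (available since Spark 2.3) matches by column name,
// so df2's col0 value (6) is realigned under df1's col0:
df1.unionByName(df2).show()
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   6|   4|   5|
// +----+----+----+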

The reduce function

Scala's reduce delegates to reduceLeft; from its documentation:

Applies a binary operator to all elements of this traversable or iterator, going left to right. Note: will not terminate for infinite-sized collections. Note: might return different results for different runs, unless the underlying collection type is ordered or the operator is associative and commutative.
Params:
op – the binary operator.
Type parameters:
B – the result type of the binary operator.
Returns:
the result of inserting op between consecutive elements of this traversable or iterator, going left to right:
op( op( ... op(x_1, x_2) ..., x_{n-1}), x_n)
where x_1, ..., x_n are the elements of this traversable or iterator.
Throws:
UnsupportedOperationException – if this traversable or iterator is empty.
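A minimal illustration of reduce on an ordinary Scala collection, before applying the same idea to DataFrames:

val xs = Seq(1, 2, 3, 4)
xs.reduce(_ + _) // ((1 + 2) + 3) + 4 = 10, applied left to right
// Seq.empty[Int].reduce(_ + _) // throws UnsupportedOperationException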

reduce(_ union _) example

https://stackoverflow.com/a/37612978/17434375

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")
val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y")
val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z")

val dfs = Seq(df1, df2, df3)
val result_df = dfs.reduce(_ union _) // reduce(x,y) => (x union y)
result_df.show(false)

+---+----+
|id |x   |
+---+----+
|1  |10  |
|2  |20  |
|3  |30  |
|4  |40  |
|1  |100 |
|2  |200 |
|3  |300 |
|4  |400 |
|1  |1000|
|2  |2000|
|3  |3000|
|4  |4000|
+---+----+

So dfs.reduce(_ union _) left-associatively folds the sequence: at each step it unions the already-merged DataFrames 1 through N-1 with DataFrame N. For three DataFrames this expands to union(union(df1, df2), df3).
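One caveat carried over from reduceLeft: reduce throws UnsupportedOperationException when the collection is empty. If the Seq of DataFrames may be empty, a minimal defensive sketch is:

import org.apache.spark.sql.DataFrame

// reduceOption returns None instead of throwing on an empty Seq
val maybeResult: Option[DataFrame] = dfs.reduceOption(_ union _)
maybeResult.foreach(_.show(false))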
