如何在 Spark 中将两个 DataFrame 与组合列连接起来?

Posted

技术标签:

【中文标题】如何在 Spark 中将两个 DataFrame 与组合列连接起来?【英文标题】:How to join two DataFrame with combined columns in Spark? 【发布时间】:2019-01-18 08:39:48 【问题描述】:

我不明白如何将这样的 2 个 DataFrame 相互连接。

第一个DataFrame存储用户向服务中心请求时间的信息。

我们称这个DataFrame为df1

+-----------+---------------------+
| USER_NAME | REQUEST_DATE        |
+-----------+---------------------+
| Alex      | 2018-03-01 00:00:00 |
| Alex      | 2018-09-01 00:00:00 |
| Bob       | 2018-03-01 00:00:00 |
| Mark      | 2018-02-01 00:00:00 |
| Mark      | 2018-07-01 00:00:00 |
| Kate      | 2018-02-01 00:00:00 |
+-----------+---------------------+

Second DataFrame 存储有关用户可以使用服务中心服务的可能期限(许可期限)的信息。

我们称之为df2

+-----------+---------------------+---------------------+------------+
| USER_NAME | START_SERVICE       | END_SERVICE         | STATUS     |
+-----------+---------------------+---------------------+------------+
| Alex      | 2018-01-01 00:00:00 | 2018-06-01 00:00:00 | Active     |
| Bob       | 2018-01-01 00:00:00 | 2018-02-01 00:00:00 | Not Active |
| Mark      | 2018-01-01 00:00:00 | 2018-05-01 23:59:59 | Active     |
| Mark      | 2018-05-01 00:00:00 | 2018-08-01 23:59:59 | VIP        |
+-----------+---------------------+---------------------+------------+

如何加入这 2 个 DataFrame 并返回这样的结果?治疗时如何获取用户许可类型列表?

+-----------+---------------------+----------------+
| USER_NAME | REQUEST_DATE        | STATUS         |
+-----------+---------------------+----------------+
| Alex      | 2018-03-01 00:00:00 | Active         |
| Alex      | 2018-09-01 00:00:00 | No information |
| Bob       | 2018-03-01 00:00:00 | Not Active     |
| Mark      | 2018-02-01 00:00:00 | Active         |
| Mark      | 2018-07-01 00:00:00 | VIP            |
| Kate      | 2018-02-01 00:00:00 | No information |
+-----------+---------------------+----------------+

代码:

import org.apache.spark.sql.DataFrame

val df1: DataFrame  = Seq(
    ("Alex", "2018-03-01 00:00:00"),
    ("Alex", "2018-09-01 00:00:00"),
    ("Bob", "2018-03-01 00:00:00"),
    ("Mark", "2018-02-01 00:00:00"),
    ("Mark", "2018-07-01 00:00:00"),
    ("Kate", "2018-07-01 00:00:00")
).toDF("USER_NAME", "REQUEST_DATE")

df1.show()

val df2: DataFrame  = Seq(
    ("Alex", "2018-01-01 00:00:00", "2018-06-01 00:00:00", "Active"),
    ("Bob", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "Not Active"),
    ("Mark", "2018-01-01 00:00:00", "2018-05-01 23:59:59", "Active"),
    ("Mark", "2018-05-01 00:00:00", "2018-08-01 23:59:59", "Active")
).toDF("USER_NAME", "START_SERVICE", "END_SERVICE", "STATUS")

df1.show()

val total = df1.join(df2, df1("USER_NAME")===df2("USER_NAME"), "left").filter(df1("REQUEST_DATE") >= df2("START_SERVICE") && df1("REQUEST_DATE") <= df2("END_SERVICE")).select(df1("*"), df2("STATUS"))

total.show()

错误

org.apache.spark.SparkException: Exception thrown in awaitResult:
  at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
  at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:232)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:102)
  at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
  at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
  at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
  at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
  at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:85)
  at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:206)
  at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
  at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:354)
  at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:383)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
  at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:354)
  at org.apache.spark.sql.execution.FilterExec.doProduce(basicPhysicalOperators.scala:125)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
  at org.apache.spark.sql.execution.FilterExec.produce(basicPhysicalOperators.scala:85)
  at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:45)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
  at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:35)
  at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:97)
  at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)

【问题讨论】:

【参考方案1】:

如何加入这 2 个 DataFrame 并返回这样的结果?

df_joined = df1.join(df2, Seq('USER_NAME'), 'left' )

如何获取许可证仍然相关的所有用户的列表?

df_relevant = df_joined
.withColumn('STATUS', when(col('REQUEST_DATE') > col('START_SERVICE') and col('REQUEST_DATE') < col('END_SERVICE'), col('STATUS')).otherwise('No information') 
.select('USER_NAME', 'REQUEST_DATE', 'STATUS' )

【讨论】:

您好!谢谢您的回答。你能再看看我的帖子吗?我更改了 DataFrame 信息。在这种情况下你有什么建议? 您好,这个答案很好。除非您没有指定您正在使用的联接类型。默认情况下,大概是left或者inner,我不知道也不想知道,总是指定比较好。我认为这在这里可能并不重要,但由于@NurzhanNogerbek 似乎是新加入的,我认为指出它会很好。请访问***.com/questions/45990633/… 以获取有关sparkjoin 的更多信息。 大家可以再看看我的帖子吗?我添加代码。当我尝试加入时它会引发错误。我哪里做错了? 感谢 cmets,根据新问题更改了我的答案。现在无法运行,但应该可以运行。【参考方案2】:

您正在比较不正确的字符串值的 =。在进行此类比较之前,您应该将它们转换为时间戳。下面的代码对我有用。

顺便说一句..您使用的过滤条件没有给出您在问题中发布的结果。请再次检查。

scala> val df= Seq(("Alex","2018-03-01 00:00:00"),("Alex","2018-09-01 00:00:00"),("Bob","2018-03-01 00:00:00"),("Mark","2018-02-01 00:00:00"),("Mark","2018-07-01 00:00:00"),("Kate","2018-02-01 00:00:00")).toDF("USER_NAME","REQUEST_DATE").withColumn("REQUEST_DATE",to_timestamp('REQUEST_DATE))
df: org.apache.spark.sql.DataFrame = [USER_NAME: string, REQUEST_DATE: timestamp]

scala> df.printSchema
root
 |-- USER_NAME: string (nullable = true)
 |-- REQUEST_DATE: timestamp (nullable = true)


scala> df.show(false)
+---------+-------------------+
|USER_NAME|REQUEST_DATE       |
+---------+-------------------+
|Alex     |2018-03-01 00:00:00|
|Alex     |2018-09-01 00:00:00|
|Bob      |2018-03-01 00:00:00|
|Mark     |2018-02-01 00:00:00|
|Mark     |2018-07-01 00:00:00|
|Kate     |2018-02-01 00:00:00|
+---------+-------------------+


scala> val df2 = Seq(( "Alex","2018-01-01 00:00:00","2018-06-01 00:00:00","Active"),("Bob","2018-01-01 00:00:00","2018-02-01 00:00:00","Not Active"),("Mark","2018-01-01 00:00:00","2018-05-01 23:59:59","Active"),("Mark","2018-05-01 00:00:00","2018-08-01 23:59:59","VIP")).toDF("USER_NAME","START_SERVICE","END_SERVICE","STATUS").withColumn("START_SERVICE",to_timestamp('START_SERVICE)).withColumn("END_SERVICE",to_timestamp('END_SERVICE))
df2: org.apache.spark.sql.DataFrame = [USER_NAME: string, START_SERVICE: timestamp ... 2 more fields]

scala> df2.printSchema
root
 |-- USER_NAME: string (nullable = true)
 |-- START_SERVICE: timestamp (nullable = true)
 |-- END_SERVICE: timestamp (nullable = true)
 |-- STATUS: string (nullable = true)


scala> df2.show(false)
+---------+-------------------+-------------------+----------+
|USER_NAME|START_SERVICE      |END_SERVICE        |STATUS    |
+---------+-------------------+-------------------+----------+
|Alex     |2018-01-01 00:00:00|2018-06-01 00:00:00|Active    |
|Bob      |2018-01-01 00:00:00|2018-02-01 00:00:00|Not Active|
|Mark     |2018-01-01 00:00:00|2018-05-01 23:59:59|Active    |
|Mark     |2018-05-01 00:00:00|2018-08-01 23:59:59|VIP       |
+---------+-------------------+-------------------+----------+


scala> df.join(df2,Seq("USER_NAME"),"leftOuter").filter(" REQUEST_DATE >= START_SERVICE and REQUEST_DATE <= END_SERVICE").select(df("*"),df2("status")).show(false)
+---------+-------------------+------+
|USER_NAME|REQUEST_DATE       |status|
+---------+-------------------+------+
|Alex     |2018-03-01 00:00:00|Active|
|Mark     |2018-02-01 00:00:00|Active|
|Mark     |2018-07-01 00:00:00|VIP   |
+---------+-------------------+------+


scala>

【讨论】:

以上是关于如何在 Spark 中将两个 DataFrame 与组合列连接起来?的主要内容,如果未能解决你的问题,请参考以下文章

如何在 Spark 中将 JavaPairInputDStream 转换为 DataSet/DataFrame

如何在 Scala(Spark 2.0)中将带有字符串的 DataFrame 转换为带有 Vectors 的 DataFrame

如何在 pyspark 中将 DenseMatrix 转换为 spark DataFrame?

如何在 spark DataFrame 中将多个浮点列连接到一个 ArrayType(FloatType()) 中?

如何在 Spark DataFrame/DataSet 中将行拆分为不同的列?

如何在 Spark SQL 中将额外参数传递给 UDF?