How to check differences in rows belonging to two dataframes
Posted: 2016-04-09 08:40:46
Question: I have two dataframes that represent the same individuals at two different points in time. For each row, I would like to know whether any of the five (fixed) var columns changed between the two dataframes.
Before:
+--+------+------+------+------+------+------+
|id| sport| var1| var2| var3| var4| var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234| | | | |
| 2|soccer| null| null| null| null| null|
| 3|soccer|330101| | | | |
| 4|soccer| null| null| null| null| null|
| 5|soccer| null| null| null| null| null|
| 6|soccer| null| null| null| null| null|
| 7|soccer| null| null| null| null| null|
| 8|soccer|330024|330401| | | |
| 9|soccer|330055|330106| | | |
|10|soccer| null| null| null| null| null|
|11|soccer|390027| | | | |
|12|soccer| null| null| null| null| null|
|13|soccer|330101| | | | |
|14|soccer|330059| | | | |
|15|soccer| null| null| null| null| null|
|16|soccer|140242|140281| | | |
|17|soccer|330214| | | | |
|18|soccer| | | | | |
|19|soccer|330055|330196| | | |
|20|soccer|210022| | | | |
+--+------+------+------+------+------+------+
After:
+--+------+------+------+------+------+------+
|id| sport| var1| var2| var3| var4| var5|
+--+------+------+------+------+------+------+
| 1|soccer|330234| | | | |
| 2|soccer| null| null| null| null| null|
| 3|soccer|330101| | | | |
| 4|soccer| null| null| null| null| null|
| 5|soccer| null| null| null| null| null|
| 6|soccer| null| null| null| null| null|
| 7|soccer| null| null| null| null| null|
| 8|soccer| null| null| null| null| null|
| 9|soccer|330106| | | | |
|10|soccer| null| null| null| null| null|
|11|soccer|390027| | | | |
|12|soccer| null| null| null| null| null|
|13|soccer| null| null| null| null| null|
|14|soccer|330128|330331|330106|330059| |
|15|soccer| null| null| null| null| null|
|16|soccer|140242|140281|140010| | |
|17|soccer|330214| | | | |
|18|soccer| null| null| null| null| null|
|19|soccer|330196| | | | |
|20|soccer|210022| | | | |
+--+------+------+------+------+------+------+
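I know how to scan for differences between the columns of a single row, but I have no idea how to compare rows across two different dataframes.
The ideal output would be: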
+--+------+------+
|id| sport| diff|
+--+------+------+
| 1|soccer| 0|
| 2|soccer| 0|
| 3|soccer| 0|
| 4|soccer| 0|
| 5|soccer| 0|
| 6|soccer| 0|
| 7|soccer| 0|
| 8|soccer| 1|
| 9|soccer| 1|
|10|soccer| 0|
|11|soccer| 0|
|12|soccer| 0|
|13|soccer| 1|
|14|soccer| 1|
|15|soccer| 0|
|16|soccer| 1|
|17|soccer| 0|
|18|soccer| 0|
|19|soccer| 1|
|20|soccer| 0|
+--+------+------+
Answer 1: Do you mean something like this? Let's start with some example data:
// Assumption: a SparkSession named `spark` is in scope (as in spark-shell);
// its implicits provide toDF and the $"..." column syntax.
import spark.implicits._
val before = Seq(
  (1, "soccer", Some(1), Some(2), Some(3), Some(4), None),
  (2, "soccer", None, Some(0), None, None, Some(0)),
  (3, "soccer", None, None, None, None, None)
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
val after = Seq(
  (1, "soccer", Some(1), Some(2), Some(3), Some(4), None), // Zero diffs
  (2, "soccer", Some(1), Some(0), None, None, Some(0)), // One diff
  (3, "soccer", Some(1), Some(1), Some(1), Some(1), Some(1)) // Five diffs
).toDF("id", "sport", "var1", "var2", "var3", "var4", "var5")
Generate the expressions that compute the differences:
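import org.apache.spark.sql.functions.{col, lit, not}
// Extract the var columns (everything after id and sport)
val varCols = before.columns.drop(2)
// Generate one expression per column:
// CAST(NOT(before.var1 <=> after.var1) AS INT)
// Note the braces in ${c}_ne: a bare $c_ne would refer to an undefined name c_ne
val equalsExprs = varCols.map(
  c => not(col(s"before.$c") <=> col(s"after.$c")).cast("int").alias(s"${c}_ne"))
// Sum the per-column flags into a single diff count
val diff = equalsExprs.foldLeft(lit(0))(_ + _).alias("diff")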
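Because it uses the null-safe equality operator <=>, it treats:
- two NULLs as equal
- any value and NULL as not equal
- two non-NULL values according to standard type equality
Join the two dataframes and select the expression: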
val diffs = before.as("before").join(after.as("after"), Seq("id", "sport"))
.select($"id", $"sport", diff)
diffs.show
// +---+------+----+
// | id| sport|diff|
// +---+------+----+
// | 1|soccer| 0|
// | 2|soccer| 1|
// | 3|soccer| 5|
// +---+------+----+
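To make the NULL handling listed above concrete, here is a minimal plain-Scala model of the per-row count. It is only an illustration and not part of the original answer: Option[Int] stands in for a nullable column, and the names NullSafeDiffSketch and countDiffs are hypothetical.

```scala
// Plain-Scala model of the per-row diff count; no Spark is needed to run it.
// Option[Int] stands in for a nullable column: None plays the role of NULL.
object NullSafeDiffSketch {
  type Row = Seq[Option[Int]]

  // Option's == behaves like null-safe equality:
  // None == None is true, Some(x) == None is false,
  // and Some(x) == Some(y) compares x with y.
  def countDiffs(before: Row, after: Row): Int =
    before.zip(after).count { case (b, a) => b != a }

  def main(args: Array[String]): Unit = {
    // The var columns of id 2 from the example data
    val before: Row = Seq(None, Some(0), None, None, Some(0))
    val after: Row  = Seq(Some(1), Some(0), None, None, Some(0))
    println(countDiffs(before, after)) // prints 1
  }
}
```

The Spark expression does the same thing column by column: each `not(a <=> b).cast("int")` contributes 0 or 1, and the fold adds them up.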
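Comments:
I wonder whether it is possible to write an expression that not only counts the differences but also tells whether they are additions to or subtractions from the current state. Say that before I had Some(1), Some(2), None, None, None
and after something like Some(1), Some(2), Some(3), Some(4), None
versus after something like None, None, None, None, None
... both are changes, but in the first case it is +2 while in the second it is -2.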
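One possible way to get a signed figure like the one asked about in this comment, sketched under assumptions rather than taken from the original answer: count the non-NULL values on each side and subtract. The object SignedChangeSketch, the helper signedChange, and the output column name net_change are hypothetical; the aliases before and after are the ones used in the join above.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

object SignedChangeSketch {
  // Hypothetical helper: net change in the number of non-NULL values.
  // Each column contributes +1 if it was NULL before and is filled after,
  // -1 if it was filled before and is NULL after, and 0 otherwise.
  // Assumes the joined frames are aliased "before" and "after".
  def signedChange(varCols: Seq[String]): Column =
    varCols
      .map { c =>
        col(s"after.$c").isNotNull.cast("int") -
          col(s"before.$c").isNotNull.cast("int")
      }
      .foldLeft(lit(0))(_ + _)
      .alias("net_change")
}
```

It could be selected next to diff, for example `.select($"id", $"sport", diff, SignedChangeSketch.signedChange(varCols.toSeq))`. For the two cases in the comment this yields +2 and -2. Note that it only measures how many values were gained or lost: replacing one non-NULL value with another contributes 0 here, although diff still counts it as a change.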