How to compare two dates from two DF without matching index?

Posted 2023-04-15

【Title】: How to compare two dates from two DF without matching index? 【Posted】: 2019-01-23 21:50:47 【Question】:

df1
  USERID    DATE
     1       1/1/2018
     1       1/2/2018
     1       1/3/2018
     2       1/2/2018
     2       1/3/2018
     3       1/3/2018

df2
  USERID    DATE
     1       1/1/2018        
     2       1/2/2018         
     3       1/3/2018

I want to compare each DATE in df1 against the DATE values in df2 for the same USERID, to determine whether each row of df1 also exists in df2.

Result:
  USERID      DATE       Exists
     1       1/1/2018     True
     1       1/2/2018     False
     1       1/3/2018     False
     2       1/2/2018     True
     2       1/3/2018     False
     3       1/3/2018     True

I want to do the equivalent of np.where((df1['DATE'] == df2['DATE']), True, False), but that currently raises the error: Can only compare identically-labeled Series objects
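The error can be reproduced with the sample data above. A minimal sketch, assuming both frames use the default integer index and store DATE as plain strings (the question does not state either): df1 has six rows and df2 has three, so the two DATE Series carry different index labels and `==` refuses to compare them element by element. One way to avoid positional comparison entirely is to test whether each (USERID, DATE) pair of df1 appears among the pairs of df2. The variable names `error_message` and `exists` are illustrative.

```python
import pandas as pd

# Sample data from the question (DATE kept as strings, which is an assumption)
df1 = pd.DataFrame({
    "USERID": [1, 1, 1, 2, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018", "1/2/2018", "1/3/2018", "1/3/2018"],
})
df2 = pd.DataFrame({
    "USERID": [1, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018"],
})

# Element-wise comparison needs identical index labels; 6 rows vs 3 rows fails
try:
    df1["DATE"] == df2["DATE"]
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Membership test on (USERID, DATE) pairs, independent of row position
keys1 = pd.MultiIndex.from_frame(df1[["USERID", "DATE"]])
keys2 = pd.MultiIndex.from_frame(df2[["USERID", "DATE"]])
exists = keys1.isin(keys2)

print(error_message)
print(df1.assign(Exists=exists))
```

Because the check is by key rather than by position, it does not matter that the two frames have different lengths or row orders.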

【Answer 1】:

You can use merge:

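import pandas as pd

# Mark every row of df2, then left-join so that all rows of df1 are kept
df2['Exists'] = True
df3 = pd.merge(df1, df2, on=['USERID', 'DATE'], how='left')

# Rows of df1 with no match in df2 get NaN in 'Exists'; convert to a boolean
df3['Exists'] = df3['Exists'].notna()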

  USERID    DATE    Exists
0   1   1/1/2018    True
1   1   1/2/2018    False
2   1   1/3/2018    False
3   2   1/2/2018    True
4   2   1/3/2018    False
5   3   1/3/2018    True
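A variant of the same idea, sketched under the assumption that the frames look like the sample data in the question: pandas' `merge` accepts `indicator=True`, which adds a `_merge` column saying whether each row came from the left frame only or from both. This avoids adding a helper column to df2. The `drop_duplicates` call is a precaution that is not in the original answer; it keeps a repeated key in df2 from duplicating rows of df1.

```python
import pandas as pd

# Sample data from the question (DATE kept as strings, which is an assumption)
df1 = pd.DataFrame({
    "USERID": [1, 1, 1, 2, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018", "1/2/2018", "1/3/2018", "1/3/2018"],
})
df2 = pd.DataFrame({
    "USERID": [1, 2, 3],
    "DATE": ["1/1/2018", "1/2/2018", "1/3/2018"],
})

# Left join keeps every df1 row; '_merge' is 'both' when the key is also in df2
merged = df1.merge(
    df2[["USERID", "DATE"]].drop_duplicates(),
    on=["USERID", "DATE"],
    how="left",
    indicator=True,
)

# Turn the indicator into the boolean column asked for in the question
merged["Exists"] = merged["_merge"] == "both"
result = merged.drop(columns="_merge")

print(result)
```

If DATE is stored in different string formats in the two frames, the keys will not match; converting both columns with `pd.to_datetime` before merging is one way to handle that.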

【Answer 2】:

It looks like you are trying to perform a left join and then add a new column showing whether the df2 side is null.

Below is a PySpark example, adapted from this SO answer and this post:

from pyspark.sql import functions as F

# Assumes df1 and df2 are Spark DataFrames with columns USERID and DATE.
# Alias the DataFrames to prevent column name collisions after the join.
df1_alias = df1.alias("first")
df2_alias = df2.alias("second")

# Left join on first.USERID = second.USERID and first.DATE = second.DATE
join_cond = (F.col("first.USERID") == F.col("second.USERID")) & (F.col("first.DATE") == F.col("second.DATE"))
result = df1_alias.join(df2_alias, join_cond, how="left")

# Create a column called 'Exists' that is true when a matching df2 row was found
result = result.withColumn("Exists", F.col("second.USERID").isNotNull())

# Display just the df1 values and the Exists column
result.select(F.col("first.USERID"), F.col("first.DATE"), F.col("Exists")).show()
