Difference between pd.concat() and pd.merge() and why do I get wrong output?

Posted 2023-03-11

【Posted】: 2020-03-13 14:32:32 【Question】:

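I'm having trouble joining two dataframes. I usually use pd.merge(), but in this case I get a ValueError that suggests using pd.concat() instead. Here is my situation:

I have two dataframes, df1 and df2; their indexes are shown below.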

In [15]: df1.index
Out[15]: 
DatetimeIndex(['2019-11-03 00:00:00', '2019-11-03 01:00:00',
               '2019-11-03 02:00:00', '2019-11-03 03:00:00',
               ...
               '2019-11-12 11:00:00', '2019-11-12 12:00:00',
               '2019-11-12 13:00:00', '2019-11-12 14:00:00'],
              dtype='datetime64[ns]', name='datetime', length=231, freq=None)


In [16]: df2.index
Out[16]: 
Index(['2019-11-03 00:00:00', '2019-11-04 00:00:00',
       '2019-11-05 00:00:00', '2019-11-06 00:00:00',
       '2019-11-07 00:00:00', '2019-11-08 00:00:00',
       '2019-11-09 00:00:00', '2019-11-10 00:00:00',
       '2019-11-11 00:00:00', '2019-11-12 00:00:00'],
      dtype='object', name='datetime')

When I try to merge the two dataframes with merged = pd.merge(df1, df2, left_on=['datetime'], right_on=['datetime'], how='left'), I get the message ValueError: You are trying to merge on datetime64[ns] and object columns. If you wish to proceed you should use pd.concat

Let me also show the two dataframes themselves.

temperatures = [c for c in df1 if c.startswith('temp')]
df1['temp_mean']=df1[temperatures].mean(axis=1)

In [6]: df1.head(3)
Out[6]:
                    location  temperature1  temperature2  wind  rain  temp_mean
datetime                                           
2019-10-03 00:00:00       HK        18.72          18.78    SW   0.0      18.75
2019-10-03 01:00:00       HK        18.63          18.67    SW   0.1      18.65
2019-10-03 02:00:00       HK        18.29          18.31    SW   0.3      18.30

In [7]: df2
Out[7]: 
                       values
datetime                     
2019-11-03 00:00:00  0.154286
2019-11-04 00:00:00 -5.094286
2019-11-05 00:00:00  1.432857
2019-11-06 00:00:00  0.227143
2019-11-07 00:00:00  0.160000
2019-11-08 00:00:00  1.300000
2019-11-09 00:00:00  0.308571
2019-11-10 00:00:00  0.442857
2019-11-11 00:00:00  0.241429
2019-11-12 00:00:00       NaN

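By merging the two dataframes, I expect the 'values' column of df2 to be appended as the last column of df1, filled with NaN wherever the time != '00:00:00' and holding the df2 values wherever the time == '00:00:00'. Since I got the error and the suggestion to use pd.concat(), I typed concated = pd.concat([df1, df2], axis=1, join='outer', ignore_index=False). The output below does contain the 'values' column, but it is completely empty (NaN at every timestamp).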

In [17]: concated.head(3)
Out[17]:
                    location  temperature1  temperature2  wind  rain  temp_mean  \
datetime                                           
2019-10-03 00:00:00       HK        18.72          18.78    SW   0.0      18.75
2019-10-03 01:00:00       HK        18.63          18.67    SW   0.1      18.65
2019-10-03 02:00:00       HK        18.29          18.31    SW   0.3      18.30

                      values
datetime                                           
2019-10-03 00:00:00      NaN
2019-10-03 01:00:00      NaN
2019-10-03 02:00:00      NaN

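I don't understand what I am doing wrong here or how I can get the result I want.

First, I don't understand why pd.merge() cannot handle my dataframes, and second, I don't understand why pd.concat() does not pick up the values.

Any help would be much appreciated, so thanks in advance.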

【Answer 1】:

I believe you need merge with left_index=True and right_index=True, because once df2's index is converted, both DataFrames are keyed by a matching DatetimeIndex:

# Convert df2's string index to a DatetimeIndex so both indexes share a dtype
df2.index = pd.to_datetime(df2.index)
df = pd.merge(df1, df2, left_index=True, right_index=True)
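
To make this concrete, here is a minimal, self-contained sketch of the same approach. The two small frames are hypothetical stand-ins for df1 (a real DatetimeIndex) and df2 (an index of strings), and the numbers are illustrative only. Note that pd.merge defaults to how='inner', which keeps only the timestamps present in both frames; how='left' is passed here on the assumption that every hourly row of df1 should be kept, with NaN wherever df2 has no entry, as the question describes.

```python
import pandas as pd

# Hypothetical stand-in for df1: hourly rows with a real DatetimeIndex.
df1 = pd.DataFrame(
    {"temp_mean": [18.75, 18.65, 18.30, 18.10]},
    index=pd.DatetimeIndex(
        [
            "2019-11-03 00:00:00",
            "2019-11-03 01:00:00",
            "2019-11-04 00:00:00",
            "2019-11-04 01:00:00",
        ],
        name="datetime",
    ),
)

# Hypothetical stand-in for df2: daily rows whose index holds plain strings.
df2 = pd.DataFrame(
    {"values": [0.154286, -5.094286]},
    index=pd.Index(
        ["2019-11-03 00:00:00", "2019-11-04 00:00:00"],
        dtype=object,
        name="datetime",
    ),
)

# The labels look alike when printed, but the index dtypes differ,
# so a Timestamp label never equals a string label.
print(df1.index.dtype, df2.index.dtype)

# Convert the string index to a DatetimeIndex, then merge on the indexes.
df2.index = pd.to_datetime(df2.index)
merged = pd.merge(df1, df2, left_index=True, right_index=True, how="left")

# Rows at 00:00 carry the df2 value; rows at 01:00 get NaN.
print(merged)
```

With how='left' dropped (the default inner join), only the two 00:00 rows would remain in this sketch.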

【Comments】:

Hmm... yes! It works. Thank you so much! So when the datetime is the index, I have to pass specific parameters to the method.

【Answer 2】:

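You are trying to merge on datetime keys that have different dtypes.

df1 : dtype='datetime64[ns]'

df2 : dtype='object'

Solution: convert one side to the other's dtype, using either strftime (convert to strings; .dt.strftime on a column, or .strftime directly on a DatetimeIndex) or pd.to_datetime (convert to the datetime dtype).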
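
Both directions of that conversion can be sketched as follows. This is an illustration under assumptions, not the asker's actual data: the frames left and right_raw are hypothetical, and the key is held in an ordinary 'datetime' column rather than in the index so that on='datetime' applies directly.

```python
import pandas as pd

# Hypothetical frames: the key is datetime64 on the left, plain strings on the right.
left = pd.DataFrame(
    {
        "datetime": pd.to_datetime(["2019-11-03 00:00:00", "2019-11-03 01:00:00"]),
        "temp_mean": [18.75, 18.65],
    }
)
right_raw = pd.DataFrame(
    {
        "datetime": ["2019-11-03 00:00:00"],
        "values": [0.154286],
    }
)

# Option A: parse the string key into datetimes, then merge on the column.
right_dt = right_raw.assign(datetime=pd.to_datetime(right_raw["datetime"]))
merged_a = left.merge(right_dt, on="datetime", how="left")

# Option B: format the datetime key as strings that match the right-hand format.
left_str = left.assign(datetime=left["datetime"].dt.strftime("%Y-%m-%d %H:%M:%S"))
merged_b = left_str.merge(right_raw, on="datetime", how="left")

print(merged_a)
print(merged_b)
```

Option A is usually the more convenient choice, since the key keeps its datetime semantics (resampling, time-based slicing); option B only works if the format string reproduces the other side's strings exactly.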
