Dataframe merge creates duplicate records in pandas (0.7.3)


【Posted】: 2012-12-07 13:14:51 【Question】:

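When I merge two CSV files of the format (date, someValue), I see some duplicate records.

If I cut the number of records in half, the problem goes away. If I double the size of both files, it gets worse. Any help is appreciated!

My code: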

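import pandas as pd

# from_csv uses the first column (date) as the index, so reset_index()
# turns it back into a regular 'date' column.
i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()

# Left-merge on the 'date' column, then sort by date.
total_df = pd.merge(i, e, right_index=False, left_index=False,
                    right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')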

(Note the duplicate records for 11/15, 11/16, 12/17, and 12/18.)

In [7]: total_df
Out[7]:
                  date  Cost  netCost
25 2012-11-15 00:00:00     1        2
26 2012-11-15 00:00:00     1        2
31 2012-11-16 00:00:00     1        2
32 2012-11-16 00:00:00     1        2
37 2012-11-17 00:00:00     1        2
2  2012-11-18 00:00:00     1        2
5  2012-11-19 00:00:00     1        2
8  2012-11-20 00:00:00     1        2
11 2012-11-21 00:00:00     1        2
14 2012-11-22 00:00:00     1        2
17 2012-11-23 00:00:00     1        2
20 2012-11-24 00:00:00     1        2
23 2012-11-25 00:00:00     1        2
29 2012-11-26 00:00:00     1        2
35 2012-11-27 00:00:00     1        2
0  2012-11-28 00:00:00     1        2
3  2012-11-29 00:00:00     1        2
6  2012-11-30 00:00:00     1        2
9  2012-12-01 00:00:00     1        2
12 2012-12-02 00:00:00     1        2
15 2012-12-03 00:00:00     1        2
18 2012-12-04 00:00:00     1        2
21 2012-12-05 00:00:00     1        2
24 2012-12-06 00:00:00     1        2
30 2012-12-07 00:00:00     1        2
36 2012-12-08 00:00:00     1        2
1  2012-12-09 00:00:00     2        2
4  2012-12-10 00:00:00     2        2
7  2012-12-11 00:00:00     2        2
10 2012-12-12 00:00:00     2        2
13 2012-12-13 00:00:00     1        2
16 2012-12-14 00:00:00     2        2
19 2012-12-15 00:00:00     2        2
22 2012-12-16 00:00:00     2        2
27 2012-12-17 00:00:00     1        2
28 2012-12-17 00:00:00     1        2
33 2012-12-18 00:00:00     1        2
34 2012-12-18 00:00:00     1        2

i.csv

date,Cost
2012-11-15 00:00:00,1
2012-11-16 00:00:00,1
2012-11-17 00:00:00,1
2012-11-18 00:00:00,1
2012-11-19 00:00:00,1
2012-11-20 00:00:00,1
2012-11-21 00:00:00,1
2012-11-22 00:00:00,1
2012-11-23 00:00:00,1
2012-11-24 00:00:00,1
2012-11-25 00:00:00,1
2012-11-26 00:00:00,1
2012-11-27 00:00:00,1
2012-11-28 00:00:00,1
2012-11-29 00:00:00,1
2012-11-30 00:00:00,1
2012-12-01 00:00:00,1
2012-12-02 00:00:00,1
2012-12-03 00:00:00,1
2012-12-04 00:00:00,1
2012-12-05 00:00:00,1
2012-12-06 00:00:00,1
2012-12-07 00:00:00,1
2012-12-08 00:00:00,1
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,1
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,1
2012-12-18 00:00:00,1

e.csv

date,netCost
2012-11-15 00:00:00,2
2012-11-16 00:00:00,2
2012-11-17 00:00:00,2
2012-11-18 00:00:00,2
2012-11-19 00:00:00,2
2012-11-20 00:00:00,2
2012-11-21 00:00:00,2
2012-11-22 00:00:00,2
2012-11-23 00:00:00,2
2012-11-24 00:00:00,2
2012-11-25 00:00:00,2
2012-11-26 00:00:00,2
2012-11-27 00:00:00,2
2012-11-28 00:00:00,2
2012-11-29 00:00:00,2
2012-11-30 00:00:00,2
2012-12-01 00:00:00,2
2012-12-02 00:00:00,2
2012-12-03 00:00:00,2
2012-12-04 00:00:00,2
2012-12-05 00:00:00,2
2012-12-06 00:00:00,2
2012-12-07 00:00:00,2
2012-12-08 00:00:00,2
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,2
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,2
2012-12-18 00:00:00,2
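
For readers who want to try the same merge on a current pandas release, here is a minimal sketch. It is not the original 0.7.3 code: it assumes `pd.read_csv` with `parse_dates` in place of `DataFrame.from_csv`, and `sort_values` in place of `sort`. The inline strings `I_CSV` and `E_CSV` are hypothetical stand-ins holding only the first three rows of each file.

```python
import io

import pandas as pd

# Hypothetical inline stand-ins for i.csv and e.csv (first three rows only).
I_CSV = """date,Cost
2012-11-15 00:00:00,1
2012-11-16 00:00:00,1
2012-11-17 00:00:00,1
"""
E_CSV = """date,netCost
2012-11-15 00:00:00,2
2012-11-16 00:00:00,2
2012-11-17 00:00:00,2
"""

# Parse the 'date' column as datetimes while reading.
i = pd.read_csv(io.StringIO(I_CSV), parse_dates=["date"])
e = pd.read_csv(io.StringIO(E_CSV), parse_dates=["date"])

# Left-merge on 'date', then sort by date.
total_df = pd.merge(i, e, on="date", how="left").sort_values("date")

# With unique keys on both sides, a left merge should keep one row per left row.
duplicate_dates = total_df[total_df["date"].duplicated(keep=False)]
print(len(i), len(total_df), len(duplicate_dates))
```

If `len(total_df)` exceeds `len(i)`, the rows in `duplicate_dates` show which dates were repeated.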

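【Comments】:

Can you upgrade to 0.10.0? In any case, I can't reproduce your problem.

This looks like a bug in 0.7.3 (and 0.8.0). I'd definitely recommend upgrading to the latest stable version.

Unfortunately I can't upgrade to 0.10.0, and yes, this does look like a bug. See the workaround posted as an answer.

What is preventing you from upgrading? Please report any issues on GitHub.

【Answer 1】: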

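This appears to be a bug in pandas 0.7.3 or numpy 1.6. It only happens when the merge column is a date (internally converted to numpy.datetime64). My workaround is to convert the dates to strings: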

from datetime import datetime

import pandas as pd

def _DatetimeToString(datetime64):
  # The value is treated as nanoseconds since the epoch; convert to seconds.
  # Note: `long` is a Python 2 builtin.
  timestamp = datetime64.astype(long)/1000000000
  return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')

i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
i['date'] = i['date'].map(_DatetimeToString)
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
# Convert the key in e as well, so both sides are merged on strings.
e['date'] = e['date'].map(_DatetimeToString)

total_df = pd.merge(i, e, right_index=False, left_index=False,
                    right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')
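
The same idea (merge on string keys instead of datetime64 keys) can be sketched for a current pandas release. This is an illustration under assumptions, not the answer's original code: it uses the `.dt.strftime` accessor in place of the hand-written converter, and the two small frames are made-up stand-ins for the data in `i.csv` and `e.csv`.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files (two rows each).
i = pd.DataFrame({"date": pd.to_datetime(["2012-11-15", "2012-11-16"]),
                  "Cost": [1, 1]})
e = pd.DataFrame({"date": pd.to_datetime(["2012-11-15", "2012-11-16"]),
                  "netCost": [2, 2]})

# Convert the datetime64 key to plain 'YYYY-MM-DD' strings on BOTH frames,
# so the merge compares strings rather than datetime64 values.
i["date"] = i["date"].dt.strftime("%Y-%m-%d")
e["date"] = e["date"].dt.strftime("%Y-%m-%d")

total_df = pd.merge(i, e, on="date", how="left").sort_values("date")
print(total_df)
```

Both key columns have to be converted; if only one side is a string, the merge keys no longer share a type.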

【Comments】:

【Answer 2】:

I ran into this issue/bug as well. I wasn't merging on a datetime series, but I did have a datetime series in the left dataframe. My solution was to deduplicate:

len(pophist)

2347

pop_merged = pd.merge(left=pophist, right=df_labels, how='left',
                      left_on='candidate', right_on='Slug', indicator=True)

pop_merged.shape[0]

3303

# Note: deduplication was needed because of how pandas handles datetime dtypes on merge.
pop_merged2 = pop_merged.drop_duplicates()

len(pop_merged2)

2347
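
When a left merge returns more rows than the left frame, one thing worth ruling out is repeated keys in the right-hand frame, which also multiply rows. The sketch below is only an illustration and makes no claim about the cause in the case above: the frames `left` and `right` and the columns `votes` and `label` are invented, while the key names `candidate` and `Slug` are borrowed from the snippet. It assumes a pandas version that supports the `validate` argument of `pd.merge`.

```python
import pandas as pd

# Invented example data: the key 'a' appears twice in the right frame.
left = pd.DataFrame({"candidate": ["a", "b", "c"], "votes": [10, 20, 30]})
right = pd.DataFrame({"Slug": ["a", "a", "b"], "label": ["x", "x", "y"]})

# Rows of the right frame whose key occurs more than once.
dup_keys = right[right["Slug"].duplicated(keep=False)]

# Each left row is repeated once per matching right row.
merged = pd.merge(left, right, how="left",
                  left_on="candidate", right_on="Slug")

# validate="many_to_one" raises MergeError if right-hand keys are not unique.
try:
    pd.merge(left, right, how="left",
             left_on="candidate", right_on="Slug", validate="many_to_one")
    right_keys_unique = True
except pd.errors.MergeError:
    right_keys_unique = False

# drop_duplicates() removes only rows that are identical in every column.
deduped = merged.drop_duplicates()
print(len(left), len(merged), len(deduped), right_keys_unique)
```

Here `drop_duplicates()` restores the original row count only because the repeated right-hand rows are identical; if they differed in any column, the extra rows would remain.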

【Comments】:
