Large file to merge. How to prevent duplicates in a pandas merge?

Posted 2023-03-11

[Title]: Large file to merge. How to prevent duplicates in merge in pandas? [Posted]: 2016-06-03 04:11:43 [Question]:

I have two data frames that produce a 50 GB file when merged, which is too large for Python to handle. I can't even perform the merge in Python and have to do it in SQLite instead.

This is what the two datasets look like.

First dataset:

        a_id c_consumed
    0    sam        oil
    1    sam      bread
    2    sam       soap
    3  harry      shoes
    4  harry        oil
    5  alice       eggs
    6  alice        pen
    7  alice    eggroll

Code to generate this dataset:

    import pandas as pd

    df = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                       'c_consumed': 'oil bread soap shoes oil eggs pen eggroll'.split()})

Second dataset:

       a_id b_received brand_id type_received       date
   0    sam       soap     bill       edibles 2011-01-01
   1    sam        oil    chris       utility 2011-01-02
   2    sam      brush      dan       grocery 2011-01-01
   3  harry        oil    chris      clothing 2011-01-02
   4  harry      shoes    nancy       edibles 2011-01-03
   5  alice       beer    peter     breakfast 2011-01-03
   6  alice      brush      dan      cleaning 2011-01-02
   7  alice       eggs     jaju       edibles 2011-01-03

Code to generate this dataset:

    df_id = pd.DataFrame({'a_id': 'sam sam sam harry harry alice alice alice'.split(),
                          'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
                          'brand_id': 'bill chris dan chris nancy peter dan jaju'.split(),
                          'type_received': 'edibles utility grocery clothing edibles breakfast cleaning edibles'.split()})
    date3 = ['2011-01-01', '2011-01-02', '2011-01-01', '2011-01-02', '2011-01-03', '2011-01-03', '2011-01-02', '2011-01-03']
    date3 = pd.to_datetime(date3)
    df_id['date'] = date3

I merge the datasets with this code:

    combined = pd.merge(df_id, df, on='a_id', how='left')

This is the resulting dataset:

      a_id b_received brand_id type_received       date c_consumed
 0     sam       soap     bill       edibles 2011-01-01        oil
 1     sam       soap     bill       edibles 2011-01-01      bread
 2     sam       soap     bill       edibles 2011-01-01       soap
 3     sam        oil    chris       utility 2011-01-02        oil
 4     sam        oil    chris       utility 2011-01-02      bread
 5     sam        oil    chris       utility 2011-01-02       soap
 6     sam      brush      dan       grocery 2011-01-01        oil
 7     sam      brush      dan       grocery 2011-01-01      bread
 8     sam      brush      dan       grocery 2011-01-01       soap
 9   harry        oil    chris      clothing 2011-01-02      shoes
10  harry        oil    chris      clothing 2011-01-02        oil
11  harry      shoes    nancy       edibles 2011-01-03      shoes
12  harry      shoes    nancy       edibles 2011-01-03        oil
13  alice       beer    peter     breakfast 2011-01-03       eggs
14  alice       beer    peter     breakfast 2011-01-03        pen
15  alice       beer    peter     breakfast 2011-01-03    eggroll
16  alice      brush      dan      cleaning 2011-01-02       eggs
17  alice      brush      dan      cleaning 2011-01-02        pen
18  alice      brush      dan      cleaning 2011-01-02    eggroll
19  alice       eggs     jaju       edibles 2011-01-03       eggs
20  alice       eggs     jaju       edibles 2011-01-03        pen
21  alice       eggs     jaju       edibles 2011-01-03    eggroll
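
The merged table has 22 rows because `a_id` repeats in both frames, so every received row for a person is paired with every consumed row for that person. As a rough sketch (not part of the original post), the size of such a merge can be estimated from per-key counts before attempting it. The helper name `estimate_merged_rows` is hypothetical, and the sketch assumes the key column has no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'c_consumed': 'oil bread soap shoes oil eggs pen eggroll'.split(),
})
df_id = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
})


def estimate_merged_rows(left, right, key):
    """Estimate the row count of a left merge on `key` without running it.

    Hypothetical helper; assumes `key` contains no missing values.
    """
    left_counts = left[key].value_counts()
    right_counts = right[key].value_counts()
    # A left key with no match on the right still yields one row per left row.
    matches = right_counts.reindex(left_counts.index).fillna(1)
    return int((left_counts * matches).sum())


n_rows = estimate_merged_rows(df_id, df, 'a_id')
print(n_rows)  # 3*3 (sam) + 2*2 (harry) + 3*3 (alice) = 22
```

On the real data, the same count multiplied by an approximate row width gives an early warning that the merge will not fit in memory.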

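I want to know whether a person consumed the product they received, and I need to keep the rest of the information because I will later check whether the outcome is influenced by brand or product type. To do this, I create a new column with the following code, which gives the result below.

Code:

    combined['output'] = (combined.groupby('a_id')
               .apply(lambda x: x['b_received'].isin(x['c_consumed']).astype('i4'))
               .reset_index(level='a_id', drop=True))

The resulting data frame is: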

       a_id b_received brand_id type_received       date c_consumed  output
  0     sam       soap     bill       edibles 2011-01-01        oil       1
  1     sam       soap     bill       edibles 2011-01-01      bread       1
  2     sam       soap     bill       edibles 2011-01-01       soap       1
  3     sam        oil    chris       utility 2011-01-02        oil       1
  4     sam        oil    chris       utility 2011-01-02      bread       1
  5     sam        oil    chris       utility 2011-01-02       soap       1
  6     sam      brush      dan       grocery 2011-01-01        oil       0
  7     sam      brush      dan       grocery 2011-01-01      bread       0
  8     sam      brush      dan       grocery 2011-01-01       soap       0
  9   harry        oil    chris      clothing 2011-01-02      shoes       1
 10  harry        oil    chris      clothing 2011-01-02        oil       1
 11  harry      shoes    nancy       edibles 2011-01-03      shoes       1
 12  harry      shoes    nancy       edibles 2011-01-03        oil       1
 13  alice       beer    peter     breakfast 2011-01-03       eggs       0
 14  alice       beer    peter     breakfast 2011-01-03        pen       0
 15  alice       beer    peter     breakfast 2011-01-03    eggroll       0
 16  alice      brush      dan      cleaning 2011-01-02       eggs       0
 17  alice      brush      dan      cleaning 2011-01-02        pen       0
 18  alice      brush      dan      cleaning 2011-01-02    eggroll       0
 19  alice       eggs     jaju       edibles 2011-01-03       eggs       1
 20  alice       eggs     jaju       edibles 2011-01-03        pen       1
 21  alice       eggs     jaju       edibles 2011-01-03    eggroll       1

As you can see, the output is wrong. What I actually want is a dataset more like this:

      a_id b_received brand_id c_consumed type_received       date  output 
 0    sam       soap     bill        oil       edibles 2011-01-01       1   
 1    sam        oil    chris        NaN       utility 2011-01-02       1   
 2    sam      brush      dan       soap       grocery 2011-01-03       0   
 3  harry        oil    chris      shoes      clothing 2011-01-04       1   
 4  harry      shoes    nancy        oil       edibles 2011-01-05       1   
 5  alice       beer    peter       eggs     breakfast 2011-01-06       0   
 6  alice      brush      dan      brush      cleaning 2011-01-07       1   
 7  alice       eggs     jaju        NaN       edibles 2011-01-08       1

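I could handle the duplicates with drop_duplicates after the merge, but the resulting data frame is too large to build in the first place.

I really need to deal with the duplicates during or before the merge, because the resulting data frame is too large for Python to handle and gives me a memory error.

Any suggestions on how to improve my merge, or on any other way to get the output column without merging?

In the end, I only need the date column and the output column to compute log odds and build a time series. But because of the file size, I am stuck at the merge step.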

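[Comments]:

Why don't you want to do it in SQLite? IMO, SQLite (or any other RDBMS) would be more efficient.

Does column a_id always match across the two data frames? It looks like you want a horizontal concatenation rather than a merge.

@MaxU I don't know SQLite well, and even in SQLite the merge of these two files took more than 7 hours, by which point my computer had frozen.

@AmitSinghParihar, I would suggest a more powerful RDBMS for this volume of data, for example MySQL (it is free). Whichever RDBMS you use, first create indexes on the join column (a_id) in both tables. Are you sure you need a left outer join (in pandas merge terms)? Would an inner join work?

@MaxU, when I tried an inner join, it dropped the rows with null values, and I need those nulls later to count the length of the column when computing probabilities.

[Answer 1]: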

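Note that I perform two groupby operations to get the output table. I add b_received to the grouping keys and take the first value in the second groupby, because all values at that grouping level are identical.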

    output = (combined
              .groupby(['a_id', 'b_received'])
              .apply(lambda x: x['b_received'].isin(x['c_consumed']).astype(int))
              .groupby(level=[0, 1])
              .first())

    output.name = 'output'

    >>> (df_id[['a_id', 'b_received', 'date']]
    ...  .merge(output.reset_index(), on=['a_id', 'b_received']))
        a_id b_received       date  output
    0    sam       soap 2011-01-01       1
    1    sam        oil 2011-01-02       1
    2    sam      brush 2011-01-01       0
    3  harry        oil 2011-01-02       1
    4  harry      shoes 2011-01-03       1
    5  alice       beer 2011-01-03       0
    6  alice      brush 2011-01-02       0
    7  alice       eggs 2011-01-03       1
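
The approach above still needs `combined`. One possible alternative, offered only as a sketch and not taken from the answer, is to skip the `a_id`-only merge entirely: deduplicate the consumed pairs, then left-merge on both the person and the product with `indicator=True`. Each received row then matches at most one consumed row, so the result keeps the same number of rows as `df_id`. Renaming `c_consumed` to `b_received` is a choice made here so both frames share key names.

```python
import pandas as pd

df = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'c_consumed': 'oil bread soap shoes oil eggs pen eggroll'.split(),
})
df_id = pd.DataFrame({
    'a_id': 'sam sam sam harry harry alice alice alice'.split(),
    'b_received': 'soap oil brush oil shoes beer brush eggs'.split(),
    'brand_id': 'bill chris dan chris nancy peter dan jaju'.split(),
})
df_id['date'] = pd.to_datetime([
    '2011-01-01', '2011-01-02', '2011-01-01', '2011-01-02',
    '2011-01-03', '2011-01-03', '2011-01-02', '2011-01-03',
])

# One row per (person, product) pair that was actually consumed.
consumed = df.rename(columns={'c_consumed': 'b_received'}).drop_duplicates()

# Merge on BOTH keys; validate='m:1' raises if the right side has duplicate keys,
# which guards against the row explosion seen with the a_id-only merge.
flagged = df_id.merge(consumed, on=['a_id', 'b_received'],
                      how='left', indicator=True, validate='m:1')

# '_merge' is 'both' when the received product also appears as consumed.
flagged['output'] = (flagged['_merge'] == 'both').astype(int)
result = flagged.drop(columns='_merge')
print(result[['a_id', 'b_received', 'date', 'output']])
```

On the sample data this yields the same `output` column as the groupby version, while the brand and date columns stay attached to each received row.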

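[Comments]:

Thank you for the solution. It works on the merged data file. However, my problem is also the size of the file produced by the merge: it cannot be held in memory, and Python raises a memory error.

The output computation you wrote assumes that the merged file ('combined') exists. When I tried to apply merge to combine the data frames, I realized it won't complete. Is there any way I can get the final df without doing the merge?

My constraint is creating the 'combined' data frame. I built it for illustration, but my actual code cannot merge the two data frames (df, df_id). Merging them gives me a memory error.

Your code is perfect as it stands, but creating output uses the 'combined' data frame, which I cannot build because of its size. So I can't use the output code, and without it the second piece of code doesn't work either.

I tried splitting the data down to 1/5. The merge succeeded, but I got a memory error again when running the two pieces of code above. The final merged file is 50 GB (that is after merging with SQLite; in pandas I can't even get it). I have 16 GB of RAM and a quad-core CPU.

Let us continue this discussion in chat.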
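
For inputs that do not fit in memory at all, one option is to stream the received table in chunks and flag each chunk against the deduplicated consumed pairs, keeping only the `date` and `output` columns the question ultimately needs. This is a sketch under stated assumptions, not something proposed in the thread: the data is assumed to live in CSV files, the `io.StringIO` objects stand in for those files, `chunksize=3` is chosen only to fit the toy data, and the deduplicated consumed table is assumed to fit in memory.

```python
import io
import pandas as pd

# Stand-ins for large CSV files on disk (hypothetical; real code would pass paths).
consumed_csv = io.StringIO(
    "a_id,c_consumed\n"
    "sam,oil\nsam,bread\nsam,soap\nharry,shoes\nharry,oil\n"
    "alice,eggs\nalice,pen\nalice,eggroll\n"
)
received_csv = io.StringIO(
    "a_id,b_received,date\n"
    "sam,soap,2011-01-01\nsam,oil,2011-01-02\nsam,brush,2011-01-01\n"
    "harry,oil,2011-01-02\nharry,shoes,2011-01-03\n"
    "alice,beer,2011-01-03\nalice,brush,2011-01-02\nalice,eggs,2011-01-03\n"
)

# Assumption: the deduplicated (person, product) pairs fit in memory.
consumed = (pd.read_csv(consumed_csv)
              .rename(columns={'c_consumed': 'b_received'})
              .drop_duplicates())

pieces = []
# Read the received table a few rows at a time instead of all at once.
for chunk in pd.read_csv(received_csv, parse_dates=['date'], chunksize=3):
    flagged = chunk.merge(consumed, on=['a_id', 'b_received'],
                          how='left', indicator=True)
    flagged['output'] = (flagged['_merge'] == 'both').astype(int)
    # Keep only what the log-odds time series needs.
    pieces.append(flagged[['date', 'output']])

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Because each chunk is reduced to two columns before it is stored, peak memory is driven by the chunk size and the consumed table rather than by the 50 GB cross product.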
