Mongodb插入50M文档而不重复的最佳方法

Posted 2023-04-18

技术标签:

【中文标题】Mongodb插入50M文档而不重复的最佳方法【英文标题】：Mongodb best way to insert 50M documents without duplicate 【发布时间】：2019-12-22 12:25:01 【问题描述】：

我需要创建一个包含超过 5000 万份文档的数据库。我使用 nodejs 和运行 Ubuntu 18.04 | 的 Mongodb 服务器12GO 内存 1333Mhz | 8核16线程。

我尝试了几种不同的性能结果。不幸的是，没有任何结论！

1) 使用 mongoimport csv ：最快的方法，总共 20 秒，但没有重复检查。

2) 每一行，find 然后insert 如果不存在：不可能重复，但速度很慢 (See log output stats for this method)

function insertMongo(entry) 
  return new Promise(resolve => 
    try 
      collection.insertOne(entry, function(err, result) 
        insertCount++;
        insertTotalCount++;
        resolve(true);
      );
     catch(e) 
      resolve(false);
    
  );


function findMongo(entry) 
  return new Promise(resolve => 
    try 
      collection.find( entry ).toArray(function(err, docs) 
        assert.equal(err, null);
        if (docs[0] == null) 
          findCount++;
          resolve(true);
         else 
          resolve(false);
        
      );
     catch(e) 
      resolve(false);
    
  );

2) 每行，更新宽度UPSET：不可能重复，但速度很慢 (See log output stats for this method)

你觉得日志中的速度正常吗？有没有办法在数据量很大的情况下更快？

我看过很多关于这个主题的论坛，没有任何结论。

【问题讨论】：

是什么让文档“重复”？所有数据均由第三方提供，数据可能相同。如果它们严格相同，则必须删除重复项，如果只有 2 个重复字段，我将进行更新以合并两者。 【参考方案1】：

在这种情况下，您不应使用insertOne()，而应使用insertMany() 函数。阅读有关 insertMany here 的官方文档，并查找 Unordered Inserts 以了解如何处理重复项。

【讨论】：

如果我不确定这些值是否重复，我不能使用 InsertMany。即使它们是局部的。【参考方案2】：

为什么不在唯一性很重要的字段上添加unique index，然后只进行批量插入？

如果有任何失败，请跳过它并继续。您还将通过这种方式生成重复项列表。

【讨论】：

【参考方案3】：

尝试使用 MongoDB 的 Bulk API。https://docs.mongodb.com/manual/reference/method/Bulk/

【讨论】：

以上是关于Mongodb插入50M文档而不重复的最佳方法的主要内容，如果未能解决你的问题，请参考以下文章