Remove duplicate values in MongoDB

Posted 2023-03-16

【Title】: Remove duplicate values in mongodb 【Posted】: 2016-03-28 15:12:40 【Question】:

I am learning MongoDB with Python and Tornado. I have a MongoDB collection, and when I run

db.cal.find()

I get:

{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "Period" : "10-2015",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015",
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015",
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015",
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr": [
        "23-10-2015",
        "27-10-2015",
        "23-10-2015",
        "27-10-2015"
    ]
}

I need an operation that removes the duplicate values from Booked, NA, AOs and AOr. The final result should be:

{
    "Pid" : "5652f92761be0b14889d9854",
    "Registration" : "TN 56 HD 6766",
    "Vid" : "56543ed261be0b0a60a896c9",
    "AOs": [
        "14-10-2015",
        "15-10-2015",
        "18-10-2015"
    ],
    "Booked": [
        "5-10-2015",
        "7-10-2015",
        "8-10-2015"
    ],
    "NA": [
        "1-10-2015",
        "2-10-2015",
        "3-10-2015",
        "4-10-2015"
    ],
    "AOr": [
        "23-10-2015",
        "27-10-2015"
    ]
}

How can I achieve this in MongoDB?

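【Answer 1】:

First of all, you cannot use the "dropDups" syntax here: it was deprecated in MongoDB 2.6 and removed in MongoDB 3.0, so it is not even available.

To remove the duplicates from each list, you can use Python's set class.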

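import pymongo

# Legacy Bulk API (removed in PyMongo 4.0)
fields = ['Booked', 'NA', 'AOs', 'AOr']
client = pymongo.MongoClient()
db = client.test
collection = db.cal
bulk = collection.initialize_ordered_bulk_op()
count = 0
for document in collection.find():
    # set() removes duplicates but does not preserve the original order
    update = {field: list(set(document[field])) for field in fields}
    bulk.find({'_id': document['_id']}).update_one({'$set': update})
    count = count + 1
    # Send the queued updates every 200 documents
    if count % 200 == 0:
        bulk.execute()
        bulk = collection.initialize_ordered_bulk_op()

# Send the remaining updates; executing an empty bulk raises InvalidOperation
if count % 200 != 0:
    bulk.execute()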

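MongoDB 3.2 deprecates Bulk() and its associated methods and provides the .bulkWrite() method instead. PyMongo exposes it as bulk_write() (available since PyMongo 3.0). The first thing to do before using this method is to import the UpdateOne class.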

from pymongo import UpdateOne

# Reuses `fields` and `collection` from the previous snippet
requests = []  # list of write operations
for document in collection.find():
    update = {field: list(set(document[field])) for field in fields}
    requests.append(UpdateOne({'_id': document['_id']}, {'$set': update}))
# bulk_write() raises InvalidOperation when given an empty list
if requests:
    collection.bulk_write(requests)

Both snippets give the same expected result:

{'AOr': ['27-10-2015', '23-10-2015'],
 'AOs': ['15-10-2015', '14-10-2015', '18-10-2015'],
 'Booked': ['7-10-2015', '5-10-2015', '8-10-2015'],
 'NA': ['1-10-2015', '4-10-2015', '3-10-2015', '2-10-2015'],
 'Period': '10-2015',
 'Pid': '5652f92761be0b14889d9854',
 'Registration': 'TN 56 HD 6766',
 'Vid': '56543ed261be0b0a60a896c9',
 '_id': ObjectId('567f808fc6e11b467e59330f')}
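
Two caveats about the snippets above: set does not preserve element order (which is why the dates in the result appear shuffled), and an update is queued for every document, including those that contain no duplicates. The following is a minimal sketch, not part of the original answer, that keeps the first occurrence of each value in order, skips documents without duplicates, and groups the remaining updates into batches. The helper names `unique_in_order`, `plan_updates` and `batched` are hypothetical, and the PyMongo call is shown only in a comment because it needs a running server.

```python
from itertools import islice

FIELDS = ["Booked", "NA", "AOs", "AOr"]


def unique_in_order(values):
    """Drop repeated values, keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(values))


def plan_updates(documents, fields=FIELDS):
    """Yield (filter, update) pairs for documents that contain duplicates."""
    for document in documents:
        changes = {}
        for field in fields:
            original = document.get(field, [])
            deduped = unique_in_order(original)
            if len(deduped) < len(original):
                changes[field] = deduped
        if changes:
            yield {"_id": document["_id"]}, {"$set": changes}


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Assumed usage with PyMongo (requires a running MongoDB server):
#   from pymongo import UpdateOne
#   for batch in batched(plan_updates(collection.find()), 200):
#       collection.bulk_write([UpdateOne(f, u) for f, u in batch])

# Hypothetical sample document, shortened from the one in the question
sample = {
    "_id": 1,
    "AOs": ["14-10-2015", "15-10-2015", "18-10-2015",
            "14-10-2015", "15-10-2015", "18-10-2015"],
    "Booked": ["5-10-2015", "7-10-2015"],
    "NA": [],
    "AOr": ["23-10-2015", "27-10-2015", "23-10-2015", "27-10-2015"],
}
for query, update in plan_updates([sample]):
    print(query, update)
```

Only AOs and AOr appear in the generated `$set`, because Booked and NA contain no duplicates in this sample. The approach assumes the array values are hashable, which holds for the date strings used here.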

【Answer 2】:

Working solution

I created a working JavaScript-based solution that can be used in the mongo shell:

var codes = ["AOs", "Booked", "NA", "AOr"]

// Use bulk operations for efficiency
var bulk = db.dupes.initializeUnorderedBulkOp()
var pending = 0

db.dupes.find().forEach(
  function(doc) {

    // Needed to prevent unnecessary operations
    var changed = false
    codes.forEach(
      function(code) {
        var values = doc[code]
        var uniq = []

        for (var i = 0; i < values.length; i++) {
          // A value that is not yet in "uniq" is seen for the
          // first time, so keep it
          if (uniq.indexOf(values[i]) == -1) {
            uniq.push(values[i])
          }
        }

        doc[code] = uniq

        if (uniq.length < values.length) {
          changed = true
        }
      }
    )

    // Replace the document only if something was changed
    if (changed) {
      bulk.find({"_id": doc._id}).replaceOne(doc)
      pending++
    }
  }
)

// Apply all changes (executing an empty bulk throws an error)
if (pending > 0) {
  bulk.execute()
}

Resulting document for the sample input:

replset:PRIMARY> db.dupes.find().pretty()
{
  "_id" : ObjectId("567931aefefcd72d0523777b"),
  "Pid" : "5652f92761be0b14889d9854",
  "Registration" : "TN 56 HD 6766",
  "Vid" : "56543ed261be0b0a60a896c9",
  "Period" : "10-2015",
  "AOs" : [
    "14-10-2015",
    "15-10-2015",
    "18-10-2015"
  ],
  "Booked" : [
    "5-10-2015",
    "7-10-2015",
    "8-10-2015"
  ],
  "NA" : [
    "1-10-2015",
    "2-10-2015",
    "3-10-2015",
    "4-10-2015"
  ],
  "AOr" : [
    "23-10-2015",
    "27-10-2015"
  ]
}

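Using an index with `dropDups`

This simply does not work. First, as of version 3.0 this option no longer exists. Since 3.2 has already been released, we should find a portable way.

Second, even with dropDups, the documentation clearly states:

dropDups boolean: MongoDB indexes only the first occurrence of a key and removes all documents from the collection that contain subsequent occurrences of that key.

So if another document had the same value as an earlier one in one of the billing codes, the entire document would be deleted.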
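
To make this point concrete, here is a small pure-Python simulation of the behaviour described in the quoted documentation text: the first document for each key value is kept and every later document with the same value is discarded as a whole. `simulate_drop_dups` is a hypothetical helper written for illustration, not a MongoDB API, and it deliberately ignores details such as how array fields are indexed.

```python
def simulate_drop_dups(documents, key):
    """Keep the first document per `key` value, drop later ones entirely."""
    seen = set()
    kept = []
    for document in documents:
        value = document[key]
        if value in seen:
            continue  # the whole document is discarded, not just a value
        seen.add(value)
        kept.append(document)
    return kept


# Hypothetical sample data: two documents sharing the same Registration
docs = [
    {"_id": 1, "Registration": "TN 56 HD 6766",
     "Booked": ["5-10-2015", "5-10-2015"]},
    {"_id": 2, "Registration": "TN 56 HD 6766",
     "Booked": ["7-10-2015"]},
]
remaining = simulate_drop_dups(docs, "Registration")
print([d["_id"] for d in remaining])  # the second document is gone
print(remaining[0]["Booked"])         # duplicates inside the array remain
```

The simulation shows both problems at once: a document is lost, and the duplicate entries inside the surviving array are untouched.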

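【Comments】:

You can use one of the methods shown in Remove Duplicates from JavaScript Array to remove the duplicates from these arrays, then use the $set operator with bulk operations to update the documents. Also note that MongoDB 3.2 deprecates Bulk() and its associated methods.

There is neither jQuery nor ECMAScript 6 in the shell, right? ;) I do not see what the disadvantage of this way of identifying uniqueness would be. But the point about 3.2 is a good one, I will add a solution for that as well.

【Answer 3】: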

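Assuming you want to remove duplicate dates from the collection, you can add a unique index with the dropDups: true option:

db.bill_codes.ensureIndex({"fieldName": 1}, {unique: true, dropDups: true})

For more information, see: db.collection.ensureIndex() - MongoDB Manual 3.0

Note: back up your database first, in case this does not do exactly what you expect.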

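【Comments】:

This only removes other documents that have exactly the same value in one of the fields.

I get the error: { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "errmsg" : "exception: bad index key pattern { Registration: \"TN 56 HD 6766\", Pid : \"5652f92761be0b14889d9854\" }: Unknown index plugin 'TN 56 HD 676'", "code" : 67, "ok" : 0 }

You have to specify the key of your collection to index, not a field name together with its value as in a query criterion.

This is not only outdated but plainly wrong, and without the advice to make a backup it would be outright dangerous. The deprecated dropDups deleted all documents that happened to have the same key value in the index, not the duplicate values.

【Answer 4】: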

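Have you tried "distinct()"?

Link: https://docs.mongodb.org/v3.0/reference/method/db.collection.distinct/

Specify a query with distinct

The following example returns the distinct values of the field sku, embedded in the item field, from the documents whose dept is equal to "A":

db.inventory.distinct( "item.sku", { dept: "A" } )

The method returns the following array of distinct sku values:

[ "111", "333" ]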

【Comments】:

This does not reduce the stored data.
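
As the comment notes, distinct() only reads values and leaves the stored arrays as they are. If the arrays should be rewritten on the server instead, one option on newer deployments is an update that uses an aggregation pipeline with $setUnion. The sketch below only builds that pipeline. It assumes MongoDB 4.2 or later and a PyMongo version that accepts a pipeline in update_many(), neither of which is mentioned in this thread, and it assumes every document contains all four arrays. `build_dedupe_pipeline` is a hypothetical helper name.

```python
FIELDS = ["Booked", "NA", "AOs", "AOr"]


def build_dedupe_pipeline(fields=FIELDS):
    """Build an update pipeline that replaces each array by its distinct values."""
    # $setUnion of an array with an empty array yields the distinct elements.
    # Assumption: each field exists and is an array in every document;
    # $setUnion does not guarantee the order of the resulting elements.
    return [
        {"$set": {field: {"$setUnion": ["$" + field, []]} for field in fields}}
    ]


pipeline = build_dedupe_pipeline()
print(pipeline)

# Assumed usage (needs MongoDB 4.2+ and a live connection):
#   collection.update_many({}, pipeline)
```

Like the set-based Python snippets, this does not preserve the original order of the dates, so it fits only when order does not matter.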
