Merge collections up to a fixed size using MongoDB Aggregation

Posted: 2021-06-22 23:06:20

I have a collection that looks like this:


    "_id" : id1,
    "field1" : 11,
    "field2": 101,
    "localityID" : 27


    "_id" : id2,
    "field1" : 22,
    "field2": 202,
    "localityID" : 27


    "_id" : id3,
    "field1" : 33,
    "field2": 303,
    "localityID" : 27


    "_id" : id4,
    "field1" : 44,
    "field2": 404,
    "localityID" : 27


    "_id" : id5,
    "field1" : 55,
    "field2": 505,
    "localityID" : 27


    "_id" : id6,
    "field1" : 66,
    "field2": 606,
    "localityID" : 61


    "_id" : id4,
    "field1" : 77,
    "field2": 707,
    "localityID" : 61

Use case: I want to retrieve and process records that share the same localityID in batches of 3. For tracking purposes, I also want to record which documents were processed in a particular batch.

So, I want to use MongoDB's aggregation framework to combine documents that share the same localityID, but only up to a fixed batch size (3, as described above).

I want to transform the collection above into:


  "_id" : "id111",
  "batchId" : "batch1",
  "localityID": 27,
  "batches": [
     
         "field1" : 11,
         "field2": 101
     ,
     
         "field1" : 22,
         "field2": 202
     ,
     
         "field1" : 33,
         "field2": 303
     
  ]


  "_id" : "id222",
  "batchId" : "batch2",
  "localityID": 27,
  "batches": [
     
         "field1" : 44,
         "field2": 404
     ,
     
         "field1" : 55,
         "field2": 505
     
  ]


  "_id" : "id333",
  "batchId" : "batch1",
  "localityID": 61,
  "batches": [
     
         "field1" : 66,
         "field2": 606
     ,
     
         "field1" : 77,
         "field2": 707
     
  ]

I have tried several combinations of aggregation stages, like the one below, but could not get the expected result.

(This groups all the records with the same localityID together, but only into a single document, not into batches of a fixed size.)

db.old_collection.aggregate([
  { "$group": { "_id": "$localityID" } },
  { "$lookup": {
      "from": "old_collection",
      "let": { "lid": "$_id" },
      "pipeline": [
        { "$match": { "$expr": { "$eq": [ "$localityID", "$$lid" ] } } },
        { "$project": { "_id": 0, "field1": 1, "field2": 1 } }
      ],
      "as": "batches"
  } },
  { "$out": "new_collection" }
])

The aggregation above produces the following result:


  "_id" : "id111",
  "batchId" : "batch1",
  "localityID": 27,
  "batches": [
     
         "field1" : 11,
         "field2": 101
     ,
     
         "field1" : 22,
         "field2": 202
     ,
     
         "field1" : 33,
         "field2": 303
     ,
     
         "field1" : 44,
         "field2": 404
     ,
     
         "field1" : 55,
         "field2": 505
     
  ]


  "_id" : "id333",
  "batchId" : "batch1",
  "localityID": 61,
  "batches": [
     
         "field1" : 66,
         "field2": 606
     ,
     
         "field1" : 77,
         "field2": 707
     
  ]

Is this achievable with Mongo's aggregation framework, or would I be better off using something else?

Comments:

Your aggregation pipeline does not produce any batchId field, so the result you posted cannot have come from this pipeline. I don't understand the logic of the batchId field.

Yes @WernfriedDomscheit, the batchId field does not exist in the input. For each localityID, the value of batchId can be a simple sequence number, running from 0 up to the total number of documents created for that localityID.

Answer 1:

The idea comes from this answer. You can use $range to generate an array of indexes, with the step parameter set to some bucketSize. Then you only need $slice to cut out arrays of size bucketSize. Try this:

let bucketSize = 3;

db.old_collection.aggregate([
    {
        $group: {
            _id: "$localityID",
            id: { $first: "$_id" },
            localityID: { $first: "$localityID" },
            batches: {
                $push: {
                    field1: "$field1",
                    field2: "$field2"
                }
            }
        }
    },
    {
        $project: {
            _id: 0,
            localityID: "$localityID",
            batches: {
                $map: {
                    input: { $range: [0, { $size: "$batches" }, bucketSize] },
                    as: "index",
                    in: { $slice: ["$batches", "$$index", bucketSize] }
                }
            }
        }
    },
    {
        $unwind: {
            path: "$batches",
            includeArrayIndex: "batchId"
        }
    },
    {
        $addFields: {
            batchId: {
                $concat: [
                    "batch",
                    { $toString: { $add: ["$batchId", 1] } }
                ]
            }
        }
    },
    // $sort is optional. You can remove it if not required.
    {
        $sort: {
            localityID: 1,
            batchId: 1
        }
    },
    { $out: "new_collection" }
]);

Output:

[
    {
        "_id": ObjectId("..."),
        "localityID": 27,
        "batches": [
            {
                "field1": 11,
                "field2": 101
            },
            {
                "field1": 22,
                "field2": 202
            },
            {
                "field1": 33,
                "field2": 303
            }
        ],
        "batchId": "batch1"
    },
    {
        "_id": ObjectId("..."),
        "localityID": 27,
        "batches": [
            {
                "field1": 44,
                "field2": 404
            },
            {
                "field1": 55,
                "field2": 505
            }
        ],
        "batchId": "batch2"
    },
    {
        "_id": ObjectId("..."),
        "localityID": 61,
        "batches": [
            {
                "field1": 66,
                "field2": 606
            },
            {
                "field1": 77,
                "field2": 707
            }
        ],
        "batchId": "batch1"
    }
]
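
For intuition, the chunking done in the $project stage can be tried in isolation. The sketch below is a demonstration only: the literal array of values and the bucket size of 3 are assumptions, and the single document from the collection is used merely to feed the expression.

db.old_collection.aggregate([
    { $limit: 1 },
    // Replace the document with a throwaway array, just to feed the expression.
    { $project: { _id: 0, batches: { $literal: [11, 22, 33, 44, 55] } } },
    {
        $project: {
            chunks: {
                $map: {
                    // $range yields the start index of every bucket: [0, 3]
                    input: { $range: [0, { $size: "$batches" }, 3] },
                    as: "index",
                    // $slice cuts at most 3 elements starting at that index
                    in: { $slice: ["$batches", "$$index", 3] }
                }
            }
        }
    }
])
// Result: { "chunks": [ [ 11, 22, 33 ], [ 44, 55 ] ] }

Because $range stops before the array size, the last bucket is simply shorter when the array length is not a multiple of bucketSize, which is exactly how locality 27 ends up with a final batch of two documents above.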

Discussion:

When applying the above solution to a large collection (more than 60 million records), I get the following error: $push used too much memory and cannot spill to disk. Is there any way to modify the solution to get around this error? I tried enabling allowDiskUse, but it did not solve the problem.

Full error message: The full response is { "operationTime": { "$timestamp": { "t": 1617712444, "i": 1 } }, "ok": 0.0, "errmsg": "$push used too much memory and cannot spill to disk. Memory limit: 104857600 bytes", "code": 146, "codeName": "ExceededMemoryLimit", "$clusterTime": { "clusterTime": { "$timestamp": { "t": 1617712522, "i": 1 } }, "keyId": 6903928055920590851 } }

I'm not good at writing memory-efficient queries. I'll update you if I find a solution. Sorry!

Any ideas for a solution for large collections?
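
One possible way to avoid the oversized $push arrays, offered here only as a sketch and not as part of the original answer, is to compute each document's batch number up front so that no single $group ever accumulates more than bucketSize documents per array. This assumes MongoDB 5.0 or newer for $setWindowFields and $documentNumber:

const bucketSize = 3;

db.old_collection.aggregate([
    // Number documents 1, 2, 3, ... within each localityID.
    {
        $setWindowFields: {
            partitionBy: "$localityID",
            sortBy: { _id: 1 },
            output: { rowNumber: { $documentNumber: {} } }
        }
    },
    // Zero-based batch number: rows 1-3 -> 0, rows 4-6 -> 1, and so on.
    {
        $addFields: {
            batchNumber: {
                $toInt: { $floor: { $divide: [{ $subtract: ["$rowNumber", 1] }, bucketSize] } }
            }
        }
    },
    // Each group now holds at most bucketSize documents, so the pushed array stays small.
    {
        $group: {
            _id: { localityID: "$localityID", batchNumber: "$batchNumber" },
            localityID: { $first: "$localityID" },
            batchId: { $first: { $concat: ["batch", { $toString: { $add: ["$batchNumber", 1] } }] } },
            batches: { $push: { field1: "$field1", field2: "$field2" } }
        }
    },
    { $project: { _id: 0 } },
    { $out: "new_collection" }
], { allowDiskUse: true });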

Answer 2:

As mentioned before, I do not understand the logic of the batchId field. Apart from that, a simple solution could be this one:

db.collection.aggregate([
   { $group: { _id: "$localityID", batches: { $push: { field1: "$field1", field2: "$field2" } } } },
   {
      $project: {
         localityID: "$_id",
         batches: { $slice: ["$batches", 1, 3] }
      }
   }
])

