Optimise MongoDB aggregate query performance

Posted 2023-04-14


【Title】Optimise MongoDB aggregate query performance 【Posted】2021-10-12 21:56:53 【Question】:

I have the following database structure:

workspaces:

	Key	Index
PK	id	id
	content

projects:

	Key	Index
PK	id	id
FK	workspace	workspace_1
	deleted	deleted_1
	content

items:

	Key	Index
PK	id	id
FK	project	project_1
	type	_type_1
	deleted	deleted_1
	content

I need to count the number of items of each type for every project in a workspace. Example of the expected output:

[
  { _id: 'projectId1', itemType1Count: 100, itemType2Count: 50, itemType3Count: 200 },
  { _id: 'projectId2', itemType1Count: 40, itemType2Count: 100, itemType3Count: 300 },
  ....
]

After a few attempts and some debugging, I came up with a query that produces the output I need:

const pipeline = [
  { $match: { workspace: 'workspaceId1' } },
  {
    $lookup: {
      from: 'items',
      let: { id: '$_id' },
      pipeline: [
        {
          $match: {
            $expr: {
              $eq: ['$project', '$$id'],
            },
          },
        },
        // project only fields necessary for later pipelines to not overload
        // memory and to not get `exceeded memory limit for $group` error
        { $project: { _id: 1, type: 1, deleted: 1 } },
      ],
      as: 'items',
    },
  },
  // Use $unwind here to optimize aggregation pipeline, see:
  // https://***.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
  // Without $unwind we may get an `matching pipeline exceeds maximum document size` error.
  // Error appears not in all requests and it's really strange and hard to debug.
  { $unwind: '$items' },
  { $match: { 'items.deleted': { $eq: false } } },
  {
    $group: {
      _id: '$_id',
      items: { $push: '$items' },
    },
  },
  {
    $project: {
      _id: 1,
      // Note: I have only 3 possible item types, so it's OK that it's names hardcoded.
      itemType1Count: {
        $size: {
          $filter: {
            input: '$items',
            cond: { $eq: ['$$this.type', 'type1'] },
          },
        },
      },
      itemType2Count: {
        $size: {
          $filter: {
            input: '$items',
            cond: { $eq: ['$$this.type', 'type2'] },
          },
        },
      },
      itemType3Count: {
        $size: {
          $filter: {
            input: '$items',
            cond: { $eq: ['$$this.type', 'type3'] },
          },
        },
      },
    },
  },
]

const counts = await Project.aggregate(pipeline)

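The query works as expected, but it is very slow: with about 1,000 projects in a single workspace, it takes roughly 8 seconds to complete. Any ideas on how to make it faster are appreciated.

Thanks.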


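【Answer 1】:

Assuming your indexes are set up properly and cover the "correct" fields, the query itself can still be tweaked.

Approach 1: keep the existing collection schema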

db.projects.aggregate([
  {
    $match: {
      workspace: "workspaceId1"
    }
  },
  {
    $lookup: {
      from: "items",
      let: { id: "$_id" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$project", "$$id"] },
                { $eq: ["$deleted", false] }
              ]
            }
          }
        },
        // project only fields necessary for later pipelines to not overload
        // memory and to not get `exceeded memory limit for $group` error
        {
          $project: {
            _id: 1,
            type: 1,
            deleted: 1
          }
        }
      ],
      as: "items"
    }
  },
  // Use $unwind here to optimize aggregation pipeline, see:
  // https://***.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d
  // Without $unwind we may get an `matching pipeline exceeds maximum document size` error.
  // Error appears not in all requests and it's really strange and hard to debug.
  {
    $unwind: "$items"
  },
  {
    $group: {
      _id: "$_id",
      itemType1Count: {
        $sum: {
          "$cond": {
            "if": { $eq: ["$items.type", "type1"] },
            "then": 1,
            "else": 0
          }
        }
      },
      itemType2Count: {
        $sum: {
          "$cond": {
            "if": { $eq: ["$items.type", "type2"] },
            "then": 1,
            "else": 0
          }
        }
      },
      itemType3Count: {
        $sum: {
          "$cond": {
            // compare against "type3" here, not "type1"
            "if": { $eq: ["$items.type", "type3"] },
            "then": 1,
            "else": 0
          }
        }
      }
    }
  }
])

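There are 2 main changes:

1. The items.deleted : false condition is moved into the $lookup sub-pipeline, so fewer items documents are looked up.
2. The items: { $push: '$items' } accumulator is dropped; a conditional sum in the $group stage counts each type directly.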

Here is a Mongo playground for your reference (at least to check the correctness of the new query).
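To make the second change concrete, here is a minimal sketch in plain JavaScript (no MongoDB involved) that contrasts the two counting strategies on a tiny in-memory array. The sample rows and both function names are made up for illustration; they only mimic what $push plus $filter and a conditional $sum do per group.

```javascript
// Made-up stand-in for the unwound (project, item) pairs.
const rows = [
  { project: 'p1', type: 'type1' },
  { project: 'p1', type: 'type2' },
  { project: 'p1', type: 'type1' },
  { project: 'p2', type: 'type3' },
];

// Strategy A (original query): collect every item per project,
// then filter the collected array once per type.
function countByPushAndFilter(input) {
  const grouped = new Map();
  for (const row of input) {
    if (!grouped.has(row.project)) grouped.set(row.project, []);
    grouped.get(row.project).push(row); // mirrors $push
  }
  return [...grouped].map(([id, items]) => ({
    _id: id,
    itemType1Count: items.filter((i) => i.type === 'type1').length,
    itemType2Count: items.filter((i) => i.type === 'type2').length,
    itemType3Count: items.filter((i) => i.type === 'type3').length,
  }));
}

// Strategy B (Approach 1): keep only running totals and add 1 or 0
// per row, which mirrors $sum over $cond.
function countByConditionalSum(input) {
  const totals = new Map();
  for (const row of input) {
    if (!totals.has(row.project)) {
      totals.set(row.project, {
        _id: row.project,
        itemType1Count: 0,
        itemType2Count: 0,
        itemType3Count: 0,
      });
    }
    const total = totals.get(row.project);
    total.itemType1Count += row.type === 'type1' ? 1 : 0;
    total.itemType2Count += row.type === 'type2' ? 1 : 0;
    total.itemType3Count += row.type === 'type3' ? 1 : 0;
  }
  return [...totals.values()];
}

console.log(countByPushAndFilter(rows));
console.log(countByConditionalSum(rows));
```

Both functions return the same counts; the second never holds the per-project item arrays in memory, which is the point of dropping $push.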

Approach 2: if the collection schema can be modified, we can denormalize projects.workspace into the items collection like this:


{
    "_id": "i1",
    "project": "p1",
    "workspace": "workspaceId1",
    "type": "type1",
    "deleted": false
}

This way you can skip the $lookup entirely. A simple $match followed by a $group is enough.

db.items.aggregate([
  {
    $match: {
      "deleted": false,
      "workspace": "workspaceId1"
    }
  },
  {
    $group: {
      _id: "$project",
      itemType1Count: {
        $sum: {
          "$cond": {
            "if": { $eq: ["$type", "type1"] },
            "then": 1,
            "else": 0
          }
        }
      },
      ...
    }
  }
])

Here is a Mongo playground with the denormalized schema for your reference.
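Since the snippet above is elided after the first counter, the following is a rough sketch of how the full pipeline for the denormalized schema could be generated. The helper name buildCountPipeline and the itemTypeNCount naming by position are assumptions made for illustration, not part of the original answer.

```javascript
// Hypothetical helper: builds a $match + $group pipeline for the
// denormalized items collection, with one conditional-sum counter per type.
function buildCountPipeline(workspaceId, types) {
  const group = { _id: '$project' };
  types.forEach((type, index) => {
    // Counter fields are named itemType1Count, itemType2Count, ...
    group[`itemType${index + 1}Count`] = {
      $sum: { $cond: { if: { $eq: ['$type', type] }, then: 1, else: 0 } },
    };
  });
  return [
    { $match: { deleted: false, workspace: workspaceId } },
    { $group: group },
  ];
}

const countPipeline = buildCountPipeline('workspaceId1', ['type1', 'type2', 'type3']);
console.log(JSON.stringify(countPipeline, null, 2));
```

The resulting array can be passed to db.items.aggregate(...) in the shell, or to the items model's aggregate method in application code.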

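【Comments】:

Thanks for your answer. Approach 1 looks cleaner, but unfortunately its execution time is almost the same as my original query: about 8 seconds on the same data. I should note that the items collection is quite large (about 700k records). Maybe this is the best possible result for this kind of aggregation?

Can you determine whether the lookup is the slow part? If so, you may want to add a compound index on the items collection covering the project and deleted fields. That should make the initial $match faster.

With the compound index it works even slower, by 1 second :(

Can you try using explain to check your query execution plan? My answer can only serve as some intuitive guidance on what we can try on the code side. We may need more information (e.g. index usage) to help further.

Explain shows the execution plan only for the first stage, the one with $match. Yes, I'm quite sure the slowest part of the pipeline is the $lookup; all other stages run really fast. Interestingly, this query with the same data runs 2 times faster in the production environment than in my local one. Maybe because there is more free RAM available for indexes? Anyway, I decided to try the second approach and modify the data structure. It will also help make other reporting queries faster. Thanks for your help, @ray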
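The two suggestions from the comments (a compound index on project and deleted, and checking the plan with explain) might look roughly like this in mongosh. The wrapper function names are hypothetical, and db is assumed to be the shell's database handle; whether the planner actually uses the index inside the $lookup sub-pipeline depends on the server version and query shape, which is what explain is meant to reveal.

```javascript
// Hypothetical wrapper: creates the compound index suggested in the comments.
// `db` is assumed to be a mongosh database handle.
function addItemsCompoundIndex(db) {
  return db.items.createIndex({ project: 1, deleted: 1 });
}

// Hypothetical wrapper: asks the server for execution statistics of the
// aggregation instead of running it for results.
function explainProjectCounts(db, pipeline) {
  return db.projects.explain('executionStats').aggregate(pipeline);
}
```

In the shell this would be used as addItemsCompoundIndex(db) followed by explainProjectCounts(db, pipeline), then reading the reported stages to see where the time goes.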
