Performance when scoring complex matches in a Neo4j graph database?

Posted 2023-03-29

Asked: 2018-04-24 09:14:52

Question:

I have a Neo4j 3.3.5 graph database: 27 GB, 50 million nodes, 500 million relationships. Indexes are in place. Schema. PC: 16 GB RAM, 4 cores.

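The task is to find the best-matching companies for given query data. The :Company nodes I need to fetch have several kinds of relationships to other nodes: :Branch, :Country, etc. The query data contains BranchIds, CountryIds, etc.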

Currently I use Cypher like this to get the scores from a single relationship (the result is 500k rows):

MATCH (c:Company)-[r:HAS_BRANCH]->(b:Branch)
WHERE b.branchId in [27444, 1692, 23409, ...] //around 10 ids per query
RETURN 
c.companyId as Id, 
case r.branchType 
 when 0 then 25
 ... // around 7 conditions per query
 when 10 then 20 
end as Score

I have to compute such scores for all relationship types of :Company, group by Id, sum the Score, order the result, and take the top 100 rows.
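The pipeline described here (score each matching relationship, group by company, sum, sort, keep the best rows) can be sketched outside the database. This is only a minimal Python illustration over in-memory tuples, not a replacement for the Cypher query. `BRANCH_SCORES` copies just the two weights that are visible in the query above (the others are elided there), and `score_rows` and `top_companies` are hypothetical helper names.

```python
import heapq
from collections import Counter

# Only the two weights visible in the question's query; the rest are elided there.
BRANCH_SCORES = {0: 25, 10: 20}

def score_rows(matches, weights):
    """Turn (company_id, rel_type) matches into (company_id, score) rows."""
    return [(cid, weights.get(rel_type, 0)) for cid, rel_type in matches]

def top_companies(rows, k=100):
    """Group rows by company id, sum the scores, and return the k best."""
    totals = Counter()
    for cid, score in rows:
        totals[cid] += score
    # nlargest avoids fully sorting every company when only k results are needed
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

# Made-up matches: company 1 has two matching relationships.
matches = [(1, 0), (1, 10), (2, 10), (3, 0)]
print(top_companies(score_rows(matches, BRANCH_SCORES), k=2))
```

Rows from further relationship types could simply be appended to the list before calling `top_companies`, which is the role that collect + UNWIND plays in the query.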

Because post-UNION processing is not available, I use collect + UNWIND to merge the scores from all relationships.

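Unfortunately, performance is poor. A query over a single relationship (as above) responds in 5-10 seconds. When I try to combine the results with collect + UNWIND, the query "never" finishes.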

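What is a better or correct approach? Am I doing something wrong in the graph design? Is the hardware too weak? Or is there an algorithm in graph databases for matching against a score graph (the query data)?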

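Update

Query explanation:

Users can search for companies in our system. For each user query we prepare query data containing branch IDs, countries, words, etc. As the result we want a list of the best-matching company IDs with their scores.

For example, a user may search for new companies from Spain that produce wooden tables.

Example of the combined query: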

MATCH (c:Company)-[r:HAS_BRANCH]->(b:Branch)
WHERE b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"]
WITH case r.branchType
  when "0" then collect({id: c.companyId, score: 25})
  when "1" then collect({id: c.companyId, score: 19})
  when "2" then collect({id: c.companyId, score: 20})
  when "3" then collect({id: c.companyId, score: 19})
  when "4" then collect({id: c.companyId, score: 20})
  when "5" then collect({id: c.companyId, score: 15})
  when "6" then collect({id: c.companyId, score: 6})
  when "7" then collect({id: c.companyId, score: 5})
  when "8" then collect({id: c.companyId, score: 4})
  when "9" then collect({id: c.companyId, score: 4})
  when "10" then collect({id: c.companyId, score: 20})
end as rows
MATCH (c:Company)-[r:HAS_REVERTED_BRANCH]->(b:Branch)
WHERE b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"]
WITH rows + case r.branchType
  when "0" then collect({id: c.companyId, score: 25})
  when "1" then collect({id: c.companyId, score: 19})
  when "2" then collect({id: c.companyId, score: 20})
  when "3" then collect({id: c.companyId, score: 19})
  when "10" then collect({id: c.companyId, score: 20})
end as rows
MATCH (c:Company)-[r:HAS_COUNTRY]->(cou:Country)
WHERE cou.countryId in ["9580" , "18551" , "15895"]
WITH rows + case r.branchType
  when "0" then collect({id: c.companyId, score: 30})
  when "2" then collect({id: c.companyId, score: 15})
end as rows
... // here I would add scoring for other relations in the future
UNWIND rows AS row
RETURN row.id AS Id, sum(row.score) AS Score
ORDER BY Score DESC
LIMIT 100

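Comments:

Can you share the EXPLAIN output of your query and the complete query? Also, the HAS_BRANCH relationship seems to be important to your business; usually, if something is that important, it should be a node.

@logisima Thanks for your comment, I have updated the question.

Answer 1: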

You can try this query and see whether it performs better:

MATCH (c:Company) WITH c
OPTIONAL MATCH (c)-[r1:HAS_BRANCH]->(b:Branch) WHERE b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"] 
OPTIONAL MATCH (c)-[r2:HAS_REVERTED_BRANCH]->(rb:Branch) WHERE rb.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"] 
OPTIONAL MATCH (c)-[r3:HAS_COUNTRY]->(cou:Country) WHERE cou.countryId in ["9580" , "18551" , "15895"] 
WITH c, 
    case r1.branchType 
      when "0" then 25
      when "1" then 19 
      when "2" then 20 
      when "3" then 19 
      when "4" then 20 
      when "5" then 15 
      when "6" then 6 
      when "7" then 5 
      when "8" then 4 
      when "9" then 4 
      when "10" then 20 
    end as branchScore,
    case r2.branchType 
      when "0" then  25 
      when "1" then  19 
      when "2" then  20 
      when "3" then  19 
      when "10" then  20 
    end as revertedBranchScore,
    case r3.branchType 
      when "0" then  30
      when "2" then  15 
    end as countryScore

WITH c.companyId AS Id, branchScore + revertedBranchScore + countryScore AS Score
RETURN Id, sum(Score) AS Score
ORDER BY Score DESC
LIMIT 100
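One detail of this query is worth spelling out. In Cypher, a CASE expression without an ELSE branch evaluates to null when no WHEN matches, arithmetic with null yields null, and sum() skips null values. A company that lacks any one of the three relationships therefore loses the scores of the others. The Python sketch below mimics that behaviour with `None`; `cypher_add`, `cypher_sum` and `case_score` are hypothetical helpers written for this illustration, not Neo4j APIs, and only two of the three score tables are modelled.

```python
def cypher_add(*values):
    """Mimic Cypher '+': the result is null (None) if any operand is null."""
    return None if any(v is None for v in values) else sum(values)

def cypher_sum(values):
    """Mimic Cypher sum(): null inputs are skipped."""
    return sum(v for v in values if v is not None)

def case_score(weights, rel_type, default=None):
    """CASE ... END without ELSE gives null; default=0 models ELSE 0."""
    return weights.get(rel_type, default)

BRANCH = {"0": 25, "1": 19}   # shortened copy of the branch weights
COUNTRY = {"0": 30, "2": 15}  # country weights from the query

# A company with a matching branch but no matching country (r3 is null).
without_else = cypher_add(case_score(BRANCH, "0"), case_score(COUNTRY, None))
with_else = cypher_add(case_score(BRANCH, "0", 0), case_score(COUNTRY, None, 0))

print(cypher_sum([without_else]))  # the branch score is lost
print(cypher_sum([with_else]))     # the branch score survives
```

Under this model, adding `else 0` to every CASE keeps partial matches in the total.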

Or, even better, this one (but only if every Company node is guaranteed to be linked to both a Country and a Branch):

MATCH 
  (c:Company)-[r1:HAS_BRANCH]->(b:Branch),
  (c)-[r2:HAS_REVERTED_BRANCH]->(rb:Branch),
  (c)-[r3:HAS_COUNTRY]->(cou:Country)
WHERE 
  b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"] AND 
  rb.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"] AND
  cou.countryId in ["9580" , "18551" , "15895"]
WITH c, 
    case r1.branchType 
      when "0" then 25
      when "1" then 19 
      when "2" then 20 
      when "3" then 19 
      when "4" then 20 
      when "5" then 15 
      when "6" then 6 
      when "7" then 5 
      when "8" then 4 
      when "9" then 4 
      when "10" then 20 
    end as branchScore,
    case r2.branchType 
      when "0" then  25 
      when "1" then  19 
      when "2" then  20 
      when "3" then  19 
      when "10" then  20 
    end as revertedBranchScore,
    case r3.branchType 
      when "0" then  30
      when "2" then  15 
    end as countryScore

WITH c.companyId AS Id, branchScore + revertedBranchScore + countryScore AS Score
RETURN Id, sum(Score) AS Score
ORDER BY Score DESC
LIMIT 100
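Note that a single MATCH with three comma-separated patterns only returns companies for which all three patterns match, whereas the OPTIONAL MATCH version also keeps companies that match only some of them. A minimal Python sketch of that difference, using made-up company ids and plain sets in place of graph lookups:

```python
# Hypothetical ids of the companies that match each pattern.
has_branch = {1, 2, 3}
has_reverted_branch = {2, 3, 4}
has_country = {3, 5}

# MATCH p1, p2, p3: every pattern must match (set intersection).
all_required = has_branch & has_reverted_branch & has_country

# OPTIONAL MATCH chain: a company can score if any pattern matches (set union).
any_allowed = has_branch | has_reverted_branch | has_country

print(sorted(all_required))
print(sorted(any_allowed))
```

So the two queries are interchangeable only when every company that should be scored has all three kinds of matching relationships.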

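Discussion:

I have tested the first query (unfortunately the second one is not an option). It performs better, and I was able to get a response from the db (I had to add else 0 to the case statements). But it still takes a few minutes. Do you think it would perform better if I changed the schema to :Company-[:HAS]->:Branch-[:IS]->:BranchType?

You are computing the score for all companies (i.e. 1.5M), so it is normal that it takes time, and you said you cannot filter them before scoring... Your suggestion would help. Another option is to use the second query and add some empty nodes for Country, Branch, etc.

I tested the second query, and I think it gives different results. It matches companies that have at least one id in the listed branchIds, one in revertedBranchIds, and one in countriesIds (the where clause). I need OR. Empty nodes made no difference. I am going to test the changed schema now. I am marking this as the answer because its speed is the best.

Answer 2: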

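Let's see whether we can lower the cardinality of the matches by using pattern comprehensions and the reduce() function to update each company's score as the query proceeds, and by waiting until the end to project out the id property: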

MATCH (c:Company)
WITH c, [(c)-[r:HAS_BRANCH]->(b:Branch) 
 WHERE b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"] | r.branchType] as hasBranchTypes
WITH c, reduce(runningScore = 0, type in hasBranchTypes | runningScore + 
 case type 
 when "0" then 25
 when "1" then 19
 when "2" then 20 
 when "3" then 19 
 when "4" then 20 
 when "5" then 15 
 when "6" then 6 
 when "7" then 5 
 when "8" then 4 
 when "9" then 4 
 when "10" then 20 
 end ) as score

WITH c, score, [(c:Company)-[r:HAS_REVERTED_BRANCH]->(b:Branch)
 WHERE b.branchId in ["27444" , "1692" , "23409" , "8744" , "9192" , "26591" , "21396" , "27151" , "20228" , "3517" , "25058" , "29549"] | r.branchType] as revertedBranchTypes
WITH c, reduce(runningScore = score, type in revertedBranchTypes | runningScore + 
 case type
 when "0" then 25
 when "1" then 19 
 when "2" then 20 
 when "3" then 19 
 when "10" then 20 
end ) as score

WITH c, score, [(c:Company)-[r:HAS_COUNTRY]->(cou:Country)
 WHERE cou.countryId in ["9580" , "18551" , "15895"] | r.branchType] as hasCountryTypes
WITH c, reduce(runningScore = score, type in hasCountryTypes | runningScore + 
 case type
 when "0" then 30 
 when "2" then 15 
 end ) as score
 //here I would add in future other relations scoring

WITH c, score
ORDER BY score DESC
LIMIT 100
RETURN c.companyId as Id, score as Score
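The idea behind this query can be sketched in Python, assuming that each company's matching relationship types have already been gathered into one list per relationship kind (which is what the pattern comprehensions produce). The score is folded with a reduce, so each company stays a single row throughout. The weight tables are copied from the query; the `.get(t, 0)` default is an addition that plays the role of ELSE 0, which the query itself does not have, and `fold_score` and `company_score` are hypothetical helper names.

```python
from functools import reduce

HAS_BRANCH = {"0": 25, "1": 19, "2": 20, "3": 19, "4": 20, "5": 15,
              "6": 6, "7": 5, "8": 4, "9": 4, "10": 20}
HAS_REVERTED_BRANCH = {"0": 25, "1": 19, "2": 20, "3": 19, "10": 20}
HAS_COUNTRY = {"0": 30, "2": 15}

def fold_score(start, rel_types, weights):
    """Counterpart of reduce(runningScore = start, type IN rel_types | ...)."""
    return reduce(lambda running, t: running + weights.get(t, 0), rel_types, start)

def company_score(branch_types, reverted_types, country_types):
    """Carry one running score through the three relationship kinds."""
    score = fold_score(0, branch_types, HAS_BRANCH)
    score = fold_score(score, reverted_types, HAS_REVERTED_BRANCH)
    return fold_score(score, country_types, HAS_COUNTRY)

# Made-up company: two matching branches, no reverted branch, one country.
print(company_score(["0", "5"], [], ["2"]))  # 25 + 15 + 15 = 55
```

An empty list leaves the running score unchanged, so a company without a given relationship keeps the score accumulated so far.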

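Discussion:

Thanks for your answer. I have tested this approach, and it is slower. I had to limit the companies in the first line with WITH c LIMIT 100000 (~10%) to get any result at all, and even then it was 6 times slower than the query from answer 1. Query plan.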
