Finding the longest cyclic path in a graph using Gremlin
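I am trying to build a Gremlin query to use in DSE Graph with geo search enabled (indexed in Solr). The problem is that the graph is so densely interconnected that the cyclic path traversals time out. The prototype graph I am working with right now has ~1600 vertices and ~35K edges. The number of triangles passing through each vertex is summarized below: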
+--------------------+-----+
| gps|count|
+--------------------+-----+
|POINT (-0.0462032...| 1502|
|POINT (-0.0458048...| 405|
|POINT (-0.0460680...| 488|
|POINT (-0.0478356...| 1176|
|POINT (-0.0479465...| 5566|
|POINT (-0.0481031...| 9896|
|POINT (-0.0484724...| 433|
|POINT (-0.0469379...| 302|
|POINT (-0.0456595...| 394|
|POINT (-0.0450722...| 614|
|POINT (-0.0475904...| 3080|
|POINT (-0.0479464...| 5566|
|POINT (-0.0483400...| 470|
|POINT (-0.0511753...| 370|
|POINT (-0.0521901...| 1746|
|POINT (-0.0519999...| 1026|
|POINT (-0.0468071...| 1247|
|POINT (-0.0469636...| 1165|
|POINT (-0.0463685...| 526|
|POINT (-0.0465805...| 1310|
+--------------------+-----+
only showing top 20 rows
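For readers who want to reproduce this kind of density measure on a small in-memory graph, here is a minimal Python sketch of per-vertex triangle counting. It is not the query that produced the table above; the function name `triangles_per_vertex` and the toy graph are assumptions made purely for illustration.

```python
from itertools import combinations

def triangles_per_vertex(adj):
    """Count, for every vertex, the triangles that pass through it.

    adj maps each vertex to the set of its neighbours (undirected graph).
    """
    counts = {}
    for v, nbrs in adj.items():
        # A triangle through v is a pair of v's neighbours that are adjacent to each other.
        counts[v] = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return counts

# Hypothetical toy graph: a square a-b-c-d with one diagonal a-c.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
print(triangles_per_vertex(toy))  # {'a': 2, 'b': 1, 'c': 2, 'd': 1}
```

Counts in the thousands per vertex, as in the table, indicate that nearly every pair of neighbours is itself connected, which is what makes path enumeration so expensive.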
I expect the graph to eventually grow to a huge size, but I will restrict the cycle search to a geographic area (e.g. a radius of ~300 meters).
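As a rough illustration of what restricting the search to a radius means outside the database, the following Python sketch filters vertices by great-circle distance. It assumes coordinates are stored as (longitude, latitude) in degrees and uses a spherical Earth approximation; `haversine_m`, `within_radius`, and the sample points are hypothetical names and data, not part of DSE Graph.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(points, center, radius_m):
    """Return the ids of all points no farther than radius_m from center."""
    clon, clat = center
    return {vid for vid, (lon, lat) in points.items()
            if haversine_m(clon, clat, lon, lat) <= radius_m}

# Hypothetical vertices near the coordinates used in the question.
points = {
    "start": (-0.04814, 51.53126),
    "near": (-0.04700, 51.53130),   # roughly 80 m east of start
    "far": (-0.04000, 51.53130),    # roughly 560 m east of start
}
print(sorted(within_radius(points, points["start"], 300)))
```

Inside DSE Graph the same restriction is expressed with the `Geo.inside(...)` predicate shown elsewhere in this thread, which can use the Solr index instead of scanning every vertex.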
My best attempt so far has been some version of the following:
g.V().has('gps',Geo.point(lon, lat)).as('P')
.repeat(both()).until(cyclicPath()).path().by('gps')
Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request
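For illustration, the figure below shows the starting vertex in green and the terminating vertex in red. Assume all vertices are interconnected. I am interested in the longest path between green and red, which would be the loop around the block.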
Some links I have read that did not help:
1) http://tinkerpop.apache.org/docs/current/recipes/#cycle-detection
2) Longest acyclic path in a directed unweighted graph
3) https://groups.google.com/forum/#!msg/gremlin-users/tc8zsoEWb5k/9X9LW-7bCgAJ
EDIT
Following Daniel's suggestion to create a subgraph first, the query still times out:
gremlin> hood = g.V().hasLabel('image').has('gps', Geo.inside(point(-0.04813968113126384, 51.531259899256995), 100, Unit.METERS)).bothE().subgraph('hood').cap('hood').next()
==>tinkergraph[vertices:640 edges:28078]
gremlin> hg = hood.traversal()
==>graphtraversalsource[tinkergraph[vertices:640 edges:28078], standard]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x')
==>v[{~label=image, partition_key=2507574903070261248, cluster_key=RFAHA095CLK-2017-09-14 12:52:31.613}]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()
Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request: [91b6f1fa-0626-40a3-9466-5d28c7b5c27c - hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()]
Type ':help' or ':h' for help.
Display stack trace? [yN]n
The longest path, measured by the number of hops, will be the last one you find.
g.V().has('gps', Geo.point(x, y)).as('x').
repeat(both().simplePath()).
emit(where(both().as('x'))).
both().where(eq('x')).tail(1).
path()
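Unless you have a very small (sub)graph, there is no way to make this query perform well in OLTP. So, depending on what you consider a "city block" in your graph, you should first extract it as a subgraph and then apply the longest-path query (in memory).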
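To make the "in memory" step concrete, here is a small Python sketch of an exhaustive search for the longest simple cycle through a start vertex. It is an illustration of the idea rather than a translation of the Gremlin traversal: it uses depth-first search and keeps the best cycle seen, and unlike the traversal it ignores the trivial out-and-back "cycle" over a single edge. The name `longest_cycle_through` and the toy graph are assumptions.

```python
def longest_cycle_through(adj, start):
    """Return the longest simple cycle that starts and ends at `start`, as a vertex list.

    adj maps each vertex to the set of its neighbours (undirected graph).
    The search is exhaustive and exponential in the worst case, so it is only
    suitable for small subgraphs. Returns [] when no cycle exists.
    """
    best = []

    def dfs(path, visited):
        nonlocal best
        last = path[-1]
        # Close the loop once the path holds at least three distinct vertices.
        if len(path) >= 3 and start in adj[last] and len(path) + 1 > len(best):
            best = path + [start]
        for nxt in adj[last]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                dfs(path, visited)
                path.pop()
                visited.remove(nxt)

    dfs([start], {start})
    return best

# Hypothetical "city block": a square a-b-c-d with one diagonal a-c.
block = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}
cycle = longest_cycle_through(block, "a")
print(cycle)
```

The number of simple paths grows combinatorially with edge density, which is why a 640-vertex, 28K-edge subgraph still times out while a graph of a few dozen vertices is tractable.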
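One solution I came up with uses Spark GraphFrames and the label propagation algorithm (GraphFrames, LPA). You can then compute the average GPS position of each community (in fact you do not even need the average; a single member of each community is enough) together with all the edges that exist between the representatives of the communities (averaged or otherwise).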
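For readers unfamiliar with label propagation, the following Python sketch shows the basic idea on a toy graph. It is a simplified synchronous variant that breaks ties by picking the smallest label, which is an assumption for determinism; the GraphFrames implementation may behave differently in its details. The function name and the sample graph are hypothetical.

```python
from collections import Counter

def label_propagation(adj, max_iter=5):
    """Assign a community label to every vertex of an undirected graph.

    adj maps each vertex to the set of its neighbours. Every vertex starts in
    its own community and repeatedly adopts the most common label among its
    neighbours (smallest label wins ties).
    """
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        new_labels = {}
        for v, nbrs in adj.items():
            if not nbrs:
                new_labels[v] = labels[v]  # isolated vertex keeps its label
                continue
            counts = Counter(labels[n] for n in nbrs)
            top = max(counts.values())
            new_labels[v] = min(lbl for lbl, c in counts.items() if c == top)
        if new_labels == labels:
            break  # converged
        labels = new_labels
    return labels

# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
two_blocks = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
communities = label_propagation(two_blocks)
print(communities)
```

On this toy input the two triangles end up with two different labels, which mirrors how densely connected clusters of images collapse into a handful of communities.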
Select a region of the graph and save its vertices and edges:
// subgraph() collects edges, so step onto the incident edges first
g.V().has('gps', Geo.inside(Geo.point(x, y), radius, Unit.METERS))
.bothE().subgraph('g').cap('g')
Spark snippet:
import org.graphframes.GraphFrame
// Encoders for .map on DataFrames and for .toDF; assumes a SparkSession named `spark` (as in spark-shell)
import spark.implicits._
// Load the exported vertices and edges and build the graph
val V = spark.read.json("v.json")
val E = spark.read.json("e.json")
val g = GraphFrame(V, E)
// Label propagation assigns a community label to every vertex
val result = g.labelPropagation.maxIter(5).run()
// Parse the "x,y,z" coordinate string and pair it with the community label
val rdd = result.select("fullgps", "label").map(row => {
  val coords = row.getString(0).split(",")
  val x = coords(0).toDouble
  val y = coords(1).toDouble
  val z = coords(2).toDouble
  val id = row.getLong(1)
  (x, y, z, id)
}).rdd
// One representative per community; GraphFrame expects the vertex key column to be named "id"
val newVertexes = rdd.map { case (x, y, z, id) => (id, (x, y, z)) }.toDF("id", "gps").dropDuplicates("id")
// Average GPS per community, saved as lines of "x,y,z,label,count"
rdd.map { case (x, y, z, id) => (id, ((x, y, z), 1)) }.
  reduceByKey { case (((xL, yL, zL), countL), ((xR, yR, zR), countR)) =>
    ((xR + xL, yR + yL, zR + zL), countR + countL) }.
  map { case (id, ((x, y, z), c)) => s"${x / c},${y / c},${z / c},$id,$c" }.
  saveAsTextFile("avg_gps.csv")
// Keep IDs
val rdd2 = result.select("id", "label").map(row => {
  val id = row.getString(0)
  val lbl = row.getLong(1)
  (lbl, id)
}).rdd
val edgeDF = E.select("dst", "src").map(row => (row.getString(0), row.getString(1))).toDF("dst", "src")
// Src: community label of each edge's source vertex, keyed by "src###dst"
val tmp0 = result.select("id", "label").join(edgeDF, result("id") === edgeDF("src"))
val srcDF = tmp0.select("src", "dst", "label").
  map(row => (row.getString(0) + "###" + row.getString(1), row.getLong(2))).
  withColumnRenamed("_1", "src_lbl").withColumnRenamed("_2", "src_edge")
// Dst: community label of each edge's destination vertex, keyed the same way
val tmp1 = result.select("id", "label").join(edgeDF, result("id") === edgeDF("dst"))
val dstDF = tmp1.select("src", "dst", "label").
  map(row => (row.getString(0) + "###" + row.getString(1), row.getLong(2))).
  withColumnRenamed("_1", "dst_lbl").withColumnRenamed("_2", "dst_edge")
// Match both ends of every edge, then keep only edges that cross community boundaries
val newE = srcDF.join(dstDF, srcDF("src_lbl") === dstDF("dst_lbl"))
val newEdges = newE.filter(newE("src_edge") =!= newE("dst_edge")).
  select("src_edge", "dst_edge").
  map(row => (row.getLong(0).toString + "###" + row.getLong(1).toString, row.getLong(0), row.getLong(1))).
  withColumnRenamed("_1", "edge").withColumnRenamed("_2", "src").withColumnRenamed("_3", "dst").
  dropDuplicates("edge").select("src", "dst")
val newGraph = GraphFrame(newVertexes, newEdges)
The average positions are then connected by edges. In this case the problem shrinks from ~1600 vertices and ~35K edges to 25 vertices and 54 edges:
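Here the non-green segments (red, white, black, etc.) represent the individual communities. The green circles are the average GPS positions, sized in proportion to the number of members in each community. It is now much easier to run an OLTP algorithm such as the one Daniel proposed above.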
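The reduction step described here can be sketched in a few lines of plain Python: average the position of each community and keep one edge per pair of distinct communities. This is a simplified illustration, not the Spark job itself; it assumes two-dimensional (lon, lat) coordinates and undirected edges, and `contract_communities` plus the sample data are hypothetical.

```python
from collections import defaultdict

def contract_communities(labels, coords, edges):
    """Collapse a graph into one vertex per community.

    labels: vertex -> community label
    coords: vertex -> (lon, lat)
    edges:  iterable of (src, dst) vertex pairs
    Returns (centroids, community_edges) where centroids maps each label to
    (mean_lon, mean_lat, member_count) and community_edges is a set of
    (label_a, label_b) pairs with label_a < label_b.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for v, lbl in labels.items():
        lon, lat = coords[v]
        acc = sums[lbl]
        acc[0] += lon
        acc[1] += lat
        acc[2] += 1
    centroids = {lbl: (sx / n, sy / n, n) for lbl, (sx, sy, n) in sums.items()}

    community_edges = set()
    for src, dst in edges:
        a, b = labels[src], labels[dst]
        if a != b:  # drop edges that stay inside one community
            community_edges.add((min(a, b), max(a, b)))
    return centroids, community_edges

# Hypothetical input: two communities (labels 0 and 2) joined by one edge.
labels = {0: 0, 1: 0, 2: 0, 3: 2, 4: 2, 5: 2}
coords = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (1.0, 3.0),
          3: (10.0, 10.0), 4: (12.0, 10.0), 5: (11.0, 13.0)}
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
centroids, community_edges = contract_communities(labels, coords, edges)
print(centroids, community_edges)
```

The contracted graph has one vertex per community, so an exhaustive cycle search that was hopeless on the original graph becomes feasible.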