Errors in PageRank of GraphFrames

Posted 2023-04-18

[Asked]: 2018-05-25 08:20:55
[Question]:

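I'm new to pyspark and am trying to understand how PageRank works. I'm using Spark 1.6 in Jupyter on Cloudera. Screenshots of my vertices and edges (and their schemas) are in these links: verticesRDD and edgesRDD

My code so far is as follows: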

#import relevant libraries for Graph Frames
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from graphframes import *

#Read the csv files 
verticesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/station.csv")
edgesRDD = sqlContext.read.format("com.databricks.spark.csv").options(header='true', inferschema='true').load("filepath/trip.csv")

#Renaming the id columns to enable GraphFrame 
verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Trip ID", "id")
edgesRDD = edgesRDD.withColumnRenamed("Start Station", "src")
edgesRDD = edgesRDD.withColumnRenamed("End Station", "dst")

#Register as temporary tables for running the analysis
verticesRDD.registerTempTable("verticesRDD")
edgesRDD.registerTempTable("edgesRDD")
#Note: whether I register the RDDs as temp tables or not, I get the same results, so I'm not sure this step is really needed

#Make the GraphFrame
g = GraphFrame(verticesRDD, edgesRDD)

Now when I run the pageRank function:

g.pageRank(resetProbability=0.15, maxIter=10)

Py4JJavaError: An error occurred while calling o98.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 79.0 failed 1 times, most recent failure: Lost task 0.0 in stage 79.0 (TID 2637, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

results = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="id")

Py4JJavaError: An error occurred while calling o166.run. : org.graphframes.NoSuchVertexException: GraphFrame algorithm given vertex ID which does not exist in Graph. Vertex ID id not contained in GraphFrame(v:[id: int, name: string, lat: double, long: double, dockcount: int, landmark: string, installation: string], e:[src: string, dst: string, id: int, Duration: int, Start Date: string, Start Terminal: int, End Date: string, End Terminal: int, Bike #: int, Subscriber Type: string, Zip Code: string])

ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()

AttributeError: 'function' object has no attribute 'resetProbability'

ranks = g.pageRank(resetProbability=0.15, maxIter=10).run()

Py4JJavaError: An error occurred while calling o188.run. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 90.0 failed 1 times, most recent failure: Lost task 0.0 in stage 90.0 (TID 2641, localhost): scala.MatchError: [null,null,[913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)

I have been reading about PageRank, but I don't understand where I'm going wrong. Any help would be appreciated.

[Answer 1]:

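The problem was in how I defined the vertices. I was renaming "station_ID" to "id", when in fact it had to be "name": the src and dst columns of the edges hold station names (strings), so the vertex id column has to hold station names as well. So this line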

verticesRDD = verticesRDD.withColumnRenamed("station_ID", "id")

must be

verticesRDD = verticesRDD.withColumnRenamed("name", "id")

pageRank works correctly with this change!
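To make the mismatch concrete, here is a minimal plain-Python sketch (no Spark required) of the check that would have exposed it: every src and dst value on an edge should also appear as a vertex id. The helper name `missing_endpoints` is hypothetical, the single trip is taken from the error message in the question, and the assumption that station_ID holds the same numbers as the terminal columns (50 and 70) is mine rather than something the question confirms.

```python
def missing_endpoints(vertex_ids, edges):
    """Return the src/dst values in `edges` that are not vertex ids."""
    known = set(vertex_ids)
    missing = set()
    for src, dst in edges:
        for endpoint in (src, dst):
            if endpoint not in known:
                missing.add(endpoint)
    return missing


# One trip from the error message in the question: the edge endpoints
# are station names (strings), not terminal numbers.
edges = [("Harry Bridges Plaza (Ferry Building)",
          "San Francisco Caltrain (Townsend at 4th)")]

# Assumption: station_ID holds the same integers as the
# "Start Terminal" / "End Terminal" columns (50 and 70 for this trip).
ids_from_station_id = [50, 70]
ids_from_name = ["Harry Bridges Plaza (Ferry Building)",
                 "San Francisco Caltrain (Townsend at 4th)"]

# With station_ID as the vertex id, both endpoints are unknown vertices.
print(missing_endpoints(ids_from_station_id, edges))

# With name as the vertex id, every endpoint resolves to a vertex.
print(missing_endpoints(ids_from_name, edges))  # set()
```

An empty result means every edge endpoint resolves to a vertex. A non-empty one points to the kind of id mismatch that plausibly produced the null entries in the MatchError above.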
