Connecting is Slow with PySpark

Posted 2019-04-05 09:25:07

I am using PySpark with the following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Scoring System").getOrCreate()
df = spark.read.csv('output.csv')
df.show()
About 5 to 10 minutes after running python trial.py on the command line, there is still no progress:
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2019-05-05 22:58:31 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2019-05-05 22:58:32 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
[Stage 0:> (0 + 0) / 1]2019-05-05 23:00:08 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:23 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:38 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:00:53 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[Stage 0:> (0 + 0) / 1]2019-05-05 23:01:08 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:01:23 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2019-05-05 23:01:38 WARN YarnScheduler:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
My hunch is that my worker nodes are short on resources, or is there something I am missing?
Comments:
What is the size of the file you are trying to read?

Hi @Vitaliy, it is about 21 GB.

My guess is that it is trying to infer the schema. Doing that requires loading a large part of the file to understand the nature of the data. Try specifying the schema explicitly (this is generally best practice anyway). You can see an example of how to do this at ***.com/a/49281042/180650
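As a minimal sketch of that suggestion, the read could look like the following; the column names and types here are hypothetical placeholders, since the real schema of output.csv is not shown in the question:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("Scoring System").getOrCreate()

# Hypothetical schema: replace the field names and types with the
# actual columns of output.csv.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("score", DoubleType(), True),
])

# With an explicit schema Spark skips the inference pass, so it no
# longer has to scan the 21 GB file before the job starts.
df = spark.read.csv("output.csv", schema=schema)
df.show()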
Answer 1:

Try increasing the number of executors and the executor memory:

pyspark --num-executors 5 --executor-memory 1G
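If the script is launched as python trial.py rather than through the pyspark shell, the same settings can be applied on the session builder. A sketch, assuming the YARN cluster actually has these resources to grant (the repeated "Initial job has not accepted any resources" warnings suggest the current request exceeds what is available):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Scoring System")
    # Same settings as the pyspark flags above; tune the values to
    # what the YARN cluster can grant, or the job will keep waiting.
    .config("spark.executor.instances", "5")
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)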