PySpark: saving a dataframe takes too long
Posted: 2020-07-24 06:24:58

Question: I have a PySpark dataframe in Databricks, shown below. The dataframe consists of 4,844,472 rows. Displaying it takes 2.70 minutes:
mp.show()
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
| d1| d2| d3| d4| d5| idx1| idx2| idx3| idx4| idx5|stop_id|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
|9.595641094599582E-4|0.001349351889471...|0.001349351889471...|0.001349351889471...|0.001349351889471...| 28230| 17538| 26928| 19679| 17939| 0|
|0.001073202843710...|0.001270625201076...|0.001270625201076...|0.001270625201076...|0.001270625201076...| 28230| 17939| 17538| 26928| 24350| 1|
| 0.5018332258683085| 0.6136104198426214| 0.7515940084598605| 0.7923086910541867| 0.8528614951791638| 36508| 352| 41406| 8666| 49244| 2|
| 0.5018463054690073| 0.6132230820328666| 0.7511594488585572| 0.7918622881865559| 0.8524241433703198| 36508| 352| 41406| 8666| 49244| 3|
| 0.03892296364448588| 0.10489822816393383| 0.11015065590036736| 0.11083574976820404| 0.11107823934046591| 8666| 41406| 15387| 48473| 67948| 4|
| 10.02685122773378| 10.026859886604985| 10.026931929963919| 10.027049899955523| 10.02708752857522| 96155| 99120| 93630| 95712| 95603| 5|
| 0.0949417179722534| 0.09624239157298783| 0.09663276949951659| 0.09666148620040976| 0.09668953319514831| 43297| 43729| 1552| 13413| 28338| 6|
| 1.58821803894894| 1.700924159639725| 1.7100413892619204| 1.7659644202932838| 1.7716894514740533| 36508| 31802| 32021| 352| 41742| 7|
| 0.14986457872379202| 2.792841786494224| 3.836931747376168| 3.843816724749531| 3.9381444585189453| 35388| 41824| 31802| 32021| 41742| 8|
| 0.07721536374839136| 0.08156724948742954| 0.08179178347923806| 0.08197182486131196| 0.08230211151587184| 28852| 5286| 15116| 43700| 43297| 9|
| 0.07729090186445249| 0.08164045431643911| 0.08186450776482652| 0.08204599950900325| 0.08237366675966874| 28852| 5286| 15116| 43700| 43297| 10|
| 0.0769126077608714| 0.08126623437928565| 0.0814915948802193| 0.08166946271648905| 0.08200422782781865| 28852| 5286| 15116| 43700| 43297| 11|
| 0.07726243730458815| 0.08161929282648625| 0.08184445756719544| 0.08202232556886682| 0.0823560729538226| 28852| 5286| 15116| 43700| 43297| 12|
|0.003059320786099506|0.006049295374860495|0.006068327803710736|0.006073689066371823|0.006076662805415367|116339|107049|115787|110162|115325| 13|
|0.008394860593130297| 0.01460154756618598| 0.01464517932249764|0.014657324902570745|0.014662473132286578|116339|107049|115787|110162|115325| 14|
|0.001033675981635...|0.002839691356009074|0.003808353392737469|0.003818776963070...| 0.00398314099011343|116760|114788|115385|111516|116688| 15|
| 1.3353905677767632| 2.3859918643288904| 2.5926306493938913| 2.6000405755949068| 2.6901787282764746| 35388| 41824| 31802| 32021| 41742| 16|
| 0.00476180910371182|0.005343904103854576|0.005609118384537962|0.005762718043973694|0.005970448424488381| 81157| 81355| 79754| 79586| 80617| 17|
|9.337105318309089E-5|4.923642967966935E-4|6.450655567561298E-4|7.293044985905078E-4|0.001032583874460...|100731| 92800|100571| 89266| 88715| 18|
|0.004311753043494...|0.005008322149796936|0.005161120819827323|0.005407692984541363|0.005592887249437105| 81157| 79754| 79586| 77492| 80617| 19|
+--------------------+--------------------+--------------------+--------------------+--------------------+------+------+------+------+------+-------+
Command took 2.70 minutes
But if I try to save it, it takes seemingly forever:
mp.write.mode("append").format("orc").save("mnt/tmp/")
Comments:
You probably need a leading slash on the save path: "/mnt/…". If this is a mounted resource, a physical write is happening; you could try saving to HDFS instead. Also, depending on the transformations, show() only processes a few dozen records; to force evaluation of the entire DataFrame you can use df.rdd.count().

Disk I/O and parallelism relative to the volume of data are what need to be watched here.

I would be explicit about the path you are saving to, i.e. .save("dbfs:/mnt/tmp")
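The comment's point that show() only evaluates a handful of rows while count() (or a write) materializes everything can be illustrated with a small timing helper. This is a generic sketch: the name `time_action` is my own, not part of any Spark API.

```python
import time

def time_action(label, action):
    """Run a zero-argument callable (e.g. a Spark action) and report its wall time."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed

# Usage against a Spark DataFrame would look like:
#   time_action("show (first 20 rows)", lambda: mp.show())
#   time_action("full materialization", lambda: mp.rdd.count())
```

Comparing the two timings tells you whether the write is slow because of the sink, or simply because the full lineage has never been evaluated before.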
Answer 1:
Try repartitioning before saving:
mp.repartition(200).write.mode("append").format("orc").save("mnt/tmp/")
Use an appropriate number of partitions based on the size of the dataframe. The optimal partition size is between 500 MB and 1 GB.
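The sizing rule above can be sketched as a small helper that picks a partition count from an estimated total size. The helper name and the 512 MB default target are my own choices for illustration, not anything Spark provides.

```python
import math

def suggest_partitions(total_size_bytes, target_partition_mb=512):
    """Pick a partition count so each partition lands in the 500 MB - 1 GB range."""
    target = target_partition_mb * 1024 * 1024
    return max(1, math.ceil(total_size_bytes / target))

# e.g. for a dataframe estimated at ~10 GB:
#   n = suggest_partitions(10 * 1024**3)
#   mp.repartition(n).write.mode("append").format("orc").save("/mnt/tmp/")
```

Estimating the on-disk size up front (for example from a sample, or from the size of the input files) avoids both the many-tiny-files problem of too many partitions and the low-parallelism problem of too few.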
Comments:
An appropriate partition size is about 128 MB, equal to the HDFS block size.

I think it depends on the usage and the format type.