How can I zip two RDDs in PySpark?
Posted: 2017-01-24 00:38:30
Question: I have been trying to merge the two RDDs below, averagePoints1 and kpoints2. It keeps throwing this error:
ValueError: Can not deserialize RDD with different number of items in pair: (2, 1)
I have tried many things, but I could not get it to work even though the two RDDs are identical and have the same number of partitions. My next step is to apply a Euclidean distance function to the two lists to measure the difference, so if anyone knows how to fix this error, or can suggest a different approach I could follow, I would appreciate it.
Thanks in advance.
averagePoints1 = averagePoints.map(lambda x: x[1])
averagePoints1.collect()
Out[15]:
[[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
kpoints2 = sc.parallelize(kpoints, 4)
In [17]:
kpoints2.collect()
Out[17]:
[[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
Answer 1:
a = [[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
b = [[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
rdda = sc.parallelize(a)
rddb = sc.parallelize(b)
c = rdda.zip(rddb)
print(c.collect())
Also check this answer: Combine two RDDs in pyspark
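Note: RDD.zip assumes the two RDDs have the same number of partitions and the same number of elements in each partition, which is why the plain zip above works for two freshly parallelized lists but can fail for an RDD produced by a transformation chain. When that assumption does not hold, a common workaround is to key both RDDs by position and join them; a minimal sketch, continuing the snippet above (variable names are illustrative):

# Key each element by its position, then join; works even when zip() refuses.
indexed_a = rdda.zipWithIndex().map(lambda x: (x[1], x[0]))  # (index, point)
indexed_b = rddb.zipWithIndex().map(lambda x: (x[1], x[0]))
pairs = indexed_a.join(indexed_b).values()  # RDD of (point_a, point_b) pairs
print(pairs.collect())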
Comments:
kpoints2 is a sample taken from an RDD, and averagePoints also comes from an RDD; I am going to write a while loop that runs until convergence, so this solution does not help. Do you have any other ideas, please?
Answer 2:
newSample = newCenters.collect()  # new centers, as a list
samples = list(zip(newSample, sample))  # sample => old centers; list() so parallelize gets a concrete collection
samples1 = sc.parallelize(samples)
# Python 3 lambdas cannot unpack tuples, so index into the pair instead
totalDistance = samples1.map(lambda pair: distanceSquared(pair[0][1], pair[1]))
For future searchers, this is the solution I ended up following.
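distanceSquared is not defined in the answer above; a hypothetical helper (the name matches the snippet, but the body and the final sum are assumptions, not part of the original answer) could look like:

# Hypothetical helper: squared Euclidean distance between two 2-D points.
def distanceSquared(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# Continuing the snippet above: collapse the per-center distances into one
# number and test it against an (arbitrary, example) convergence threshold.
converged = totalDistance.sum() < 1e-6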