Reading from and writing to Hive tables after aggregation

Question

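We have a Hive warehouse and want to use Spark for various tasks (mainly classification), sometimes writing the results back to a Hive table. For example, we wrote the following Python code to find the sum of the second column of original_table, grouped by its first column. The code works, but we are worried that it is inefficient, particularly the map that converts rows to key-value pairs and the map that converts them to dictionaries. The functions combiner, mergeValue, and mergeCombiner are defined elsewhere and work fine.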

from pyspark.sql import HiveContext

# create one HiveContext and reuse it for both the read and the write
hive_ctx = HiveContext(sc)

rdd = hive_ctx.sql('from original_table select *')

# convert to key-value pairs
key_value_rdd = rdd.map(lambda x: (x[0], int(x[1])))

# create rdd where rows are (key, (sum, count))
combined = key_value_rdd.combineByKey(combiner, mergeValue, mergeCombiner)

# create rdd with dictionary values in order to create a SchemaRDD
dict_rdd = combined.map(lambda x: {'k1': x[0], 'v1': x[1][0], 'v2': x[1][1]})

# infer the schema
schema_rdd = hive_ctx.inferSchema(dict_rdd)

# save
schema_rdd.saveAsTable('new_table_name')
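
The three functions passed to combineByKey are not shown. Going by the "(key, (sum, count))" comment, they presumably look something like the sketch below; the bodies are an assumption, not the actual definitions. The helper combine_by_key_local is hypothetical and only imitates the per-partition and cross-partition merging locally, so the logic can be checked without a Spark cluster.

```python
def combiner(value):
    # First value seen for a key within a partition: start a (sum, count) pair.
    return (value, 1)


def mergeValue(acc, value):
    # Fold another value from the same partition into the running (sum, count).
    return (acc[0] + value, acc[1] + 1)


def mergeCombiner(left, right):
    # Merge the (sum, count) pairs that two partitions produced for the same key.
    return (left[0] + right[0], left[1] + right[1])


def combine_by_key_local(pairs, create, merge_value, merge_combiners, num_partitions=2):
    """Hypothetical local stand-in for RDD.combineByKey, for checking the logic only."""
    # Split the input round-robin into fake partitions.
    partitions = [pairs[i::num_partitions] for i in range(num_partitions)]

    # Stage 1: combine values within each partition.
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = merge_value(acc[key], value) if key in acc else create(value)
        partials.append(acc)

    # Stage 2: merge the per-partition results.
    result = {}
    for acc in partials:
        for key, combined in acc.items():
            result[key] = merge_combiners(result[key], combined) if key in result else combined
    return result


pairs = [("a", 1), ("a", 3), ("b", 2), ("b", 4), ("a", 5)]
totals = combine_by_key_local(pairs, combiner, mergeValue, mergeCombiner)
print(totals)
```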

Is there a more efficient way to do the same thing?
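
One direction worth considering is to express the aggregation in HiveQL, so that the rows are grouped inside the SQL engine and never pass through Python lambdas. The sketch below assumes the two columns are named k and v; these names are hypothetical, since the code above only refers to the columns by position. The helper names build_ctas_query and aggregate_in_hive are also made up for illustration, and the CAST only roughly mirrors the int() conversion in the original map.

```python
def build_ctas_query(source, target, key_col, value_col):
    """Build a CREATE TABLE AS SELECT statement producing (k1, v1=sum, v2=count)."""
    return (
        "CREATE TABLE {target} AS "
        "SELECT {k} AS k1, SUM(CAST({v} AS INT)) AS v1, COUNT(*) AS v2 "
        "FROM {source} GROUP BY {k}"
    ).format(target=target, source=source, k=key_col, v=value_col)


def aggregate_in_hive(sc, source="original_table", target="new_table_name"):
    """Run the aggregation through a HiveContext built on an existing SparkContext."""
    # Imported here so the query builder can be used without Spark installed.
    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    # "k" and "v" are assumed column names; replace them with the real ones.
    hive_ctx.sql(build_ctas_query(source, target, "k", "v"))


query = build_ctas_query("original_table", "new_table_name", "k", "v")
print(query)
```

With a live SparkContext named sc, calling aggregate_in_hive(sc) would create new_table_name directly, with no intermediate dictionary RDD or schema inference.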
