How to apply a User Defined Function on each subset formed by a Group By operation in Apache Spark?

Posted 2023-04-13

【Title】How to apply User Defined Function on each subset formed by a Group By Operation in Apache Spark? 【Posted】2016-06-19 18:10:33 【Question】:

I have a DataFrame that looks like this:

    [ID_number,cust_number,feature1,feature2,feature3,....]

Now I want to write a query that groups by ID_number and applies a user-defined function to each resulting subset:

    [cust_number,feature1,feature2,feature3,......]

For each ID_number group, I need to apply a machine learning algorithm to the features and store the resulting weights somehow.

How can I do this with Apache Spark DataFrames (using Scala)?

【Question comments】:

Possible duplicate of "How can I define and use a User-Defined Aggregate Function in Spark SQL?"

【Answer 1】:

You can do something like this (pyspark):

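    from pyspark.sql.types import StructField, StringType, StructType

    # Column names of each per-ID_number subset
    schema_string = "cust_number,feature1,feature2,feature3"

    # Declare every column as a nullable string
    fields = [StructField(field_name, StringType(), True) for field_name in schema_string.split(",")]
    schema = StructType(fields)

    # sql_context and group_by_result_rdd are assumed to exist already
    df = sql_context.createDataFrame(group_by_result_rdd, schema)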

Note: this assumes that all of your features are of string type. See the API documentation for other data types.
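The answer does not show how the per-group function itself would be applied, so here is a minimal sketch of that step. It rests on several assumptions: the data is arranged as `(ID_number, (cust_number, feature1, feature2, feature3))` pairs, the features are strings as in the note above, and `fit_group` (a hypothetical name) stands in for the real machine learning algorithm by returning the mean of each feature as its "weights". `apply_per_group` is a plain-Python stand-in so the example runs without a cluster; on an actual pair RDD the same function could be passed as `pair_rdd.groupByKey().mapValues(fit_group)`.

```python
from collections import defaultdict


def fit_group(rows):
    """Placeholder for a per-group model: returns the mean of each feature.

    Each row is (cust_number, feature1, feature2, ...), with features as strings.
    """
    rows = list(rows)  # also accepts the iterable produced by groupByKey()
    n_features = len(rows[0]) - 1  # skip cust_number in position 0
    return [
        sum(float(row[1 + j]) for row in rows) / len(rows)
        for j in range(n_features)
    ]


def apply_per_group(records, func):
    """Local stand-in for pair_rdd.groupByKey().mapValues(func)."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: func(values) for key, values in groups.items()}


# Hypothetical sample data: (ID_number, (cust_number, feature1, feature2, feature3))
records = [
    ("A", ("c1", "1.0", "2.0", "3.0")),
    ("A", ("c2", "3.0", "4.0", "5.0")),
    ("B", ("c3", "10.0", "20.0", "30.0")),
]

# One list of "weights" per ID_number
weights = apply_per_group(records, fit_group)
```

The result is a mapping from each ID_number to its weights, which is one way to "store the weights somehow"; with Spark, the equivalent pair RDD could be collected or turned back into a DataFrame.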

【Comments】:
