How to convert DataFrame columns from string to float/double in PySpark 1.6?

Posted 2023-04-14

【Title】: How to convert DataFrame columns from string to float/double in PySpark 1.6? 【Posted】: 2016-02-28 14:55:34 【Question】:

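As far as I can tell, PySpark 1.6 DataFrames currently have no built-in Spark function for converting a string to a float/double.

Suppose we have an RDD of ('house_name', 'price') pairs in which both values are strings, and we want to convert the price from string to float. With an RDD, we can do this by applying map together with Python's float function: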

New_RDD = RawDataRDD.map(lambda (house_name, price): (house_name, float(price)))    # this works
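The lambda above relies on tuple-parameter unpacking, which exists only in Python 2. A minimal sketch of the same conversion that also runs on Python 3 is shown here; the helper name `convert_price` and the sample rows are made up for illustration, while `RawDataRDD` is the RDD from the question.

```python
def convert_price(record):
    """Turn a (house_name, price) pair of strings into (house_name, float)."""
    # Unpack inside the function body; this works on Python 2 and Python 3.
    house_name, price = record
    return (house_name, float(price))


# With Spark this would be used as: New_RDD = RawDataRDD.map(convert_price)

# The helper is plain Python, so it can be tried on an ordinary list first.
rows = [("Villa A", "250000.5"), ("Cottage B", "99000")]
converted = list(map(convert_price, rows))
```

Because `convert_price` does not depend on Spark, it can be unit tested on its own before being passed to `map`.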

With a PySpark 1.6 DataFrame, the same approach does not work:

New_DF = rawdataDF.select('house name', float('price')) # did not work

Until a built-in PySpark function is available, how can this conversion be done with a UDF? I wrote the following conversion UDF:

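from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

def string_to_float(x):
    return float(x)

# Declare the UDF's return type as double so the resulting column is numeric
udfstring_to_float = udf(string_to_float, DoubleType())
# Overwrite the "price" column with its converted values
rawdata.withColumn("price", udfstring_to_float("price"))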
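A UDF like this fails the whole job if any price is missing or not numeric. A hedged sketch of a more tolerant variant follows: the wrapper `with_price_as_double` and the decision to return `None` for bad values are assumptions for illustration, not something the question specifies.

```python
def string_to_float(x):
    """Return x as a float, or None when x is missing or not numeric."""
    try:
        return float(x)
    except (TypeError, ValueError):
        # float(None) raises TypeError, float("abc") raises ValueError.
        return None


def with_price_as_double(df, column="price"):
    """Return df with `column` converted to double through a UDF.

    pyspark is imported inside the function (an illustrative choice) so that
    string_to_float can be tested without a Spark installation.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # The declared return type should match what the Python function returns.
    to_double = udf(string_to_float, DoubleType())
    return df.withColumn(column, to_double(df[column]))
```

Usage would be along the lines of `rawdata = with_price_as_double(rawdata)`, after which rows with unparsable prices carry a null in that column instead of aborting the job.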

Is there a better, simpler way to achieve the same result?

【Answer 1】:

According to the documentation, you can use the cast function on a column like this:

from pyspark.sql.types import DoubleType

rawdata.withColumn("house name", rawdata["price"].cast(DoubleType()).alias("price"))

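【Comments】:

This doesn't work for me, @Jaco. The OP says he is using PySpark 1.6, and the documentation you linked to is for 1.3. When I try this on 1.6, I get AttributeError: 'DoubleType' object has no attribute 'alias'

Did you import it with from pyspark.sql.types import DoubleType? I'm sure I tested this on PySpark 1.6 before posting.

FIX: it should be changed to rawdata.withColumn("house name",rawdata["price"].cast(DoubleType()).alias("price"))

【Answer 2】:

The answer should look like this: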

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: string (nullable = true)

>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: float (nullable = true)

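This is the shortest solution, a single line of code, and it does not use any user-defined function. You can call printSchema() to check that the conversion worked.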
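When several string columns need converting, the same cast can be applied in a loop. The following is only a sketch: the helper name `cast_columns` and its `type_name` parameter are hypothetical, and it assumes `Column.cast` is given a type name string such as "float" or "double", as in the answer above.

```python
def cast_columns(df, columns, type_name="double"):
    """Return df with each named column cast to type_name, without a UDF."""
    for name in columns:
        # withColumn with an existing column name replaces that column.
        df = df.withColumn(name, df[name].cast(type_name))
    return df


# Usage with the DataFrame from the question (requires a running Spark session):
# rawdata = cast_columns(rawdata, ["price"], "float")
# rawdata.printSchema()
#
# Strings that cannot be parsed typically come back as null rather than
# raising an error, so it can be worth counting them after the cast:
# bad_rows = rawdata.filter(rawdata["price"].isNull()).count()
```

Keeping the conversion in built-in column expressions rather than a Python UDF also avoids sending every row through the Python interpreter.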
