PySpark UDF 测试从 String 到 Int 的转换

Posted 2023-03-31

技术标签:

【中文标题】PySpark UDF 测试从 String 到 Int 的转换【英文标题】：PySpark UDF to test conversion from String to Int 【发布时间】：2017-08-22 10:21:38 【问题描述】：

我想应用一个 udf 函数，该函数只有在可以转换为 int 时才返回原始值。直到现在我尝试了 2 个功能：

def nb_int(s):
    try:
        val=int(s)
        return s
    except:
        return "ERROR"

def nb_digit(s):
    if (s.isdigit() == True):
        return s
    else:
        return "ERROR"
nb_udf = F.udf(nb_digit, StringType())
df_corrected=df.withColumn("IntValue",nb_udf("nb_value"))

我在“nb_value”上应用了这个函数。但它不起作用：

df_corrected.filter(df_corrected["IntValue"] == "ERROR").select("nb_value").dropDuplicates(subset=["nb_value"]).collect()

最后一行的结果应该只给我不可转换的值，但我仍然有 1、2、4 等...

[Row(nb_value=u'MS'),
 Row(nb_value=u'286'),
 Row(nb_value=u'TB'),
 Row(nb_value=u'GF'),
 Row(nb_value=u'287'),
 Row(nb_value=u'MU'),
 Row(nb_value=u'170'),
 Row(nb_value=u'A9'),
 Row(nb_value=u'288'),
 Row(nb_value=u'171'),
 Row(nb_value=u'333'),....

欢迎任何帮助解决它！谢谢

【问题讨论】：

【参考方案1】：

您的两个 UDF 都为我工作。你最好使用第二个，因为引发异常会减慢火花。 == True 在 nb_digit 中是多余的，但除此之外它很好：

首先让我们创建示例数据框：

df = hc.createDataFrame(sc.parallelize([['MS'],['286'],['TB'],['GF'],['287'],['MU'],['170'],['A9'],['288'],['171'],['333']]), ["nb_value"])

使用您的任何 UDF：

df_corrected = df.withColumn("IntValue", nb_udf("nb_value"))
df_corrected.filter(df_corrected["IntValue"] == "ERROR").select("nb_value").dropDuplicates(subset=["nb_value"]).collect()

[Row(nb_value=u'MS'),
 Row(nb_value=u'TB'),
 Row(nb_value=u'GF'),
 Row(nb_value=u'A9'),
 Row(nb_value=u'MU')]

【讨论】：

以上是关于PySpark UDF 测试从 String 到 Int 的转换的主要内容，如果未能解决你的问题，请参考以下文章