Pattern matching on a range in Scala and Spark UDFs
我有一个Spark DataFrame,其中包含我使用Likert量表与数字分数匹配的字符串。不同的问题Ids映射到不同的分数。我正在尝试使用此问题作为指南在Apache Spark udf中对Scala中的范围进行模式匹配:
How can I pattern match on a range in Scala?
But when I use a range instead of a simple OR statement, I get a compile error: 31 | 32 | 33 | 34 works fine, while 31 to 35 does not compile. Am I getting the syntax wrong?
Also, in the final case _ I would like to map to a String rather than an Int, i.e. case _ => "None", but this gives an error: java.lang.UnsupportedOperationException: Schema for type Any is not supported. Presumably this is a generic Spark issue, since the match would most likely return Any in native Scala?
Here is my code:
def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4 //this is fine
case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
case ((31 | 32 | 33 | 34 | 35), "Often") => 2
case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
case ((x if 41 until 55 contains x), "None of the time") => 1 //this line won't compile
case _ => 0 //would like to map to "None"
})
The UDF is then used on a Spark DataFrame as follows:
val df3 = df.withColumn("NumericScore", calculateScore(df("QuestionId"), df("AnswerText")))
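A minimal DataFrame to reproduce this might look as follows; this is only a sketch, with the column names taken from the snippet above and the rows invented for illustration:

import org.apache.spark.sql.SparkSession

// Hypothetical toy data to exercise the UDF; column names match the
// snippet above, values are made up.
val spark = SparkSession.builder().master("local[*]").appName("likert").getOrCreate()
import spark.implicits._

val df = Seq(
  (31, "Occasionally"),
  (42, "None of the time")
).toDF("QuestionId", "AnswerText")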
The guard expression should be placed after the pattern:
def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4
case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
case ((31 | 32 | 33 | 34 | 35), "Often") => 2
case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
case (x, "None of the time") if 41 until 55 contains x => 1
case _ => 0 //would like to map to "None"
})
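As a quick sanity check outside Spark, the same match with the guard after the pattern compiles and behaves as expected. A minimal plain-Scala sketch; the score helper and the test values here are illustrative only:

// Plain-Scala check of the guard syntax, independent of Spark.
def score(questionId: Int, answerText: String): Int =
  (questionId, answerText) match {
    case (31 | 32 | 33 | 34 | 35, "Occasionally") => 3
    case (x, "None of the time") if 41 until 55 contains x => 1 // guard compiles here
    case _ => 0
  }

assert(score(42, "None of the time") == 1) // falls in the 41 until 55 range
assert(score(31, "Occasionally") == 3)
assert(score(99, "Occasionally") == 0)     // falls through to the default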
If you want to map the final case, i.e. case _, to the String "None", then all of the cases should return a String. The following udf should work for you:
def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => "4"
case ((31 | 32 | 33 | 34 | 35), "Occasionally") => "3"
case ((31 | 32 | 33 | 34 | 35), "Often") => "2"
case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => "1"
case (x, "None of the time") if (x >= 41 && x < 55) => "1" //this line won't compile
case _ => "None"
})
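Note that with this version NumericScore becomes a string column. If numeric comparisons are needed later, the column can be cast back, in which case the "None" rows come out as null. A sketch, reusing the df3 from the question:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// "None" cannot be parsed as an integer, so those rows become null after the cast.
val numeric = df3.withColumn("NumericScore", col("NumericScore").cast(IntegerType))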
If instead you want to map the final case, i.e. case _, to None, then you need to change the other return values to subtypes of Option as well, since None is a subtype of Option. The following code should also work for you:
def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => Some(4)
case ((31 | 32 | 33 | 34 | 35), "Occasionally") => Some(3)
case ((31 | 32 | 33 | 34 | 35), "Often") => Some(2)
case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => Some(1)
case (x, "None of the time") if (x >= 41 && x < 55) => Some(1) // guard after the pattern, so this compiles
case _ => None
})
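With the Option[Int] version, Spark derives a nullable integer column, so the case _ branch surfaces as null rather than as a value. A usage sketch, reusing the column names from the question, for finding the unmatched rows:

// None from the UDF shows up as null in the resulting column,
// so unmatched rows can be located with isNull.
val df3 = df.withColumn("NumericScore", calculateScore(df("QuestionId"), df("AnswerText")))
val unmatched = df3.filter(df3("NumericScore").isNull)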
One final point on the error message java.lang.UnsupportedOperationException: Schema for type Any is not supported: it states explicitly that a udf whose return type is Any is not supported. The return types of all the match cases should be consistent.
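The Any comes straight from Scala's type inference: when the branches of a match return both Int and String, their least upper bound is Any, and Spark cannot derive a schema for that. A minimal illustration:

// The least upper bound of Int and String is Any, which is why
// Spark's schema derivation fails for the mixed-type UDF.
val mixed = (i: Int) => i match {
  case 1 => 1        // Int branch
  case _ => "None"   // String branch
}
// mixed is inferred as Int => Any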