java.lang.NumberFormatException in Scala for mllib

Posted: 2017-10-05 17:25:51

Question:

I'm just trying to build a simple K-Means example, but I'm running into a lot of trouble loading and cleaning the data.

Here is my code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.sql.functions._


val crimeRDD = sc.textFile("/home/borja/spark/pruebas/AlgoritmoClusterizacion/filter_data1.csv")

val header = crimeRDD.first

val data = crimeRDD.filter (justData => justData != header)

//Spark doesn't allow more than 22 element
case class crimeReport (Record_ID: Int, Agency_Name: String, City: String, State: String, Year: Int, Month: String, Crime_Type: String, Crime_Solved: String, Victim_Sex: String, Victim_Age: Int, Victim_Race: String, Perpetrator_Sex: String, Perpetrator_Age: Int, Perpetrator_Race: String, Relationship: String, Victim_Count: Int)

val data_split = data.map(line => line.split(","))

val allData = data_split.map(p => crimeReport(p(0).trim.toInt, p(1).trim.toString, p(2).trim.toString, p(3).trim.toString, p(4).trim.toInt, p(5).trim.toString, p(6).trim.toString, p(7).trim.toString, p(8).trim.toString, p(9).trim.toInt, p(10).trim.toString, p(11).trim.toString, p(12).trim.toInt, p(13).trim.toString,p(14).trim.toString, p(15).trim.toInt))

val allDF = allData.toDF()
allDF.printSchema
//allDF.show(100)

val rowsRDD = allDF.rdd.map(r => (r.getInt(0),r.getString(1),r.getString(2), r.getString(3),r.getInt(4), r.getString(5), r.getString(6), r.getString(7),r.getString(8), r.getInt(9), r.getString(10), r.getString(11),r.getInt(12), r.getString(13),r.getString(14), r.getInt(15)))

rowsRDD.cache()

val features_vector = allDF.rdd.map(r => Vectors.dense(r.getInt(0)))

features_vector.cache()

val KMeansModel = KMeans.train(features_vector,2,40)

But I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 43.0 failed 1 times, most recent failure: Lost task 0.0 in stage 43.0 (TID 57, localhost, executor driver): java.lang.NumberFormatException: For input string: "Jersey"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)

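I don't understand, because I'm using .trim to strip whitespace, right? The exception is thrown because some values are still non-numeric strings when I call .toInt, correct? So how can I filter them out, given that there are 65k rows?

Here is some of the data: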

Record ID,Agency Name,City,State,Year,Month,Crime Type,Crime Solved,Victim Sex,Victim Age,Victim Race,Perpetrator Sex,Perpetrator Age,Perpetrator Race,Relationship,Victim Count
1,Anchorage,Anchorage,Alaska,1980,January,Murder or Manslaughter,Yes,Male,14,Native American/Alaska Native,Male,15,Native American/Alaska Native,Acquaintance,0
13504,Atlantic City,Atlantic,Jersey,1980,January,Murder or Manslaughter,Yes,Male,40,Black,Female,50,Black,Acquaintance,0
13505,Atlantic City,Atlantic,Jersey,1980,January,Murder or Manslaughter,No,Male,23,Black,Unknown,0,Unknown,Unknown,0
13506,Atlantic City,Atlantic,Jersey,1980,January,Murder or Manslaughter,No,Male,52,White,Unknown,0,Unknown,Unknown,0
13507,Atlantic City,Atlantic,Jersey,1980,March,Murder or Manslaughter,Yes,Male,35,Black,Male,23,Black,Unknown,0
13508,Atlantic City,Atlantic,Jersey,1980,March,Murder or Manslaughter,No,Male,25,Black,Unknown,0,Unknown,Unknown,0
13647,Jersey City,Hudson,Jersey,1980,October,Murder or Manslaughter,No,Female,50,White,Unknown,0,Unknown,Unknown,2
13648,Jersey City,Hudson,Jersey,1980,March,Murder or Manslaughter,Yes,Female,60,White,Male,36,White,Father,1
13649,Jersey City,Hudson,Jersey,1980,June,Murder or Manslaughter,Yes,Female,52,Black,Male,26,Black,Unknown,1
13650,Jersey City,Hudson,Jersey,1980,October,Murder or Manslaughter,No,Male,2,White,Unknown,0,Unknown,Unknown,2
13651,Jersey City,Hudson,Jersey,1980,January,Murder or Manslaughter,Yes,Female,68,Black,Male,0,Black,Unknown,0
13652,Jersey City,Hudson,Jersey,1980,January,Murder or Manslaughter,Yes,Female,22,Black,Male,23,Black,Unknown,0
13653,Jersey City,Hudson,Jersey,1980,January,Murder or Manslaughter,Yes,Female,16,White,Male,33,White,Acquaintance,0
13654,Jersey City,Hudson,Jersey,1980,January,Murder or Manslaughter,Yes,Male,34,White,Male,18,White,Acquaintance,0
13655,Jersey City,Hudson,Jersey,1980,February,Murder or Manslaughter,No,Male,29,Black,Unknown,0,Unknown,Unknown,0
13656,Jersey City,Hudson,Jersey,1980,February,Murder or Manslaughter,No,Male,42,White,Unknown,0,Unknown,Unknown,0
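The lookup suggested in the comments can be sketched without Spark. This is a minimal sketch, assuming the numeric columns are the ones the case class declares as Int (indices 0, 4, 9, 12 and 15). The helper `badColumns` is hypothetical, and the `shifted` line is made up for illustration: it contains one extra comma, which moves every later field one position to the right.

```scala
import scala.util.Try

// Indices of the fields that crimeReport declares as Int (taken from the case class in the question).
val intColumns = Seq(0, 4, 9, 12, 15)

// Hypothetical helper: returns (index, raw value) for every Int column that does not parse.
def badColumns(line: String): Seq[(Int, String)] = {
  val p = line.split(",", -1).map(_.trim) // limit -1 keeps trailing empty fields
  intColumns
    .map(i => i -> (if (i < p.length) p(i) else "<missing>"))
    .filter { case (_, raw) => Try(raw.toInt).isFailure }
}

// `good` is copied from the sample data; `shifted` is an invented line with an extra comma.
val good    = "13504,Atlantic City,Atlantic,Jersey,1980,January,Murder or Manslaughter,Yes,Male,40,Black,Female,50,Black,Acquaintance,0"
val shifted = "13999,Newark,Essex,New,Jersey,1980,January,Murder or Manslaughter,Yes,Male,40,Black,Female,50,Black,Acquaintance,0"

// Keep only the lines that have at least one unparseable Int column.
val report = Seq(good, shifted).map(l => l -> badColumns(l)).filter(_._2.nonEmpty)
```

In spark-shell the same helper could be applied to the RDD from the question, for example `data.filter(line => badColumns(line).nonEmpty).take(20)`, to inspect the first offending lines instead of guessing which of the 65k rows is malformed.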

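Comments:

Find the row that contains Jersey (it is not in your sample) and check what it looks like after your cleaning step. There may be a mistake somewhere (perhaps one of the values contains a string?).

Sorry, I had not included the rows with "Jersey". The data is now shown after the transformation code. The exception is thrown because some values are still strings when I call .toInt, right? So how can I filter them out, given that there are 65k rows?

Try data_split.map(p => crimeReport(...)).recover { case e: Throwable => logger.error(s"bad data: $p"); throw e }

Answer 1: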

Yes, that error is raised when a value cannot be converted to an integer. Why not try Option or Try? For example:

Option(p(0).trim.toInt) getOrElse 0

or

Try(p(0).trim.toInt) getOrElse 0

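That should be enough to avoid this kind of conversion error.

You mention in your code that a case class only supports 22 elements. That limit came from Scala rather than Spark, and it was lifted for case classes in Scala 2.11, so Spark builds on Scala 2.11 or later are not affected. I hope this answer helps.

Edit: Note that Option does not catch exceptions. Option(x) only turns a null into None, and p(0).trim.toInt throws before Option ever receives a value, so Option cannot fix this conversion error. Try captures nearly every kind of exception (all non-fatal ones), so Try is the right choice for conversion errors.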
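The difference described in the edit can be checked directly. This is a small sketch, not part of the original answer: the value `raw` stands in for a bad field such as p(0), and the outer Try around the Option call exists only so the escaping exception can be observed without aborting the script.

```scala
import scala.util.Try

// Stand-in for a field that is not a number.
val raw = " Jersey "

// Option.apply evaluates its argument first, so toInt throws before Option sees any value.
// The outer Try is only here to observe that failure.
val viaOption = Try(Option(raw.trim.toInt).getOrElse(0))

// Try captures the NumberFormatException, and getOrElse supplies the fallback.
val viaTry = Try(raw.trim.toInt).getOrElse(0)

// Option remains useful for the one case it covers: a null reference.
val nullField: String = null
val viaOptionNull = Option(nullField).map(_.trim).getOrElse("")
```

Here `viaOption` is a Failure wrapping the NumberFormatException, while `viaTry` is the fallback value 0, which matches what the asker observed.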

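Comments:

This is a very bad approach because it hides the problem.

It is the best approach I know of. If we want the problem to be visible (rather than hidden), we can add a match/case and throw the appropriate exception.

Thanks everyone for the help! In the end it worked with Try. With Option I still got the same error, but I don't know why. Can someone tell me the reason? For now, it is fine if the problem stays hidden.

@Borja, sure. I forgot that Option only deals with NullPointerException (null values), whereas Try handles all exceptions. Let me update the answer.

Answer 2: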

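You can use Try to handle the exception. In Scala you can use either the old Java-style try/catch or the functional Try. Below is the functional way to handle exceptions in Scala.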

import scala.util.{Try, Success, Failure}

Try(p(0).trim.toInt) match {
    case Success(result) => result
    case Failure(ex) =>
      ex.printStackTrace()
      // return whatever default value you want; 0 is returned here
      0
}

Try returns Success if the expression evaluates normally, and Failure if it throws an exception. I hope this is easier to understand and more practical.
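One concern raised in the comments is that a default of 0 hides bad rows. A possible variation, sketched here under the assumption of Scala 2.12 or later (where Either is right-biased), keeps the failure as a message instead of replacing it with a default. The helper `parseInt` and the two sample rows are hypothetical, and only columns 0 and 4 are parsed to keep the sketch short.

```scala
import scala.util.{Try, Success, Failure}

// Hypothetical helper: parse one field, keeping the column index and error text on failure.
def parseInt(p: Array[String], i: Int): Either[String, Int] =
  Try(p(i).trim.toInt) match {
    case Success(n)  => Right(n)
    case Failure(ex) => Left(s"column $i: ${ex.getMessage}")
  }

// Illustrative rows, already split on ","; the second one has shifted columns.
val rows = Seq(
  Array("1", "Anchorage", "Anchorage", "Alaska", "1980"),
  Array("13999", "Newark", "Essex", "New", "Jersey")
)

// A row is Right only if every parsed field succeeds; the first failure becomes the Left.
val parsed: Seq[Either[String, (Int, Int)]] = rows.map { p =>
  for {
    id   <- parseInt(p, 0)
    year <- parseInt(p, 4)
  } yield (id, year)
}

val goodRows = parsed.collect { case Right(r) => r }
val badRows  = parsed.collect { case Left(msg) => msg }
```

The clean rows in `goodRows` can go on to clustering, while `badRows` can be logged or counted, so malformed input is excluded from the features without being silently turned into zeros.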
