Error when using pyspark load to read data

Posted: 2018-05-15 10:42:19

Question:

I am trying to load a file using PySpark, as follows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('mylogreg').getOrCreate()

from pyspark.ml.classification import LogisticRegression

my_data = spark.read.format('libsvm').load('cars.csv')

But it keeps giving me the following error:

Py4JJavaError: An error occurred while calling o231.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 6, localhost, executor driver): java.lang.NumberFormatException: For input string: "YEAR,Make,Model,Size,(kW),Unnamed:"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:284)
    at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
    at org.apache.spark.mllib.util.MLUtils$.parseLibSVMRecord(MLUtils.scala:128)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$parseLibSVMFile$4.apply(MLUtils.scala:123)
    at org.apache.spark.mllib.util.MLUtils$$anonfun$parseLibSVMFile$4.apply(MLUtils.scala:123)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
    at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$14.apply(RDD.scala:1015)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$14.apply(RDD.scala:1013)
    at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2123)
    at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2123)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.ap

I can use a plain RDD instead of going through SQLContext, but then I cannot view the data as a table nicely.
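
For reference, this is roughly what I mean by the plain-RDD route (a sketch, assuming cars.csv is comma-delimited and in the working directory):

# Read the raw text and split each line on commas; the result is an
# RDD of lists of strings, with no tabular display like DataFrame.show().
sc = spark.sparkContext
rdd = sc.textFile('cars.csv').map(lambda line: line.split(','))
rdd.take(5)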

Comments:

Is the file you are trying to read in libsvm format?

No, it is a csv file

Answer 1:

The libsvm data source expects every line to look like "label index:value index:value ..." with all-numeric tokens, so it fails as soon as it tries to parse your CSV header row (YEAR,Make,Model,...) as a number. I think you should load the file as .csv instead:

my_data = spark.read.option("delimiter", ",").option("header", "false").csv('cars.csv')
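
Note that the stack trace shows the file has a header row, so reading with the header option enabled (and, optionally, schema inference) should give you named, typed columns instead of generic string ones; a sketch under that assumption:

# Treat the first line as column names and let Spark infer column types
# (inferSchema triggers an extra pass over the data).
my_data = spark.read.option("header", "true").option("inferSchema", "true").csv('cars.csv')
my_data.show()
my_data.printSchema()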
