ValueError: could not convert string to float


Posted: 2015-08-19 14:40:55

[Question]

I have a text file containing some data. The data looks like this:

join2_train = sc.textFile('join2_train.csv', 4)
join2_train.take(3)

[u'21.9059,TA-00002,S-0066,7/7/2013,0,0,Yes,1,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0',
 u'12.3412,TA-00002,S-0066,7/7/2013,0,0,Yes,2,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0',
 u'6.60183,TA-00002,S-0066,7/7/2013,0,0,Yes,5,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0']

Now I'm trying to pass these strings to a function that splits each line of text and converts it into a LabeledPoint. I've also included a line to convert the string elements to floats.

The function is as follows:

from pyspark.mllib.regression import LabeledPoint
import numpy as np

def parsePoint(line):
    """Converts a comma separated unicode string into a `LabeledPoint`.

    Args:
        line (unicode): Comma separated unicode string where the first element is the label and the
            remaining elements are features.

    Returns:
        LabeledPoint: The line is converted into a `LabeledPoint`, which consists of a label and
            features.
    """
    values = line.split(',')
    value1 = [map(float,i) for i in values]
    return LabeledPoint(value1[0],value1[1:]) 

Now, when I try to run an action on the parsed RDD, I get a ValueError. The action I'm trying is as follows:

parse_train = join2_train.map(parsePoint)

parse_train.take(5)

The error message I get is as follows:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-63-f53b10964381> in <module>()
      1 parse_train = join2_train.map(parsePoint)
      2 
----> 3 parse_train.take(5)

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in take(self, num)
   1222 
   1223             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1224             res = self.context.runJob(self, takeUpToNumLeft, p, True)
   1225 
   1226             items += res

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    840         mappedRDD = rdd.mapPartitions(partitionFunc)
    841         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions,
--> 842                                           allowLocal)
    843         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
    844 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 31, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
    process()
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
    yield next(iterator)
  File "<ipython-input-62-0243c4dd1876>", line 18, in parsePoint
ValueError: could not convert string to float: .

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

[Comments]

The part after the colon in the error (.) tells you it is trying to convert something it does not recognize as a number (.). Clean your data, or don't pass bad data to this function.

@chicks It can't recognize the decimal point. If you remove all the data and keep only the first number from the example above, "21.9059", the error still occurs. Any idea why?

[Answer 1]

Add this function to check whether a string can be converted to a float:

def isfloat(string):
    try:
        float(string)
        return True
    except ValueError:
        return False
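
For example, applied to fields from the sample rows above:

print(isfloat(u'21.9059'))   # True  -- a plain decimal string parses fine
print(isfloat(u'TA-00002'))  # False -- the part-number field does not
print(isfloat(u'.'))         # False -- neither does a lone decimal point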

Then, in parsePoint:

value1 = [map(float,i) for i in values if isfloat(i)]

By modifying the float-conversion line as follows:

value1 = [float(i) for i in values]

and then parsing a string that contains only numeric values, we get the correct LabeledPoints. The real problem, however, is trying to build LabeledPoint objects from strings in the join2_train data, such as TA-00002, that cannot be converted to floats.
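
Putting the two pieces together, here is a minimal sketch of a parser that keeps the first field as the label and silently drops every feature field that is not numeric. The name parsePointNumericOnly is made up for illustration, and dropping fields discards information such as TA-00002; a real pipeline would more likely encode such categorical values instead:

from pyspark.mllib.regression import LabeledPoint

def parsePointNumericOnly(line):
    """Parse a CSV line into a LabeledPoint, keeping numeric fields only.

    Assumes the first field is always the numeric label; non-numeric
    fields such as TA-00002 or 7/7/2013 are simply skipped.
    """
    values = line.split(',')
    label = float(values[0])                                  # e.g. 21.9059
    features = [float(v) for v in values[1:] if isfloat(v)]   # uses isfloat from above
    return LabeledPoint(label, features)

parse_train = join2_train.map(parsePointNumericOnly)
parse_train.take(5)   # should now succeed on the sample rows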

[Discussion]

Interestingly, this doesn't work for the value "21.9059". For some reason it can't recognize the decimal point: isfloat returns True, but the mapping still raises the error.

Do you mean when values = '21.9059.' or i = '21.9059'?

I mean when the csv file join2_train.csv (as in the OP's post) contains only one row, 21.9059, i.e. when values = [u'21.9059'].

I modified the second line of the parsePoint function to value1 = [float(i) for i in values]. After doing that and parsing a string with the values given here, templp1 = '21.9059,0,0,6.35,0.71,137,8,19.05', the correct float objects are computed. The real problem, however, is the attempt to convert strings like TA-00002, which are part of join2_train, to float.
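
The decimal-point confusion above comes from map(float, i): in Python 2, mapping over a string iterates over its individual characters, so float is eventually called on the lone '.' character. A short sketch of the failure:

# Python 2: iterating over a string yields single characters, so
# map applies float to u'2', u'1', u'.', u'9', ... in turn.
s = u'21.9059'
try:
    map(float, s)   # float(u'2') and float(u'1') succeed, then float(u'.')
except ValueError as e:
    print(e)        # could not convert string to float: .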
