ValueError: could not convert string to float
Posted: 2015-08-19 14:40:55

Problem description: I have a text file containing some data. The data looks like this:
join2_train = sc.textFile('join2_train.csv',4)
join2_train.take(3)
[u'21.9059,TA-00002,S-0066,7/7/2013,0,0,Yes,1,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0',
u'12.3412,TA-00002,S-0066,7/7/2013,0,0,Yes,2,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0',
u'6.60183,TA-00002,S-0066,7/7/2013,0,0,Yes,5,SP-0019,6.35,0.71,137,8,19.05,N,N,N,N,EF-008,EF-008,0,0,0']
Now I am trying to parse these strings with a function that splits each line of text and converts it into a LabeledPoint. I also included a line to convert the string elements to floats.
The function is as follows:
from pyspark.mllib.regression import LabeledPoint
import numpy as np
def parsePoint(line):
    """Converts a comma separated unicode string into a `LabeledPoint`.

    Args:
        line (unicode): Comma separated unicode string where the first element is the label and the
            remaining elements are features.

    Returns:
        LabeledPoint: The line is converted into a `LabeledPoint`, which consists of a label and
            features.
    """
    values = line.split(',')
    value1 = [map(float, i) for i in values]
    return LabeledPoint(value1[0], value1[1:])
Now, when I try to perform an action on this parsed RDD, I get a ValueError. The action I tried is as follows:
parse_train = join2_train.map(parsePoint)
parse_train.take(5)
The error message I get is as follows:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-63-f53b10964381> in <module>()
1 parse_train = join2_train.map(parsePoint)
2
----> 3 parse_train.take(5)
/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in take(self, num)
1222
1223 p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1224 res = self.context.runJob(self, takeUpToNumLeft, p, True)
1225
1226 items += res
/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
840 mappedRDD = rdd.mapPartitions(partitionFunc)
841 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions,
--> 842 allowLocal)
843 return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
844
/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539
540 for temp_arg in temp_args:
/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 31, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 101, in main
process()
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/serializers.py", line 236, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1220, in takeUpToNumLeft
yield next(iterator)
File "<ipython-input-62-0243c4dd1876>", line 18, in parsePoint
ValueError: could not convert string to float: .
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Question comments:
The part after the colon in the error (".") tells you that it is trying to convert something it does not recognize as a number ("."). Clean up your data, or stop passing bad data to that function.

@chicks It doesn't recognize the decimal point. If you delete all the data and keep only the first number from the sample above, "21.9059", the error still occurs. Any idea why?
Solution 1:
Add this function to check whether a string can be converted to a float:
def isfloat(string):
    try:
        float(string)
        return True
    except ValueError:
        return False
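On the fields from the sample rows above, the helper behaves like this (a quick illustration):

isfloat(u'21.9059')    # True  - the label parses as a float
isfloat(u'TA-00002')   # False - a categorical field
isfloat(u'7/7/2013')   # False - a date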
Then in parsePoint:
value1 = [map(float,i) for i in values if isfloat(i)]
By changing the float-conversion line as follows:
value1 = [float(i) for i in values]
and then parsing a string that contains only numeric values, we get correct LabeledPoints. The real problem, however, is trying to build LabeledPoint objects from strings in the join2_train object that cannot be converted to floats, such as TA-00002.
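Putting the two changes together, a minimal sketch of a corrected parsePoint might look like this (an illustration, not the only possible fix; it assumes that silently dropping non-numeric fields such as TA-00002 and 7/7/2013 is acceptable, even though that discards the categorical features entirely):

def parsePoint(line):
    values = line.split(',')
    # float(i) converts each whole field; the original map(float, i)
    # iterated over the characters of each field instead.
    value1 = [float(i) for i in values if isfloat(i)]
    return LabeledPoint(value1[0], value1[1:])

Applied to the first sample row above, this gives a label of 21.9059 and a feature vector built from the numeric fields only: [0, 0, 1, 6.35, 0.71, 137, 8, 19.05, 0, 0, 0].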
Comments:
Interestingly, this does not work for the value "21.9059". For some reason it doesn't recognize the decimal point. isfloat returns True, but the map still causes the error.

Do you mean when values = '21.9059.' or i = '21.9059'?

I mean when the csv file join2_train.csv (as in the OP's post) contains only the single row 21.9059, i.e. when values = [u'21.9059'].

I modified the second line of the parsePoint function to value1 = [float(i) for i in values]. After doing this and parsing a string with the values given here, templp1 = '21.9059,0,0,6.35,0.71,137,8,19.05', the correct float objects are computed. The real problem, however, arises when trying to convert a string like TA-00002, which is part of join2_train, to a float.
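What the comments above are circling around is how map treats a string argument: map(float, i) applies float to each character of i, not to i as a whole. A short demonstration (Python 2, matching the u'' strings in the question):

i = u'21.9059'
float(i)         # 21.9059 - converts the whole field, so isfloat(i) is True
map(float, i)    # applies float to '2', '1', '.', '9', ... in turn
# ValueError: could not convert string to float: .

This is why even a file containing only 21.9059 fails, and why the error message names "." as the offending string: float('.') is the first character-level conversion that cannot succeed.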