pyspark error when creating df from RDD: TypeError: Can not infer schema for type: <type 'float'>
Posted: 2016-09-28 22:37:06

Problem description: I am converting my RDD to a DataFrame with the following code:
time_df = time_rdd.toDF(['my_time'])
and get the following error:
TypeErrorTraceback (most recent call last)
<ipython-input-40-ab9e3025f679> in <module>()
----> 1 time_df = time_rdd.toDF(['my_time'])
/usr/local/spark-latest/python/pyspark/sql/session.py in toDF(self, schema, sampleRatio)
55 [Row(name=u'Alice', age=1)]
56 """
---> 57 return sparkSession.createDataFrame(self, schema, sampleRatio)
58
59 RDD.toDF = toDF
/usr/local/spark-latest/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio)
518
519 if isinstance(data, RDD):
--> 520 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
521 else:
522 rdd, schema = self._createFromLocal(map(prepare, data), schema)
/usr/local/spark-latest/python/pyspark/sql/session.py in _createFromRDD(self, rdd, schema, samplingRatio)
358 """
359 if schema is None or isinstance(schema, (list, tuple)):
--> 360 struct = self._inferSchema(rdd, samplingRatio)
361 converter = _create_converter(struct)
362 rdd = rdd.map(converter)
/usr/local/spark-latest/python/pyspark/sql/session.py in _inferSchema(self, rdd, samplingRatio)
338
339 if samplingRatio is None:
--> 340 schema = _infer_schema(first)
341 if _has_nulltype(schema):
342 for row in rdd.take(100)[1:]:
/usr/local/spark-latest/python/pyspark/sql/types.py in _infer_schema(row)
987
988 else:
--> 989 raise TypeError("Can not infer schema for type: %s" % type(row))
990
991 fields = [StructField(k, _infer_type(v), True) for k, v in items]
TypeError: Can not infer schema for type: <type 'float'>
Does anyone know what I'm missing? Thanks!
Answer 1: You should wrap each float in a tuple, for example:
time_rdd.map(lambda x: (x, )).toDF(['my_time'])
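For context, here is a minimal self-contained sketch of the same fix (the float values and the SparkSession setup are assumptions, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType

spark = SparkSession.builder.getOrCreate()

# An RDD of bare floats -- schema inference needs Row/tuple/dict records, not scalars
time_rdd = spark.sparkContext.parallelize([1.5, 2.25, 3.75])  # hypothetical values

# Wrap each float in a one-element tuple so every record becomes a one-column row
time_df = time_rdd.map(lambda x: (x,)).toDF(['my_time'])
time_df.show()

# Alternative: skip inference entirely and pass an explicit schema
schema = StructType([StructField('my_time', DoubleType(), True)])
time_df2 = spark.createDataFrame(time_rdd.map(lambda x: (x,)), schema)

Either way avoids the error, because the schema machinery now sees structured records rather than plain floats.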
Answer 2: Check whether your time_rdd is actually an RDD. What do you get from:
>>>type(time_rdd)
>>>dir(time_rdd)
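If type(time_rdd) does report an RDD, it also helps to peek at a few elements, since in this case the bare floats are the real problem. A quick check along those lines (the printed values are illustrative, not from the question):

print(type(time_rdd))    # e.g. <class 'pyspark.rdd.PipelinedRDD'>
print(time_rdd.take(3))  # e.g. [1.5, 2.25, 3.75] -- scalars, not tuples or Rows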