Fit multiple numeric columns into a spark-ml model with PySpark

Posted: 2017-06-02 13:50:18

Problem description:

I'm working on Spark 1.6.2, and I have a DataFrame with 102 columns:

f0, f1,....,f101

f0 contains the index, f101 contains the label, and the remaining columns are numeric features (floats).

I want to train a random forest model (spark-ml) on this DataFrame.

So I used a VectorAssembler to produce a single features column to fit the model:

from pyspark.ml.feature import VectorAssembler

ignore = ['f0', 'f101']  # index and label columns
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')

output = assembler.transform(df)
output.show()

But it didn't work; it raises the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o255.transform.
: org.apache.spark.SparkException: VectorAssembler does not support the StringType type

Is there another way to fit these multiple columns into the model?

Here are the first two rows of my DataFrame (note that all of my columns are of string type, which is probably the cause of the problem):

+---+--------------------+-------------------+---+-------------------+----+
| f0|                  f1|                 f2|...|               f100|f101|
+---+--------------------+-------------------+---+-------------------+----+
|  0|-0.38672998547554016|-1.5183000564575195|...|-0.6098300218582153| 361|
|  1|  0.6452699899673462|  0.528219997882843|...|0.01594099961221218|1047|
+---+--------------------+-------------------+---+-------------------+----+

(Columns f3 through f99 are omitted here for readability.)

Comments:

Have you tried assigning the result of the list comprehension outside of VectorAssembler() and then passing it in as an arg? — Yes, I ran into the same error; I also tried passing the list ['f1', 'f2'] directly and the same error occurred. — Can you add the schema of your DataFrame? — @eliasah Done :)

Answer 1:

We'll use the parse_udf we defined here, together with concat_ws:

import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT

rdd = sc.parallelize(['0|-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796',
                      '1|0.6452699899673462|0.528219997882843|-0.5653899908065796|-0.4328500032424927|0.9352899789810181']) \
        .map(lambda x: x.split('|'))

df = sqlContext.createDataFrame(rdd, ['f1', 'f2', 'f3', 'f4', 'f5', 'f6'])

ignore = ['f1', 'f4']  # columns to ignore
keep = [x for x in df.columns if x not in ignore]  # columns to keep

# Build a "[v1,v2,...]" string from the kept columns, then parse it into a Vector
parse_ = udf(Vectors.parse, VectorUDT())
parsed = df.withColumn("features", F.concat(F.lit('['), F.concat_ws(",", *keep), F.lit(']'))) \
           .withColumn("features", parse_("features"))

parsed.show(truncate=False)
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |f1 |f2                  |f3                 |f4                 |f5                 |f6                |features                                                                        |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |0  |-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796|[-0.38672998547554016,-1.5183000564575195,1.2288000583648682,0.7216399908065796]|
# |1  |0.6452699899673462  |0.528219997882843  |-0.5653899908065796|-0.4328500032424927|0.9352899789810181|[0.6452699899673462,0.528219997882843,-0.4328500032424927,0.9352899789810181]   |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+

This should work. I just used a smaller example than yours.

Comments:
