Fit multiple numeric columns into a spark-ml model with PySpark
【Title】: Fit multiple numeric columns into a spark-ml model with PySpark 【Posted】: 2017-06-02 13:50:18 【Question】: I am working with Spark 1.6.2 and I have a DataFrame with 102 columns:
f0, f1,....,f101
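f0 contains the index, f101 contains the label, and the other columns are numeric features (floats).
I want to train a random forest model (spark-ml) on this DataFrame, so I use VectorAssembler to produce a single features column to fit the model: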
from pyspark.ml.feature import VectorAssembler
ignore = ['f0', 'f101']
assembler = VectorAssembler(inputCols=[x for x in df.columns if x not in ignore], outputCol='features')
assembler.transform(df)
df.show()
But this does not work; it raises the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o255.transform.
: org.apache.spark.SparkException: VectorAssembler does not support the StringType type
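Is there another way to feed these multiple columns into the model?
Here are the first two rows of my DataFrame (note that all of my columns are of string type, which is probably the cause of the problem):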
| f0| f1| f2| f3| f4| f5| f6| f7| f8| f9| f10| f11| f12| f13| f14| f15| f16| f17| f18| f19| f20| f21| f22| f23| f24| f25| f26| f27| f28| f29| f30| f31| f32| f33| f34| f35| f36| f37| f38| f39| f40| f41| f42| f43| f44| f45| f46| f47| f48| f49| f50| f51| f52| f53| f54| f55| f56| f57| f58| f59| f60| f61| f62| f63| f64| f65| f66| f67| f68| f69| f70| f71| f72| f73| f74| f75| f76| f77| f78| f79| f80| f81| f82| f83| f84| f85| f86| f87| f88| f89| f90| f91| f92| f93| f94| f95| f96| f97| f98| f99| f100|f101|
+---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------+-------------------+-------------------+--------------------+-------------------+-------------------+--------------------+------------------+------------------+-------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+--------------------+-------------------+-------------------+--------------------+-------------------+--------------------+-------------------+-------------------+-------------------+------------------+-------------------+--------------------+------------------+-------------------+--------------------+-------------------+-------------------+--------------------+-------------------+-------------------+--------------------+--------------------+-------------------+-------------------+-------------------+-------------------+--------------------+------------------+------------------+-------------------+--------------------+-------------------+-------------------+------------------+--------------------+-------------------+------------------+-------------------+--------------------+-------------------+--------------------+-------------------+-------------------+-----------------+------------------+-------------------+------------------+-------------------+-------------------+------------------+-------------------+------------------+--------------------+--------------------+-------------------+--------------------+-----------------+-------------------+-------------------+-------------------+-------------------+--------------------+--------------------+-------------------+-------------------+------------------+-------------------+-------------------+-------------------+-------------------+-----------------+-------------------+------------------+------------------+--------------------+------------------+-------------------+-------------------+--------------
-----+----+
| 0|-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796|-0.22044000029563904|-0.6735600233078003| 0.2453099936246872|0.031005999073386192| -0.831250011920929| 0.9731900095939636|-0.04734800010919571|2.0506999492645264|0.6324499845504761| 1.0824999809265137|0.46728000044822693| 0.7816500067710876|0.011575000360608101| 0.3381200134754181|-0.2861100137233734| 3.037100076675415|-0.36792999505996704| 0.8862199783325195|-0.8241199851036072|-0.47086000442504883|-0.6407700181007385| -0.3201499879360199| 0.7545999884605408| 2.753200054168701|0.17207999527454376|-0.676639974117279|-0.8336099982261658|-0.41405001282691956|0.7059500217437744|0.37801000475883484| 0.15550999343395233|-1.0931999683380127|0.10803999751806259|-0.23667000234127045| 0.6708999872207642| 0.3448899984359741|-0.11162000149488449| 0.9600099921226501| -0.899370014667511|0.09950699657201767| -1.065000057220459|-1.3912999629974365|-0.16773000359535217|1.2430000305175781|-2.471100091934204| 1.8344999551773071| 0.6032400131225586|-0.6902700066566467|0.09102000296115875|1.7200000286102295|-0.24295000731945038|-1.8884999752044678|0.1710599958896637|-1.1556999683380127| -2.4221999645233154|-0.7604399919509888|0.014763999730348587| 0.6575700044631958|-0.5731899738311768|1.170199990272522|1.8212000131607056|0.14872999489307404|-1.582800030708313|-0.4311999976634979| -0.756820023059845|-2.511399984359741|-2.4605000019073486| 1.469599962234497|-0.49924999475479126| 2.031399965286255|-0.4928399920463562|-0.20021000504493713|0.685479998588562| -1.482100009918213|-1.6536999940872192|0.08350799977779388| 1.2898000478744507| -2.196000099182129|-0.06448200345039368|-0.5987200140953064| 0.1709499955177307|0.8191999793052673| 0.856190025806427| 0.5832300186157227| -1.926300048828125|-0.7517899870872498|2.174499988555908| 2.433000087738037|1.6503000259399414|0.5555099844932556| -1.583899974822998|1.7556999921798706| 0.3153800070285797|-0.1724800020456314|-0.6098300218582153| 361|
| 1| 0.6452699899673462| 0.528219997882843|-0.5653899908065796|-0.4328500032424927|0.9352899789810181|-0.11873000115156174|-0.4033699929714203|0.44887998700141907| -0.3801800012588501|-1.6754000186920166|-0.4689599871635437| 0.09016799926757812|1.5816999673843384|1.4657000303268433|0.11236999928951263|0.05620399862527847|-0.00242649996653...| 1.4306999444961548|0.05022599920630455| 0.71288001537323|1.7551000118255615|-0.30507999658584595|0.40630999207496643| 1.1753000020980835| 0.4212299883365631| -2.208199977874756|-0.18940000236034393|0.21938000619411469|-0.5088800191879272|-0.5000600218772888|0.2771399915218353| 1.0090999603271484| 0.08775299787521362|0.7567399740219116| 0.4211699962615967|-0.25742998719215393|-0.6665199995040894| -0.265639990568161| 0.5249500274658203|-0.5251700282096863|-0.5188699960708618| 0.2909500002861023|-0.49011000990867615|-0.1070299968123436| 1.2991000413894653|-1.2252000570297241|-0.5937600135803223|-0.09345000237226486|1.1332999467849731|-2.444999933242798|-1.9296000003814697|-0.15282000601291656| 0.5004400014877319|-0.3229599893093109|0.5092599987983704| 0.4438900053501129|-1.2383999824523926|0.9989299774169922|-0.6500200033187866|-0.46276000142097473|0.28137001395225525| -0.270440012216568|-1.3233000040054321| 0.4525200128555298|2.731100082397461|1.8000999689102173|-0.1950400024652481|-0.748520016670227| 0.5018399953842163|-0.6080600023269653|-1.093500018119812|-1.7791999578475952|1.1186000108718872| 1.15339994430542|-0.10273999720811844|-1.9773999452590942| 0.23173999786376953|0.604610025882721|-1.1047999858856201|-1.8122999668121338|-1.0922000408172607|0.14993999898433685|-0.23330999910831451| 0.4197700023651123|-0.5616300106048584|-1.2773000001907349|1.0683000087738037|-0.3670499920845032|0.25751999020576477|-1.1461000442504883| 0.0685959979891777|2.424999952316284|-0.2257699966430664|0.8041399717330933|0.7866700291633606|-0.45813000202178955| 1.329200029373169|0.10018999874591827| -1.253499984741211|0.01594099961221218|1047|
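【Comments】:
Have you tried assigning the result of the list comprehension outside of VectorAssembler() and then passing it as an argument?
Yes, I got the same error. I also tried passing the list ['f1', 'f2'] and the same error occurred.
Can you add the schema of your DataFrame?
@eliasah Done :)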
【Answer 1】:
We will use the parse_ UDF that we defined here, together with concat_ws:
from pyspark.sql import functions as F  # needed for F.concat, F.concat_ws and F.lit
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT
# Build a small all-string DataFrame that mimics the one in the question
rdd = sc.parallelize(['0|-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796','1|0.6452699899673462|0.528219997882843|-0.5653899908065796|-0.4328500032424927|0.9352899789810181']).map(lambda x : x.split('|'))
df = sqlContext.createDataFrame(rdd, ['f1','f2','f3','f4','f5','f6'])
ignore = ['f1','f4'] # columns to ignore
keep = [x for x in df.columns if x not in ignore] # columns to keep
# UDF that parses a string such as "[1.0,2.0]" into a vector
parse_ = udf(Vectors.parse, VectorUDT())
# Join the kept columns into "[v1,v2,...]" and parse that string into a vector
parsed = df.withColumn("features", F.concat(F.lit('['), F.concat_ws(",", *keep), F.lit(']'))). \
    withColumn("features", parse_("features"))
parsed.show(truncate=False)
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |f1 |f2 |f3 |f4 |f5 |f6 |features |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
# |0 |-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796|[-0.38672998547554016,-1.5183000564575195,1.2288000583648682,0.7216399908065796]|
# |1 |0.6452699899673462 |0.528219997882843 |-0.5653899908065796|-0.4328500032424927|0.9352899789810181|[0.6452699899673462,0.528219997882843,-0.4328500032424927,0.9352899789810181] |
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+
This should work. I just used a smaller example than yours.
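As an alternative sketch that is not part of the original answer: since the error says VectorAssembler does not support StringType, another option may be to cast the feature columns to double first and then assemble them. This assumes every string value is a valid number. The helper names feature_columns and assemble_numeric_features are hypothetical and chosen only for illustration.

```python
def feature_columns(all_columns, ignore):
    """Return the column names to use as features, preserving their order."""
    return [c for c in all_columns if c not in ignore]


def assemble_numeric_features(df, ignore, output_col="features"):
    """Cast string feature columns to double, then assemble them into one vector column.

    Hypothetical helper: assumes `df` is a Spark DataFrame whose feature
    columns hold numeric values stored as strings.
    """
    # Imported inside the function so that feature_columns() can be used
    # without a Spark installation.
    from pyspark.sql.functions import col
    from pyspark.ml.feature import VectorAssembler

    features = feature_columns(df.columns, ignore)
    # Cast only the feature columns; leave the ignored columns untouched.
    casted = df.select(
        [col(c).cast("double").alias(c) if c in features else col(c)
         for c in df.columns]
    )
    assembler = VectorAssembler(inputCols=features, outputCol=output_col)
    # transform() returns a new DataFrame, so the result must be kept.
    return assembler.transform(casted)
```

With the DataFrame from the question, a call would look like assembled = assemble_numeric_features(df, ['f0', 'f101']). Note that transform() does not modify df in place, so calling df.show() afterwards, as in the question, would not display the features column. The label column f101 is also a string and would likely need its own cast or indexing before training.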