Dataframe join on multiple columns with some conditions on columns in pyspark [duplicate]
【Posted】2018-05-25 11:15:43
【Question】

df = sqlContext.sql("select d1.a, d1.b, d1.c as aaa, d2.d, d2.e, d2.f, d2.g, d2.h, d2.i, d2.j as length, '1' as month_end from df1 d1 join df2 d2 on concat(substr(upper(trim(d1.a)),0,d1.j),' ') = substr(upper(trim(d2.j)),0,(d2.j+1)) and upper(trim(d1.c)) = upper(trim(d2.f)) where length(upper(trim(d2.i))) > d2.j and length(upper(trim(d1.a))) = (d1.j+3)".format(dataBase, month_end))
Can anyone help me convert the join above into a DataFrame join instead of a SQL join?
What I tried:
joinDf = df1.join(df2, on=[
    (concat(substring(upper(trim(df1["a"])), 0, df1["j"]), ' '))
        == substring(upper(trim(df2["j"])), 0, (df2["j"] + 1))
    and upper(trim(df1["c"])) == upper(trim(df2["f"]))
])
(without the select part)
This throws an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/pyspark/sql/functions.py", line 1180, in substring
return Column(sc._jvm.functions.substring(_to_java_column(str), pos, len))
File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 798, in __call__
File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 785, in _get_args
File "/opt/cloudera/parcels/CDH-5.10.2-1.cdh5.10.2.p2667.3017/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_collections.py", line 512, in convert
TypeError: 'Column' object is not callable
【Question comments】:
【Answer 1】You cannot take a function written for a plain Python type (such as string) and apply it to a Column. (substring, upper, trim, etc. need to be replaced.)
You need to implement your own UDF or use the functions from the pyspark.sql.functions module:
http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions
【Answer comments】:
You are right, but I suspect that in this case the OP did from pyspark.sql.functions import * - so those probably aren't the plain string functions. You also cannot use and on Columns; you need to use the bitwise operators.