Nested list within a dataframe column, extracting the values of a list within a dataframe column (PySpark / Spark)
Posted: 2020-07-16 22:27:13

Question: I would like to transform the tags column in the match_event dataframe below.
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
|eventId| eventName| eventSec| id|matchId|matchPeriod|playerId| positions|subEventId| subEventName| tags|teamId|
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
| 8| Pass| 1.255989999999997|88178642|1694390| 1H| 26010|[[50, 48], [47, 50]]| 85| Simple pass| [[1801]]| 4418|
| 8| Pass|2.3519079999999803|88178643|1694390| 1H| 3682|[[47, 50], [41, 48]]| 85| Simple pass| [[1801]]| 4418|
| 8| Pass|3.2410280000000284|88178644|1694390| 1H| 31528|[[41, 48], [32, 35]]| 85| Simple pass| [[1801]]| 4418|
| 8| Pass| 6.033681000000001|88178645|1694390| 1H| 7855| [[32, 35], [89, 6]]| 83| High pass| [[1802]]| 4418|
| 1| Duel|13.143591000000015|88178646|1694390| 1H| 25437| [[89, 6], [85, 0]]| 12|Ground defending ...| [[702], [1801]]| 4418|
| 1| Duel|14.138041000000044|88178663|1694390| 1H| 83575|[[11, 94], [15, 1...| 11|Ground attacking ...| [[702], [1801]]| 11944|
| 3| Free Kick|27.053005999999982|88178648|1694390| 1H| 7915| [[85, 0], [93, 16]]| 36| Throw in| [[1802]]| 4418|
| 8| Pass| 28.97515999999996|88178667|1694390| 1H| 70090| [[7, 84], [9, 71]]| 82| Head pass| [[1401], [1802]]| 11944|
| 10| Shot| 31.22621700000002|88178649|1694390| 1H| 25437| [[91, 29], [0, 0]]| 100| Shot|[[402], [1401], [...| 4418|
| 9|Save attempt| 32.66416000000004|88178674|1694390| 1H| 83574|[[100, 100], [15,...| 91| Save attempt| [[1203], [1801]]| 11944|
+-------+------------+------------------+--------+-------+-----------+--------+--------------------+----------+--------------------+--------------------+------+
I want to extract the last item of each list in tags into a column like this:
+----+
|tags|
+----+
|1801|
|1801|
|1801|
|1802|
|1801|
|1801|
+----+
That column would then be reattached to the match_event dataframe, probably using withColumn. I tried the code below:

u = match_event[['tags']].rdd
t = u.map(lambda xs: [n for x in xs[-1:] for n in x[-1:]])
tag = spark.createDataFrame(t, ['tag'])

and got the output below. From here it is hard to go any further with withColumn:
+------+
| tag|
+------+
|[1801]|
|[1801]|
|[1801]|
|[1802]|
|[1801]|
|[1801]|
|[1802]|
|[1802]|
|[1801]|
|[1801]|
|[1801]|
|[1801]|
|[1302]|
|[1802]|
|[1801]|
|[1802]|
|[1801]|
|[1801]|
|[1801]|
|[1801]|
+------+
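The mapping logic above can be checked in plain Python before involving Spark at all. A minimal sketch (assuming each tags value is a list of one-element lists, as in the sample data) shows why the RDD approach yields the wrapped value [1801] rather than the scalar 1801, and how plain indexing unwraps it:

```python
# Each row's "tags" value is a list of one-element lists, e.g. [[702], [1801]].
def last_tag_list(xs):
    # The lambda used in the RDD map above: it keeps the *last inner list*,
    # so the result is still wrapped, e.g. [1801] instead of 1801.
    return [n for x in xs[-1:] for n in x[-1:]]

def last_tag_scalar(xs):
    # Indexing [-1][-1] unwraps both levels and yields the scalar.
    return xs[-1][-1]

print(last_tag_list([[702], [1801]]))   # [1801]
print(last_tag_scalar([[702], [1801]])) # 1801
```

The same double-indexing idea is what the element_at answer below expresses in Spark's column API.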
Please help. Thanks in advance.
Answer 1:

For Spark 2.4+, use element_at:

from pyspark.sql import functions as F

df.withColumn("lastItem", F.element_at("tags", -1)[0]).show()
#+---------------+--------+
#| tags|lastItem|
#+---------------+--------+
#|[[1], [2], [3]]| 3|
#|[[1], [2], [3]]| 3|
#+---------------+--------+
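For readers without a Spark shell handy, element_at("tags", -1)[0] amounts to the following indexing. This is only a rough plain-Python mimic for illustration (Spark's element_at uses 1-based indices, negative indices count from the end, and out-of-range access yields NULL):

```python
def element_at(arr, index):
    # Rough plain-Python mimic of Spark SQL's element_at semantics:
    # 1-based indexing; negative indices count from the end;
    # out-of-range access returns NULL (None here).
    if index == 0:
        raise ValueError("SQL array indices start at 1")
    if abs(index) > len(arr):
        return None
    return arr[index - 1] if index > 0 else arr[index]

tags = [[702], [1801]]
# element_at(tags, -1) -> [1801]; the trailing [0] (0-based, like
# Spark's getItem) then selects the scalar 1801.
print(element_at(tags, -1)[0])  # 1801
```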
Answer 2:

Try this:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

columns = ['eventId', 'eventName', 'eventSec', 'id', 'matchId', 'matchPeriod',
           'playerId', 'positions', 'subEventId', 'subEventName', 'tags', 'teamId']
vals = [(8, "Pass", 1.255989999999997, 88178642, 1694390, "1H", 26010,
         [[50, 48], [47, 50]], 85, "Simple pass", [[1801]], 4418),
        (1, "Duel", 13.143591000000015, 88178646, 1694390, "1H", 25437,
         [[89, 6], [85, 0]], 12, "Ground defending", [[702], [1801]], 4418)]

# Declare the return type explicitly; without it the UDF would
# default to StringType and stringify the list.
udf1 = udf(lambda xs: [n for x in xs[-1:] for n in x[-1:]], ArrayType(IntegerType()))

df = spark.createDataFrame(vals, columns)
df2 = df.withColumn('created_col', udf1('tags'))
df2.show()