Concatenating multiple rows in PySpark
【Title】: Concatenating multiple rows in PySpark 【Posted】: 2018-07-07 12:03:00 【Question】: I need to merge the following data into a single row:
vector_no_stopw_df.select("filtered").show(3, truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[, problem, population] |
|[tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries] |
|[many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
so that it ends up looking like this:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[, problem, population,tyler, notes, global, population, increase, sharply, next, century, , almost, growth, occurring, relatively, underdeveloped, africa, south, asia, , contrast, , population, actually, decline, countries,many, economists, uncomfortable, population, issues, , perhaps, arent, covered, depth, standard, graduate, curriculum, , touch, topics, may, culturally, controversial, even, politically, incorrect, thats, unfortunate, future]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I know this should be trivial, but I can't find a solution. I tried concat_ws and it still doesn't work. Running

vector_no_stopw_df.select(concat_ws(',', vector_no_stopw_df.filtered)).collect()

produces the following:
[Row(concat_ws(,, filtered)='one,big,advantages,economist,long,time,council,economic,advisers,,years,ago,ive,gotten,know,follow,lot,people,thinking,,started,cea,august,,finished,july,,,first,academic,year,,fellow,senior,economists,paul,krugman,,lawrence,summers'),
Row(concat_ws(,, filtered)='isnt,going,happen,anytime,soon,meantime,,tax,system,puts,place,much,higher,marginal,rates,people,acknowledge,people,keep,focusing,federal,income,taxes,alone,,marginal,rates,top,around,,percent,leaves,state'),
Row(concat_ws(,, filtered)=',,
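The output above is expected: concat_ws joins the array elements within each row into a single string, but it never combines values across rows. A minimal sketch of that per-row behaviour (column name taken from the question, output shape assumed):

```python
from pyspark.sql import functions as F

# Joins each row's array into one comma-separated string --
# the result still has one output row per input row.
vector_no_stopw_df.select(
    F.concat_ws(",", "filtered").alias("joined_per_row")
).show(3, truncate=False)
```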
Here is the solution, in case anyone else needs it: I went ahead and used Python's itertools library.
# Collect the filtered column to the driver as a list of Rows
vector_no_stopw_df_count = vector_no_stopw_df.select("filtered").collect()
vector_no_stopw_df_count[0].filtered  # inspect the first row's array

# Pull the array out of each Row
vector_no_stopw_list = [i.filtered for i in vector_no_stopw_df_count]
Flatten the list:
from itertools import chain

# Chain all per-row lists into one flat list
flattenlist = list(chain.from_iterable(vector_no_stopw_list))
flattenlist[:20]
Result:
['',
'problem',
'population',
'tyler',
'notes',
'global',
'population',
'increase',
'sharply',
'next',
'century',
'',
'almost',
'growth',
'occurring',
'relatively',
'underdeveloped',
'africa',
'south',
'asia']
【Comments】:
What is the condition for grouping the rows? How do you know which 3 rows need to be merged into one?
I added a rough Pythonic approach to solve this. I wanted to do it in PySpark but couldn't, so I went ahead with itertools. See above.
Actually, I don't want to merge particular rows with particular other rows. I just want to produce one flattened list.
【Answer 1】: In a sense, you are looking for the opposite of explode. You can use collect_list for this:
from pyspark.sql import functions as F
df.groupBy(<somecol>).agg(F.collect_list('filtered').alias('aggregated_filters'))
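A minimal end-to-end sketch of that idea, assuming every row should be merged into a single one (the constant grouping key and the final flatten are assumptions not in the original answer; flatten is needed because collect_list over an array column yields an array of arrays, and it requires Spark 2.4+):

```python
from pyspark.sql import functions as F

merged_df = (
    vector_no_stopw_df
    .groupBy(F.lit(1).alias("grp"))  # one constant group: merge all rows
    .agg(F.flatten(F.collect_list("filtered")).alias("filtered"))
    .drop("grp")
)
merged_df.show(truncate=False)
```

When there is no real grouping key, a bare agg(...) over the whole DataFrame does the same thing without the dummy column.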