PySpark SQL: split a DataFrame row's record [duplicate]

[Posted]: 2020-05-04 10:14:51

[Question]:

I am using Spark 2.3.1.

I have a Spark SQL DataFrame like this:

|bigrams|
+-------+
|[a control, control circuit, circuit utilizes, utilizes patient, patient information, information and, and treatment-platform, treatment-platform information, information to, to optimize, optimize a, a radiation-treatment, radiation-treatment plan, plan by, by permitting, permitting isocenters, isocenters of, of various, various radiation-treatment, radiation-treatment fields, fields as, as comprise, comprise parts, parts of, of a, a same, same treatment, treatment plan, plan to, to not, not be, be coincidental, coincidental with, with one, one another, another to, to thereby, thereby yield, yield an, an optimized, optimized treatment, treatment plan., plan. the, the patient, patient information, information can, can pertain, pertain to, to one, one or, or more, more physical, physical aspects, aspects of, of the, the patient, patient as, as desired., desired. by, by one, one approach,, approach, the, the foregoing, foregoing can, can comprise, comprise scattering, scattering the, the isocenters, isocenters of, of the, the various, various radiation-treatment, radiation-treatment fields, fields around, around a, a predetermined, predetermined point, point (such, (such as,, as, for, for example,, example, the, the center, center of, of the, the treatment, treatment volume, volume and/or, and/or some, some or, or all, all of, of the, the beams)., beams). this, this approach, approach can, can comprise, comprise causing, causing an, an area, area of, of highest, highest energy, energy flux, flux for, for a, a given, given field, field to, to be, be non-coincident, non-coincident for, for at, at least, least some, some of, of the, the radiation-treatment, radiation-treatment fields, fields as, as are, are specified, specified by, by the, the radiation-treatment, radiation-treatment plan.]|
+-------+

I want to split this DataFrame row's record on the comma (,) and store each resulting element in its own row, like this:

|bigrams             |
+--------------------+
|a control           |
|control circuit     |
|circuit utilizes    |
....
+--------------------+


[Answer 1]:

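Just use the explode function, which produces one output row per array element. In PySpark, pass the column name or col("bigrams"); the $"bigrams" form is Scala-only:

from pyspark.sql.functions import explode
df.withColumn("exploded", explode("bigrams")).select("exploded")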

