How to convert sequential data separated by | with no newline characters into rows and columns in PySpark

Posted 2023-03-28


【Posted】: 2021-07-25 08:21:45 【Question】:

I am trying to read a text file that contains sequential data separated by |, as shown below.

1|Ariyalur|1|2|3|2|Coimbatore|4|6|1.12|3|Cuddalore|8|3|7

I want Spark to read it as a DataFrame and replace every 5th delimiter (i.e. |) with \n. The output should be:

Serial_Number|District|Area|Production|profit
1|Ariyalur|1|2|3
2|Coimbatore|4|6|1.12
3|Cuddalore|8|3|7

Using the replace function replaces every |. How can I replace only every 5th occurrence of |?


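【Answer 1】:

After trying a few different approaches, I think applying a regex pattern is the best way to split the data. It is a bit tricky, but it works.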

from pyspark.sql import functions as F

# `df` is assumed to hold the raw text in a single string column named `seq`
split_token = '~' # token that marks record boundaries; pick one that never occurs in your data

(df
    .withColumn('tmp', F.regexp_replace(F.col('seq'), r'(([^\|]+)\|){4}[^\|]+', f'$0{split_token}')) # Add split token after every 5th field  =====>   1|Ariyalur|1|2|3~|2|Coimbatore|4|6|1.12~|3|Cuddalore|8|3|7~
    .withColumn('tmp', F.explode(F.split(F.col('tmp'), rf'{split_token}\|?')))                        # Explode by split token                 =====>   [ 1|Ariyalur|1|2|3, 2|Coimbatore|4|6|1.12, 3|Cuddalore|8|3|7, ]
    .withColumn('tmp', F.split(F.col('tmp'), r'\|'))                                                  # Split by pipe (|)                      =====>   [ 1, Ariyalur, 1, 2, 3 ]
    .withColumn('serial_number', F.col('tmp')[0]) # extract `serial_number`
    .withColumn('district', F.col('tmp')[1])      # extract `district`
    .withColumn('area', F.col('tmp')[2])          # extract `area`
    .withColumn('production', F.col('tmp')[3])    # extract `production`
    .withColumn('profit', F.col('tmp')[4])        # extract `profit`
    .drop('seq', 'tmp')                           # clean up helper columns
    .where(F.col('profit').isNotNull())           # drop the empty row left by the trailing split token
    .show(10, False)
)

# +-------------+----------+----+----------+------+
# |serial_number|district  |area|production|profit|
# +-------------+----------+----+----------+------+
# |1            |Ariyalur  |1   |2         |3     |
# |2            |Coimbatore|4   |6         |1.12  |
# |3            |Cuddalore |8   |3         |7     |
# +-------------+----------+----+----------+------+
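
To check what the regex does without starting a Spark session, the same three steps can be sketched with Python's standard `re` module. This is only an illustration under assumptions: the names `seq`, `record_pattern`, `marked`, and `rows` are made up for this sketch, and `\g<0>` is used because it is Python's counterpart to the `$0` whole-match reference in Spark's `regexp_replace`.

```python
import re

# Sample line from the question; `~` is assumed never to occur in the data.
seq = "1|Ariyalur|1|2|3|2|Coimbatore|4|6|1.12|3|Cuddalore|8|3|7"
split_token = "~"

# Four "field|" groups followed by a fifth field: one complete record.
record_pattern = r"(?:[^|]+\|){4}[^|]+"

# Step 1: append the token after each complete record.
# \g<0> stands for the whole match (the role $0 plays in Spark).
marked = re.sub(record_pattern, rf"\g<0>{split_token}", seq)

# Step 2: split on the token plus the optional pipe that follows it,
# dropping the empty string left after the final token.
records = [r for r in re.split(re.escape(split_token) + r"\|?", marked) if r]

# Step 3: split each record into its five fields.
rows = [record.split("|") for record in records]

for row in rows:
    print(row)
```

Each printed list corresponds to one row of the table above. If the input length is not a multiple of five fields, the leftover fields end up in a shorter final list, which is the case the `isNotNull` filter in the Spark version discards.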
