pyspark concatenate multiple csv files in one

Posted 2023-04-15


【Asked】2021-05-17 11:33:33 【Question】:

I need to call the function concat(Path trg, Path[] psrcs) from org.apache.hadoop.fs (a method of FileSystem) from pyspark.

My code is:

orig1_fs = spark._jvm.org.apache.hadoop.fs.Path(f'tmp_pathfilename1')
orig2_fs = spark._jvm.org.apache.hadoop.fs.Path(f'tmp_pathfilename2')
dest_fs = spark._jvm.org.apache.hadoop.fs.Path(dest_path)    
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.concat(dest_fs, list((orig1_fs , orig2_fs)))

But I get an error.

How can I use this function?


【Answer 1】:

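This happens because the second parameter of the concat method is a Java array (Path[]), not an ArrayList. A Python list passed through py4j arrives on the JVM side as an ArrayList, so the call does not match the method signature. Build a real Java array first: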

# transform from `ArrayList<Path>` to `Path[]`
py_paths = [orig1_fs, orig2_fs]
# the py4j gateway is reached through the SparkContext of the existing session
gateway = spark.sparkContext._gateway
java_paths = gateway.new_array(spark._jvm.org.apache.hadoop.fs.Path, len(py_paths))
for i, path in enumerate(py_paths):
    java_paths[i] = path

# you can use the new array now
fs.concat(dest_fs, java_paths)
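
The conversion above can be wrapped in a small reusable helper. What follows is only a sketch under some assumptions: `to_java_array` and `concat_hdfs_files` are hypothetical names (they are not part of PySpark or Hadoop), `spark` is assumed to be an active SparkSession, and the underlying file system is assumed to implement `concat` (HDFS does; other implementations may raise an unsupported-operation error).

```python
def to_java_array(gateway, java_class, items):
    """Copy a Python sequence into a newly allocated Java array.

    `gateway` is assumed to be a py4j JavaGateway, for example
    spark.sparkContext._gateway; `java_class` is the element type.
    """
    java_array = gateway.new_array(java_class, len(items))
    for i, item in enumerate(items):
        java_array[i] = item
    return java_array


def concat_hdfs_files(spark, dest_path, src_paths):
    """Hypothetical helper: call FileSystem.concat(dest, Path[] sources).

    Assumes `spark` is an active SparkSession and that the configured
    file system supports concat.
    """
    jvm = spark._jvm
    Path = jvm.org.apache.hadoop.fs.Path
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    # Build a real Path[] instead of passing a Python list (which py4j
    # would hand over as an ArrayList).
    sources = to_java_array(
        spark.sparkContext._gateway, Path, [Path(p) for p in src_paths]
    )
    fs.concat(Path(dest_path), sources)
```

With a session available, a call would look like `concat_hdfs_files(spark, dest_path, [path1, path2])`, where the path variables are placeholders for your own locations. Keep in mind that `concat` joins raw file contents, so if every CSV file starts with a header row, those header rows would also appear inside the combined file.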
