Combining Spark Data Frames with Different Number of Columns

Posted 2023-04-18

【Title】: Combining Spark Data Frames with Different Number of Columns 【Posted】: 2021-06-27 22:56:29 【Question】:

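In this question, I asked how to combine PySpark data frames that have different numbers of columns. The answer I was given requires every data frame to have the same number of columns before they can all be combined: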

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder\
    .appName("DynamicFrame")\
    .getOrCreate()

df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
df02 = spark.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
df03 = spark.createDataFrame([(111, 112), (110, 115)], ("C1", "C4"))

dataframes = [df01, df02, df03]

# Create a list of all the column names and sort them
cols = set()
for df in dataframes:
    for x in df.columns:
        cols.add(x)
cols = sorted(cols)

# Create a dictionary with all the dataframes
dfs = {}
for i, d in enumerate(dataframes):
    new_name = 'df' + str(i)  # New name for the key, the dataframe is the value
    dfs[new_name] = d
    # Loop through all column names. Add the missing columns to the dataframe (with value 0)
    for x in cols:
        if x not in d.columns:
            dfs[new_name] = dfs[new_name].withColumn(x, lit(0))
    dfs[new_name] = dfs[new_name].select(cols)  # Use 'select' to get the columns sorted

# Now put it all together with a loop (union)
result = dfs['df0']            # Take the first dataframe, add the others to it
dfs_to_add = list(dfs.keys())  # Names of all the dataframes in the dictionary (a list, so 'remove' works in Python 3)
dfs_to_add.remove('df0')       # Remove the first one, because it is already in the result
for x in dfs_to_add:
    result = result.union(dfs[x])
result.show()

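Is there a way to combine PySpark data frames without first making sure they all have the same number of columns? I ask because merging 100 data frames takes about 2 days, and the process times out when using the code above.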

【Comments】:

If you are using Spark 3.1, use unionByName with allowMissingColumns set to True.

【Answer 1】:

df = df1.unionByName(df2, allowMissingColumns=True)
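The one-liner above joins two data frames. For a whole list of them, the same call can be folded over the list. The following is only a minimal sketch under stated assumptions: it assumes Spark 3.1 or later, and the names `union_all` and `demo` are hypothetical helpers introduced here for illustration, not part of PySpark or of the original answer.

```python
from functools import reduce


def union_all(frames):
    """Fold a non-empty list of DataFrames into one.

    Hypothetical helper: relies on DataFrame.unionByName with
    allowMissingColumns=True, which fills columns that are absent
    from one side with nulls (assumes Spark 3.1 or later).
    """
    return reduce(
        lambda left, right: left.unionByName(right, allowMissingColumns=True),
        frames,
    )


def demo():
    """Rebuild the three sample frames from the question and combine them."""
    # Imported here so the helper above can be loaded without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("UnionByNameSketch").getOrCreate()
    df01 = spark.createDataFrame([(1, 2, 3), (9, 5, 6)], ("C1", "C2", "C3"))
    df02 = spark.createDataFrame([(11, 12, 13), (10, 15, 16)], ("C2", "C3", "C4"))
    df03 = spark.createDataFrame([(111, 112), (110, 115)], ("C1", "C4"))
    union_all([df01, df02, df03]).show()
```

Calling `demo()` in a Spark 3.1+ environment should print the combined frame. Note that missing values come back as nulls here, not as the 0 used in the question's code.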

【Discussion】:

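Does this only work with Spark 3.1?

Yes. On older versions, union works, but you have to make sure all the data frames you combine have the same column names. Use withColumn to add null values wherever a column does not exist.

So I would have to do something similar to the answer to this question: ***.com/questions/53165816/…

Is there an easier way? Could you add it to the answer for older versions of Spark?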
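For Spark versions before 3.1, the withColumn approach mentioned in the comments could look roughly like the sketch below. It is an illustration, not the commenter's code: `all_columns`, `align` and `union_aligned` are hypothetical names, and it assumes that a column appearing in several data frames has the same type in each of them (if not, the type seen last is used for the null fill).

```python
from functools import reduce


def all_columns(column_lists):
    """Return the sorted union of all column names."""
    return sorted({name for cols in column_lists for name in cols})


def align(df, cols, types):
    """Add every missing column as a typed null, then order the columns."""
    # Imported here so all_columns can be used without a Spark install.
    from pyspark.sql.functions import lit

    for name in cols:
        if name not in df.columns:
            df = df.withColumn(name, lit(None).cast(types[name]))
    return df.select(cols)


def union_aligned(frames):
    """Combine a non-empty list of DataFrames with plain positional union."""
    cols = all_columns([df.columns for df in frames])
    # Assumption: a column shared by several frames has one consistent type.
    types = {f.name: f.dataType for df in frames for f in df.schema.fields}
    return reduce(lambda a, b: a.union(b), [align(df, cols, types) for df in frames])
```

With the question's sample frames, `union_aligned([df01, df02, df03]).show()` would be the call to try. Because union matches columns by position, the `select(cols)` step that puts every frame in the same column order is what keeps the values lined up.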
