Can I change the nullability of a column in my Spark dataframe?

Posted 2023-04-15

【Asked】: 2017-09-06 10:06:27

【Question】:

I have a StructField in a dataframe that is not nullable. A simple example:

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields

[StructField(name,StringType,true), StructField(age,LongType,true), StructField(foo,BooleanType,false)]

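Notice that the field foo is not nullable. The problem is that (for reasons I won't go into) I want it to be nullable. I found the post "Change nullable property of column in spark dataframe", which suggests a way of doing it, so I adapted the code there to this: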

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),False)]
df2 = sqlContext.createDataFrame(df.rdd, newSchema)

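That fails with:

TypeError: StructField(name,StringType,true) is not JSON serializable

I also see this in the stack trace:

raise ValueError("Circular reference detected")

So I'm a bit stuck. Can anyone modify this example so that I can define a dataframe in which the column foo is nullable?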


【Answer 1】:

I know this question has already been answered, but I was looking for a more generic solution when I came up with this:

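def set_df_columns_nullable(spark, df, column_list, nullable=True):
    # Flip the nullable flag on the selected fields of the schema object.
    for struct_field in df.schema:
        if struct_field.name in column_list:
            struct_field.nullable = nullable
    # Rebuild the DataFrame so the modified schema actually takes effect.
    df_mod = spark.createDataFrame(df.rdd, df.schema)
    return df_mod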

You can call it like this:

set_df_columns_nullable(spark,df,['name','age'])

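【Comments】:

Great answer. Does this have any performance implications? What exactly happens when you "create a new dataframe" from an existing RDD?

【Answer 2】:

In the general case, you can change a column's nullability through the nullable attribute of that column's StructField. Here is an example: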

df.schema['col_1']
# StructField(col_1,DoubleType,false)

df.schema['col_1'].nullable = True

df.schema['col_1']
# StructField(col_1,DoubleType,true)


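【Answer 3】:

It looks like you missed StructType(newSchema): createDataFrame expects the list of fields to be wrapped in a StructType. Note that foo also has to be declared with nullable=True for the new dataframe to have a nullable foo column.

import pyspark.sql.functions as F
from pyspark.sql.types import *
l = [('Alice', 1)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df = df.withColumn('foo', F.when(df['name'].isNull(),False).otherwise(True))
df.schema.fields
# foo is declared nullable here (third argument True)
newSchema = [StructField('name',StringType(),True), StructField('age',LongType(),True),StructField('foo',BooleanType(),True)]
# wrap the list of fields in a StructType before passing it as the schema
df2 = sqlContext.createDataFrame(df.rdd, StructType(newSchema))
df2.show()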


【Answer 4】:

df1 = df.rdd.toDF()
df1.printSchema()

Output:

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- foo: boolean (nullable = true)
