Read a Parquet file having mixed data types in a column

Posted 2023-04-17


【Title】: Read parquet file having mixed data type in a column 【Posted】: 2015-12-20 18:15:51 【Question】:

I want to read a Parquet file using Spark SQL, in which one column has mixed data types (both string and integer).

import org.apache.spark.sql.SQLContext

// sparkContext is an existing SparkContext
val sqlContext = new SQLContext(sparkContext)
val df = sqlContext.read.parquet("/tmp/data")

This throws an exception: Failed to merge incompatible data types IntegerType and StringType

Is there a way to explicitly cast the column's type during the read?

【Comments】:

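Did you find any way to solve this?

Unfortunately not. It seems the only way is to enforce the schema at write time, so that the read works without errors.

【Answer 1】: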

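The only way I have found is to manually cast one of the fields so that the types match. You can do this by reading the individual Parquet files into a sequence and iteratively fixing them up: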

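import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import scala.util.{Failure, Success, Try}

def unionReduce(dfs: Seq[DataFrame]) = {
  dfs.reduce { (x, y) =>
    // Reduce a schema to (column name, data type) pairs
    def schemaTruncate(df: DataFrame) = df.schema.map(schema => schema.name -> schema.dataType)
    // Columns of y whose name or type has no exact match in x
    val diff = schemaTruncate(y).toSet.diff(schemaTruncate(x).toSet)
    val fixedX = diff.foldLeft(x) { case (df, (name, dataType)) =>
      // Try to cast the existing column; otherwise add it as a typed null
      Try(df.withColumn(name, col(name).cast(dataType))) match {
        case Success(newDf) => newDf
        case Failure(error) => df.withColumn(name, lit(null).cast(dataType))
      }
    }
    // Align x to y's columns and order, then union
    fixedX.select(y.columns.map(col): _*).unionAll(y)
  }
}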

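The function above first finds the columns that are in Y but not in X, differing either by name or by type. It then adds those columns to X by trying to cast the existing column, falling back to a literal null column if that fails. Finally, it selects from the fixed X only the columns present in Y, in case X has columns that Y lacks, and returns the union of the two.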
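The answer does not show the step of reading each Parquet file into a sequence. The following is a minimal sketch of that step combined with an explicit cast, not the answerer's own code. It assumes the name of the mixed column is known in advance and that every file has the same columns in the same order, since `unionAll` matches columns by position. The helper name `readCastingColumn`, the file paths, and the column name `value` are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

// Hypothetical helper: read every file on its own so that no schema merge
// across files is attempted, cast the mixed column to string, then union.
def readCastingColumn(sqlContext: SQLContext,
                      paths: Seq[String],
                      column: String): DataFrame = {
  val frames = paths.map { path =>
    val df = sqlContext.read.parquet(path)
    // withColumn with an existing name replaces that column with the cast version
    df.withColumn(column, col(column).cast(StringType))
  }
  // Assumes all files share the same column order (unionAll is positional)
  frames.reduce(_ unionAll _)
}

// Example call with made-up paths and column name:
// val df = readCastingColumn(sqlContext,
//   Seq("/tmp/data/part-1.parquet", "/tmp/data/part-2.parquet"), "value")
```

Casting to string is chosen here because every integer has a string representation, while the reverse cast would turn non-numeric strings into nulls. When the files differ in more than one column, the sequence of per-file DataFrames can instead be passed to `unionReduce`.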
