Parquet read from HDFS and Schema issue

Posted 2023-04-15


【Title】: Parquet read from HDFS and Schema issue 【Posted】: 2020-02-24 20:02:07 【Question】:

When I read a Parquet file from HDFS, the schema comes back with mixed-case column names. Is there a way to convert them all to lowercase?

df=spark.read.parquet(hdfs_location)

df.printSchema()
root
|-- RecordType: string (nullable = true)
|-- InvestmtAccnt: string (nullable = true)
|-- InvestmentAccntId: string (nullable = true)
|-- FinanceSummaryID: string (nullable = true)
|-- BusinDate: string (nullable = true)

What I need is the schema below:


root
|-- recordtype: string (nullable = true)
|-- investmtaccnt: string (nullable = true)
|-- investmentaccntid: string (nullable = true)
|-- financesummaryid: string (nullable = true)
|-- busindate: string (nullable = true)


【Answer 1】:

First, read the Parquet file:

df=spark.read.parquet(hdfs_location)

Then use the .toDF function to create a DataFrame whose column names are all lowercase:

df=df.toDF(*[c.lower() for c in df.columns])
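
One thing to watch for: lowercasing can make two columns collide if their names differ only in case. The following is a minimal sketch of the same .toDF approach with a collision check added. The helper names `lowercased_names` and `with_lowercase_columns` are hypothetical (they are not part of PySpark), and the sketch assumes `df` is a Spark DataFrame exposing `df.columns` and `df.toDF(*names)` as used above.

```python
from collections import Counter


def lowercased_names(columns):
    """Return the column names lowercased, failing if two of them collide."""
    lowered = [c.lower() for c in columns]
    # Names that appear more than once after lowercasing would be ambiguous.
    duplicates = sorted(name for name, count in Counter(lowered).items() if count > 1)
    if duplicates:
        raise ValueError(f"lowercasing would create duplicate columns: {duplicates}")
    return lowered


def with_lowercase_columns(df):
    """Rename every top-level column of a Spark DataFrame to lowercase."""
    # Same calls as in the answer: df.columns lists the names, toDF renames them.
    return df.toDF(*lowercased_names(df.columns))
```

It could be used as `df = with_lowercase_columns(spark.read.parquet(hdfs_location))`. Note that this only renames top-level columns; field names inside nested struct columns are left unchanged.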

