What does each section of the Parquet file name written with Apache Hudi represent?

Posted 2023-03-23


【Title】: What does each section of the Parquet file name written with Apache Hudi represent? 【Posted】: 2021-12-28 19:18:20 【Question】:

Apache Hudi writes out each Parquet file with a name like the following:

0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet

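I am trying to understand what each part of the file name represents. This is my current understanding, but I would like anyone who knows to confirm and clarify it.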

0743209d-51cb-4233-a7cd-5bb712fba1ff = file group/file name

-0 = file chunk

20211117172738 = timestamp of the batch

I am not sure what the following part represents:

21-64-5300=?


【Answer 1】:

Here is what I found:

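Hudi file name format: 0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet
The name has three underscore-separated parts, followed by the file extension.
The first part (0743209d-51cb-4233-a7cd-5bb712fba1ff-0) is the unique identifier of the file group (the fileId).
The second part (21-64-5300) is the write token.
The third part (20211117172738) is the commit time (the instantTime).
The write token helps detect Spark write failures.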

// Builds the data file name as <fileId>_<writeToken>_<instantTime><fileExtension>
public static String makeDataFileName(String instantTime, String writeToken, String fileId, String fileExtension) {
    return String.format("%s_%s_%s%s", fileId, writeToken, instantTime, fileExtension);
}
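To check this reading against the file name from the question, the format string can be run in reverse. The following is only a minimal sketch: the class HudiFileNameSketch and its parse and build methods are hypothetical helpers written for this answer, not part of Hudi, and the regular expression assumes that none of the three parts contains an underscore.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; this class is not part of Hudi.
class HudiFileNameSketch {

    // Assumed layout, taken from the format string "%s_%s_%s%s":
    // <fileId>_<writeToken>_<instantTime><fileExtension>
    private static final Pattern DATA_FILE =
            Pattern.compile("^([^_]+)_([^_]+)_([^_.]+)(\\.\\w+)$");

    // Returns {fileId, writeToken, instantTime, fileExtension}.
    static String[] parse(String fileName) {
        Matcher m = DATA_FILE.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected file name: " + fileName);
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    // Mirrors the format string quoted in the answer, so parse and build round-trip.
    static String build(String instantTime, String writeToken, String fileId, String fileExtension) {
        return String.format("%s_%s_%s%s", fileId, writeToken, instantTime, fileExtension);
    }

    public static void main(String[] args) {
        String name = "0743209d-51cb-4233-a7cd-5bb712fba1ff-0_21-64-5300_20211117172738.parquet";
        String[] parts = parse(name);
        System.out.println("fileId        = " + parts[0]);
        System.out.println("writeToken    = " + parts[1]);
        System.out.println("instantTime   = " + parts[2]);
        System.out.println("fileExtension = " + parts[3]);
        // Rebuilding from the parts should give back the original name.
        System.out.println("round trip ok = " + build(parts[2], parts[1], parts[0], parts[3]).equals(name));
    }
}
```

Under this reading, the "-0" suffix belongs to the fileId rather than being a separate field. The answer above does not say how the write token itself is composed. In Hudi's source it appears to be formatted from the Spark task partition ID, stage ID, and task attempt ID, but treat that as an assumption and verify it against the Hudi version you are running.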
