hive daily msck repair needed if new partition not added
Posted: 2020-07-24 18:26:26

[Question description]: I have a Hive table containing data, partitioned on a year-based partition column. Data is loaded into this table every day, but I do not have the option of running msck repair daily. Since the partitions are year-based, do I need to run msck repair after the daily load if no new partition has been added? I tried the following:
// imports assumed for this snippet (not shown in the original post); the .avro(...) writer needs the spark-avro package implicits
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}
import com.databricks.spark.avro._
val data = Seq(Row("1","2020-05-11 15:17:57.188","2020"))
val schemaOrig = List( StructField("key",StringType,true)
                      ,StructField("txn_ts",StringType,true)
                      ,StructField("txn_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
// writes avro files under /test_a/txn_dt=2020/
sourceDf.write.mode("overwrite").partitionBy("txn_dt").avro("/test_a")
HIVE external table
create external table test_a(
key string,
txn_ts string
)
partitioned by (txn_dt string)
stored as avro
location '/test_a';
msck repair table test_a;
select * from test_a;
[Answer 1]: Noticed that if no new partition is added, msck repair is not needed.
msck repair table test_a;
select * from test_a;
+----------------+--------------------------+------------------------+
| test_a.rowkey  | test_a.txn_ts            | test_a.order_entry_dt  |
+----------------+--------------------------+------------------------+
| 1              | 2020-05-11 15:17:57.188  | 2020                   |
+----------------+--------------------------+------------------------+
Now added 1 more row with the same partition value (2020)
val data = Seq(Row("2","2021-05-11 15:17:57.188","2020"))
val schemaOrig = List( StructField("rowkey",StringType,true)
,StructField("txn_ts",StringType,true)
,StructField("order_entry_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
**HIVE QUERY RETURNED 2 ROWS**
select * from test_a;
+----------------+--------------------------+------------------------+
| test_a.rowkey  | test_a.txn_ts            | test_a.order_entry_dt  |
+----------------+--------------------------+------------------------+
| 1              | 2020-05-11 15:17:57.188  | 2020                   |
| 2              | 2021-05-11 15:17:57.188  | 2020                   |
+----------------+--------------------------+------------------------+
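A quick way to confirm why no repair was needed for this step (a sketch added for illustration, not part of the original answer; the partition column name is taken from the output above) is to list the partitions the metastore already knows about. The earlier msck registered the 2020 partition, and the appended file simply landed inside that existing folder:

show partitions test_a;
-- should list the partition registered by the earlier msck, e.g.
-- order_entry_dt=2020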
-- Now tried adding a NEW PARTITION (2021) to see if the select query would return it without msck
val data = Seq(Row("3","2021-05-11 15:17:57.188","2021"))
val schemaOrig = List( StructField("rowkey",StringType,true)
,StructField("txn_ts",StringType,true)
,StructField("order_entry_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
THE QUERY AGAIN RETURNED ONLY 2 ROWS INSTEAD OF 3 without msck repair.
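The 2021 folder now exists on the filesystem, but the metastore has no entry for it yet, which is why the third row is invisible. Running the repair again should sync the metadata and expose it (a sketch of the expected behaviour, not output from the original thread):

msck repair table test_a;
-- the 2021 partition is now registered, so this should return all 3 rows
select * from test_a;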
[Comments]:
If no new partition is added, there is no need to repair the table. The Hive metastore keeps track of the table's partitions, and "repair" simply means syncing that metadata with the partition folders that have been created.
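Since repair only syncs the metastore with the partition folders, a lighter alternative for a year-partitioned table (a sketch, not from the thread; the partition column name and location are assumed from the example above) is to register each new year partition explicitly when it first appears. ADD IF NOT EXISTS is idempotent, so it is safe to re-run even when the partition is already registered:

alter table test_a add if not exists partition (order_entry_dt='2021')
location '/test_a/order_entry_dt=2021';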