hive daily msck repair needed if new partition not added
Posted: 2020-07-24 18:26:26

[Question description]: I have a Hive table containing data, partitioned on a year-based partition column. Data is loaded into this table every day, but I do not have the option of running msck repair daily. Since the partitions are year-based, do I need to run msck repair after the daily load if no new partition has been added? I tried the following:
// imports assumed for this snippet (not shown in the original post); the .avro(...) writer needs the spark-avro package implicits
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}
import com.databricks.spark.avro._
val data = Seq(Row("1","2020-05-11 15:17:57.188","2020"))
val schemaOrig = List( StructField("key",StringType,true)
                      ,StructField("txn_ts",StringType,true)
                      ,StructField("txn_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
// writes avro files under /test_a/txn_dt=2020/
sourceDf.write.mode("overwrite").partitionBy("txn_dt").avro("/test_a")
HIVE external table
create external table test_a(
key string,
txn_ts string
)
partitioned by (txn_dt string)
stored as avro
location '/test_a';
msck repair table test_a;
select * from test_a;
[Answer 1]: Noticed that if no new partition is added, msck repair is not needed.
msck repair table test_a;
select * from test_a;
+----------------+--------------------------+------------------------+
| test_a.rowkey  | test_a.txn_ts            | test_a.order_entry_dt  |
+----------------+--------------------------+------------------------+
| 1              | 2020-05-11 15:17:57.188  | 2020                   |
+----------------+--------------------------+------------------------+
Now added 1 more row with the same partition value (2020)
val data = Seq(Row("2","2021-05-11 15:17:57.188","2020"))
val schemaOrig = List( StructField("rowkey",StringType,true)
,StructField("txn_ts",StringType,true)
,StructField("order_entry_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
**HIVE QUERY RETURNED 2 ROWS**
select * from test_a;
+----------------+--------------------------+------------------------+
| test_a.rowkey  | test_a.txn_ts            | test_a.order_entry_dt  |
+----------------+--------------------------+------------------------+
| 1              | 2020-05-11 15:17:57.188  | 2020                   |
| 2              | 2021-05-11 15:17:57.188  | 2020                   |
+----------------+--------------------------+------------------------+
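A quick way to confirm why no repair was needed for this step (a sketch added for illustration, not part of the original answer; the partition column name is taken from the output above) is to list the partitions the metastore already knows about. The earlier msck registered the 2020 partition, and the appended file simply landed inside that existing folder:

show partitions test_a;
-- should list the partition registered by the earlier msck, e.g.
-- order_entry_dt=2020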
-- Now tried adding a NEW PARTITION (2021) to see if the select query would return it without msck
val data = Seq(Row("3","2021-05-11 15:17:57.188","2021"))
val schemaOrig = List( StructField("rowkey",StringType,true)
,StructField("txn_ts",StringType,true)
,StructField("order_entry_dt",StringType,true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data),StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
THE QUERY AGAIN RETURNED ONLY 2 ROWS INSTEAD OF 3 without msck repair.
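The 2021 folder now exists on the filesystem, but the metastore has no entry for it yet, which is why the third row is invisible. Running the repair again should sync the metadata and expose it (a sketch of the expected behaviour, not output from the original thread):

msck repair table test_a;
-- the 2021 partition is now registered, so this should return all 3 rows
select * from test_a;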
[Comments]:
If no new partition is added, there is no need to repair the table. The Hive metastore keeps track of the table's partitions, and "repair" simply means syncing that metadata with the partition folders that have been created.
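Since repair only syncs the metastore with the partition folders, a lighter alternative for a year-partitioned table (a sketch, not from the thread; the partition column name and location are assumed from the example above) is to register each new year partition explicitly when it first appears. ADD IF NOT EXISTS is idempotent, so it is safe to re-run even when the partition is already registered:

alter table test_a add if not exists partition (order_entry_dt='2021')
location '/test_a/order_entry_dt=2021';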