How to ensure that Parquet files contain the row count in their metadata?
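【Posted】: 2022-01-17 21:32:46
【Question】: Looking at these sources: fast-parquet-row-count-in-spark and parquet-count-metadata-explanation
Those posts and the official Spark documentation say that Parquet files should contain the row count in their metadata. Spark has added it by default since 1.6.
I tried to find this "field" but had no luck. Maybe I am doing something wrong? Can someone tell me how to verify that a given Parquet file has such a field? Links to small, well-formed Parquet files are welcome! At the moment I invoke org.apache.parquet.tools.Main with the arguments meta D:\myparquet_file.parquet and do not see the keyword count anywhere in the output.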
【Answer 1】: You can use parquet-tools to inspect a Parquet file:
- Install parquet-tools:
pip install parquet-tools
- Create a Parquet file. I used Spark to create a small Parquet file with 3 rows:
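import org.apache.spark.sql.DataFrame
import spark.implicits._ // assumes an active SparkSession named spark, e.g. in spark-shell
// Three rows with three integer columns
val df: DataFrame = Seq((1, 2, 3), (4, 5, 6), (7, 8, 9)).toDF("col1", "col2", "col3")
// coalesce(1) so that a single part file is written under data/
df.coalesce(1).write.parquet("data/")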
- Inspect the Parquet file:
parquet-tools inspect /path/to/parquet/file
The output should look like this:
############ file meta data ############
created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
num_columns: 3
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 654
############ Columns ############
col1
col2
col3
############ Column(col1) ############
name: col1
path: col1
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
############ Column(col2) ############
name: col2
path: col2
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
############ Column(col3) ############
name: col3
path: col3
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
Under the file meta data section you can see the num_rows field, which gives the number of rows in the Parquet file.
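If you want to script this check rather than read the report by eye, a minimal sketch is to pull num_rows out of the text that parquet-tools inspect prints. The helper name num_rows_from_inspect and the sample string are hypothetical; the sketch only assumes the "num_rows: N" line format shown in the output above.

```python
import re
from typing import Optional


def num_rows_from_inspect(report: str) -> Optional[int]:
    """Return the num_rows value found in `parquet-tools inspect` text.

    Returns None when the report has no `num_rows:` line, which is the
    case you would want to flag as "row count not found".
    """
    match = re.search(r"^num_rows:\s*(\d+)\s*$", report, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


# Sample text copied from the file meta data section shown above.
sample = """\
############ file meta data ############
created_by: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
num_columns: 3
num_rows: 3
num_row_groups: 1
"""

print(num_rows_from_inspect(sample))  # 3
```

To run it against a real file, pass in the captured standard output of parquet-tools inspect instead of the sample string.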
【Answer 2】: You can find the row count in the RC field next to each row group.
row group 1: RC:148192 TS:10503944 OFFSET:4
The full output of parquet-tools with the meta option is below.
> parquet-tools meta part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
file: file:/Users/matthewropp/team_demo/los-angeles-parking-citations/raw_citations/issue_month=201902/part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet
creator: parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":":created_at","type":"string","nullable":true,"metadata":{}},{"name":":id","type":"string","nullable":true,"metadata":{}},{"name":":updated_at","type":"string","nullable":true,"metadata":{}},{"name":"agency","type":"integer","nullable":true,"metadata":{}},{"name":"body_style","type":"string","nullable":true,"metadata":{}},{"name":"color","type":"string","nullable":true,"metadata":{}},{"name":"fine_amount","type":"integer","nullable":true,"metadata":{}},{"name":"issue_date","type":"date","nullable":true,"metadata":{}},{"name":"issue_time","type":"integer","nullable":true,"metadata":{}},{"name":"latitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"location","type":"string","nullable":true,"metadata":{}},{"name":"longitude","type":"decimal(8,1)","nullable":true,"metadata":{}},{"name":"make","type":"string","nullable":true,"metadata":{}},{"name":"marked_time","type":"string","nullable":true,"metadata":{}},{"name":"meter_id","type":"string","nullable":true,"metadata":{}},{"name":"plate_expiry_date","type":"date","nullable":true,"metadata":{}},{"name":"route","type":"string","nullable":true,"metadata":{}},{"name":"rp_state_plate","type":"string","nullable":true,"metadata":{}},{"name":"ticket_number","type":"string","nullable":false,"metadata":{}},{"name":"vin","type":"string","nullable":true,"metadata":{}},{"name":"violation_code","type":"string","nullable":true,"metadata":{}},{"name":"violation_description","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
--------------------------------------------------------------------------------
: created_at: OPTIONAL BINARY O:UTF8 R:0 D:1
: id: OPTIONAL BINARY O:UTF8 R:0 D:1
: updated_at: OPTIONAL BINARY O:UTF8 R:0 D:1
agency: OPTIONAL INT32 R:0 D:1
body_style: OPTIONAL BINARY O:UTF8 R:0 D:1
color: OPTIONAL BINARY O:UTF8 R:0 D:1
fine_amount: OPTIONAL INT32 R:0 D:1
issue_date: OPTIONAL INT32 O:DATE R:0 D:1
issue_time: OPTIONAL INT32 R:0 D:1
latitude: OPTIONAL INT32 O:DECIMAL R:0 D:1
location: OPTIONAL BINARY O:UTF8 R:0 D:1
longitude: OPTIONAL INT32 O:DECIMAL R:0 D:1
make: OPTIONAL BINARY O:UTF8 R:0 D:1
marked_time: OPTIONAL BINARY O:UTF8 R:0 D:1
meter_id: OPTIONAL BINARY O:UTF8 R:0 D:1
plate_expiry_date: OPTIONAL INT32 O:DATE R:0 D:1
route: OPTIONAL BINARY O:UTF8 R:0 D:1
rp_state_plate: OPTIONAL BINARY O:UTF8 R:0 D:1
ticket_number: REQUIRED BINARY O:UTF8 R:0 D:0
vin: OPTIONAL BINARY O:UTF8 R:0 D:1
violation_code: OPTIONAL BINARY O:UTF8 R:0 D:1
violation_description: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:148192 TS:10503944 OFFSET:4
--------------------------------------------------------------------------------
: created_at: BINARY SNAPPY DO:0 FPO:4 SZ:607/616/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
: id: BINARY SNAPPY DO:0 FPO:611 SZ:2365472/3260525/1.38 VC:148192 ENC:BIT_PACKED,PLAIN,RLE ST:[min: row-2229_y75z.ftdu, max: row-zzzs_4hta.8fub, num_nulls: 0]
: updated_at: BINARY SNAPPY DO:0 FPO:2366083 SZ:602/611/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-28T00:16:06.329Z, max: 2019-03-02T00:20:00.249Z, num_nulls: 0]
agency: INT32 SNAPPY DO:0 FPO:2366685 SZ:4871/5267/1.08 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 58, num_nulls: 0]
body_style: BINARY SNAPPY DO:0 FPO:2371556 SZ:36244/61827/1.71 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WR, num_nulls: 0]
color: BINARY SNAPPY DO:0 FPO:2407800 SZ:111267/111708/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YL, num_nulls: 0]
fine_amount: INT32 SNAPPY DO:0 FPO:2519067 SZ:71989/82138/1.14 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 25, max: 363, num_nulls: 63]
issue_date: INT32 SNAPPY DO:0 FPO:2591056 SZ:20872/23185/1.11 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2019-02-01, max: 2019-02-27, num_nulls: 0]
issue_time: INT32 SNAPPY DO:0 FPO:2611928 SZ:210026/210013/1.00 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 1, max: 2359, num_nulls: 41]
latitude: INT32 SNAPPY DO:0 FPO:2821954 SZ:508049/512228/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 6513161.2, num_nulls: 0]
location: BINARY SNAPPY DO:0 FPO:3330003 SZ:1251364/2693435/2.15 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE ST:[min: , max: ZOMBAR/VALERIO, num_nulls: 0]
longitude: INT32 SNAPPY DO:0 FPO:4581367 SZ:516233/520692/1.01 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 99999.0, max: 1941557.4, num_nulls: 0]
make: BINARY SNAPPY DO:0 FPO:5097600 SZ:147034/150364/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YAMA, num_nulls: 0]
marked_time: BINARY SNAPPY DO:0 FPO:5244634 SZ:11675/17658/1.51 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: 959.0, num_nulls: 0]
meter_id: BINARY SNAPPY DO:0 FPO:5256309 SZ:172432/256692/1.49 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YO97, num_nulls: 0]
plate_expiry_date: INT32 SNAPPY DO:0 FPO:5428741 SZ:149849/152288/1.02 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 2000-02-01, max: 2099-12-01, num_nulls: 18624]
route: BINARY SNAPPY DO:0 FPO:5578590 SZ:38377/45948/1.20 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: WTD, num_nulls: 0]
rp_state_plate: BINARY SNAPPY DO:0 FPO:5616967 SZ:33281/60186/1.81 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: AB, max: XX, num_nulls: 0]
ticket_number: BINARY SNAPPY DO:0 FPO:5650248 SZ:801039/2074791/2.59 VC:148192 ENC:BIT_PACKED,PLAIN ST:[min: 1020798376, max: 4350802142, num_nulls: 0]
vin: BINARY SNAPPY DO:0 FPO:6451287 SZ:64/60/0.94 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: , num_nulls: 0]
violation_code: BINARY SNAPPY DO:0 FPO:6451351 SZ:94784/131071/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: 000, max: 8942, num_nulls: 0]
violation_description: BINARY SNAPPY DO:0 FPO:6546135 SZ:95937/132641/1.38 VC:148192 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE ST:[min: , max: YELLOW ZONE, num_nulls: 0]
> parquet-tools dump -m -c make part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet | head -20
BINARY make
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 148192 ***
value 1: R:0 D:1 V:HYDA
value 2: R:0 D:1 V:NISS
value 3: R:0 D:1 V:NISS
value 4: R:0 D:1 V:TOYO
value 5: R:0 D:1 V:AUDI
value 6: R:0 D:1 V:MERC
value 7: R:0 D:1 V:LEX
value 8: R:0 D:1 V:BMW
value 9: R:0 D:1 V:GMC
value 10: R:0 D:1 V:HOND
value 11: R:0 D:1 V:TOYO
value 12: R:0 D:1 V:NISS
value 13: R:0 D:1 V:
value 14: R:0 D:1 V:THOR
value 15: R:0 D:1 V:DODG
value 16: R:0 D:1 V:DODG
value 17: R:0 D:1 V:HOND
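Since RC is reported per row group, a file with several row groups needs the RC values added up to get its total row count. The following is a rough sketch that does this from the text of parquet-tools meta. The function names and the two-group sample are hypothetical; the sketch only assumes the "row group N: RC:..." line format shown above.

```python
import re
from typing import Dict

# Matches lines such as "row group 1: RC:148192 TS:10503944 OFFSET:4".
ROW_GROUP_LINE = re.compile(r"^row group (\d+):\s+RC:(\d+)\b", re.MULTILINE)


def row_counts_by_group(meta_output: str) -> Dict[int, int]:
    """Map each row group number to its RC (row count) value."""
    return {int(group): int(rc) for group, rc in ROW_GROUP_LINE.findall(meta_output)}


def total_rows(meta_output: str) -> int:
    """Total rows in the file: the sum of RC over all row groups."""
    return sum(row_counts_by_group(meta_output).values())


# The row group line from the output above.
single_group = "file schema: spark_schema\nrow group 1: RC:148192 TS:10503944 OFFSET:4\n"
print(total_rows(single_group))  # 148192

# Made-up two-group output, only to illustrate the summation.
two_groups = "row group 1: RC:10 TS:100 OFFSET:4\nrow group 2: RC:5 TS:50 OFFSET:104\n"
print(total_rows(two_groups))  # 15
```

For the file above there is a single row group, so the total equals its RC of 148192.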
【Comments】:
Not clear :) RC:148192 - is that the row count?
I believe it is the row count for row group 1. That is evident from the output: *** row group 1 of 1, values 1 to 148192 ***. Correct me if I'm wrong.