COUNT(*) 在配置单元中返回多行而不是一
Posted
技术标签:
【中文标题】COUNT(*) 在配置单元中返回多行而不是一【英文标题】:COUNT(*) returning multiple rows in hive instead one 【发布时间】:2020-04-15 12:41:51 【问题描述】:为什么 COUNT(*) 在 Hive 中返回 2 个计数 观察到两个返回值的总和等于表中记录的实际计数。请注意,我没有使用任何 where , group by 子句询问。下面是我的创建查询
CREATE TABLE `employee`(
`file_id` string,
`record_id` bigint,
`emp_id` bigint,
`dept` string,
`salary` bigint
)
PARTITIONED BY (
`load_date` string)
CLUSTERED BY (
`emp_id`)
INTO 7 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://data/schema/emp/employee'
TBLPROPERTIES (
'bucketing_version'='2',
'compactor.mapreduce.map.memory.mb'='2048',
'compactorthreshold.hive.compactor.delta.num.threshold'='4',
'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',
'discover.partitions'='true',
'last_modified_by'='pately',
'last_modified_time'='1579080979',
'orc.compress'='ZLIB',
'transactional'='true',
'transactional_properties'='default',
'transient_lastDdlTime'='1579080979');
我正在尝试如下选择计数
select count(*) from `employee`;
减少任务已设置为 1
set mapred.reduce.tasks;
+-------------------------+
| set |
+-------------------------+
| mapred.reduce.tasks=-1 |
+-------------------------+
结果:
INFO : Compiling command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428): select count(*) from employee
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428); Time taken: 0.176 seconds
INFO : Executing command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428): select count(*) from employee
INFO : Query ID = dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428
INFO : Session is already open
INFO : Dag name: select count(*) from employee (Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO : Session re-established.
INFO : Status: Running (Executing on YARN cluster with App id application_1586778720046_0798)
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 2 2 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 6.14 s
----------------------------------------------------------------------------------------------
INFO : Status: DAG finished successfully in 6.12 seconds
INFO :
INFO : Query Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : OPERATION DURATION
INFO : ----------------------------------------------------------------------------------------------
INFO : Compile Query 0.18s
INFO : Prepare Plan 0.06s
INFO : Get Query Coordinator (AM) 0.00s
INFO : Submit Plan 5.75s
INFO : Start DAG 0.89s
INFO : Run DAG 6.12s
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO : Map 1 2587.00 15,520 147 12,506 0
INFO : ----------------------------------------------------------------------------------------------
INFO :
INFO : org.apache.tez.common.counters.DAGCounter:
INFO : NUM_SUCCEEDED_TASKS: 2
INFO : TOTAL_LAUNCHED_TASKS: 2
INFO : RACK_LOCAL_TASKS: 2
INFO : AM_CPU_MILLISECONDS: 4030
INFO : AM_GC_TIME_MILLIS: 0
INFO : File System Counters:
INFO : HDFS_BYTES_READ: 3956004
INFO : HDFS_BYTES_WRITTEN: 232
INFO : HDFS_READ_OPS: 363
INFO : HDFS_WRITE_OPS: 4
INFO : HDFS_OP_CREATE: 2
INFO : HDFS_OP_GET_FILE_STATUS: 125
INFO : HDFS_OP_OPEN: 238
INFO : HDFS_OP_RENAME: 2
INFO : org.apache.tez.common.counters.TaskCounter:
INFO : GC_TIME_MILLIS: 147
INFO : TASK_DURATION_MILLIS: 4661
INFO : CPU_MILLISECONDS: 15520
INFO : PHYSICAL_MEMORY_BYTES: 4294967296
INFO : VIRTUAL_MEMORY_BYTES: 14757040128
INFO : COMMITTED_HEAP_BYTES: 4294967296
INFO : INPUT_RECORDS_PROCESSED: 119
INFO : INPUT_SPLIT_LENGTH_BYTES: 3884610
INFO : OUTPUT_RECORDS: 0
INFO : HIVE:
INFO : CREATED_FILES: 2
INFO : DESERIALIZE_ERRORS: 0
INFO : RECORDS_IN_Map_1: 12506
INFO : RECORDS_OUT_0: 2
INFO : RECORDS_OUT_INTERMEDIATE_Map_1: 0
INFO : RECORDS_OUT_OPERATOR_FS_11: 2
INFO : RECORDS_OUT_OPERATOR_GBY_10: 2
INFO : RECORDS_OUT_OPERATOR_MAP_0: 0
INFO : RECORDS_OUT_OPERATOR_SEL_9: 12506
INFO : RECORDS_OUT_OPERATOR_TS_0: 12506
INFO : TaskCounter_Map_1_INPUT_employee:
INFO : INPUT_RECORDS_PROCESSED: 119
INFO : INPUT_SPLIT_LENGTH_BYTES: 3884610
INFO : TaskCounter_Map_1_OUTPUT_out_Map_1:
INFO : OUTPUT_RECORDS: 0
INFO : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO : GROUPED_INPUT_SPLITS_Map_1: 2
INFO : INPUT_DIRECTORIES_Map_1: 33
INFO : INPUT_FILES_Map_1: 119
INFO : RAW_INPUT_SPLITS_Map_1: 119
INFO : savedToCache: true
INFO : Completed executing command(queryId=dev_hive_20200415150730_0bd6c1d9-1f79-4a42-b7b5-f4a9411b9428); Time taken: 12.819 seconds
INFO : OK
+-------+
| _c0 |
+-------+
| 6949 |
| 5557 |
+-------+
2 rows selected (13.073 seconds)
有人可以帮助为什么像 count 这样的聚合 fxn 会给出多行输出
【问题讨论】:
【参考方案1】:这是 Hive 中的错误。当次要压缩未完成时,它将返回多个值。
解决方案是添加到查询中
select count(*) from table limit 1
【讨论】:
以上是关于COUNT(*) 在配置单元中返回多行而不是一的主要内容,如果未能解决你的问题,请参考以下文章
spark - scala - 如何检查配置单元中是不是存在表