[Experience sharing] Why LIMIT is slow on a large Hive table once a filter condition is added

Posted by barneywill

Reproducing the problem

select id from big_table where name = 'sdlkfjalksdjfla' limit 100;

First, look at the execution plan:

hive> explain select id from big_table where name = 'sdlkfjalksdjfla' limit 100;

OK

STAGE DEPENDENCIES:

  Stage-0 is a root stage

 

STAGE PLANS:

  Stage: Stage-0

    Fetch Operator

      limit: 100

      Processor Tree:

        TableScan

          alias: big_table 

          Statistics: Num rows: 7497189457 Data size: 1499437891589 Basic stats: COMPLETE Column stats: NONE

          Filter Operator

            predicate: (name = 'sdlkfjalksdjfla') (type: boolean)

            Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE

            Select Operator

              expressions: id (type: string)

              outputColumnNames: _col0

              Statistics: Num rows: 3748594728 Data size: 749718945694 Basic stats: COMPLETE Column stats: NONE

              Limit

                Number of rows: 100

                Statistics: Num rows: 100 Data size: 20000 Basic stats: COMPLETE Column stats: NONE

                ListSink

 

Time taken: 0.668 seconds, Fetched: 23 row(s)

The plan contains only one stage, a Fetch Operator, so no distributed job is submitted. Now look at what happens during execution (a thread dump of the Hive CLI process while the query runs):

   java.lang.Thread.State: RUNNABLE

        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)

        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)

        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)

        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

        - locked <0x00000006c1e00cd8> (a sun.nio.ch.Util$2)

        - locked <0x00000006c1e00cc8> (a java.util.Collections$UnmodifiableSet)

        - locked <0x00000006c1e00aa0> (a sun.nio.ch.EPollSelectorImpl)

        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)

        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)

        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)

        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)

        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)

        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)

        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)

        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)

        at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:186)

        at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:146)

        - locked <0x000000076b9bccb0> (a org.apache.hadoop.hdfs.RemoteBlockReader2)

        at org.apache.hadoop.hdfs.BlockReaderUtil.readAll(BlockReaderUtil.java:32)

        at org.apache.hadoop.hdfs.RemoteBlockReader2.readAll(RemoteBlockReader2.java:363)

        at org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1072)

        at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:1000)

        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1333)

        at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:78)

        at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)

        at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:166)

        at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:239)

        at org.apache.orc.impl.RecordReaderImpl.beginReadStripe(RecordReaderImpl.java:858)

        at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:829)

        at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:986)

        at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1021)

        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1057)

        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.ensureBatch(RecordReaderImpl.java:77)

        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.hasNext(RecordReaderImpl.java:89)

        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:231)

        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:206)

        at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:488)

        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)

        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)

        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)

        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)

        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)

        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)

        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)

        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)

        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)

        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

        at java.lang.reflect.Method.invoke(Method.java:497)

        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)

        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

As the stack trace shows, no remote job is submitted at all: the Hive client itself scans the table, reading the ORC data from HDFS in a single thread through FetchOperator. When a selective filter condition is added on a large table and a LIMIT is then applied, the query becomes slow because the client very likely has to scan most or all of the table before it collects enough matching rows to satisfy the LIMIT. This is also why a LIMIT on the same large table without any condition returns quickly: the first rows read already fill the limit.
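Whether Hive compiles such a query into a local fetch or a distributed job is governed by its fetch-task conversion settings. The following is a minimal sketch of two common workarounds, reusing the illustrative big_table/name example above; hive.fetch.task.conversion and hive.fetch.task.conversion.threshold are standard Hive properties, but their defaults and exact behaviour differ between Hive versions, so check your release.

-- Workaround 1: turn fetch-task conversion off for the session, so the filtered
-- LIMIT query is compiled into a regular MapReduce/Tez job and the scan runs in parallel.
set hive.fetch.task.conversion=none;
select id from big_table where name = 'sdlkfjalksdjfla' limit 100;

-- Workaround 2: keep the conversion for small inputs only by capping the amount of
-- data a fetch task may read; larger inputs fall back to a distributed job.
set hive.fetch.task.conversion.threshold=1073741824;  -- in bytes, i.e. 1 GB

Neither setting reduces the amount of data that a highly selective predicate forces Hive to read; they only move the scan from a single thread in the client process to a parallel job.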

 
