Oracle Database Statistics Lab

Posted by dingdingfish

This lab comes from Module 2 of Oracle's Developer Performance course, "Module 2: What are Database Statistics?". This post replays the lab and comments on it.

First, create the two tables used in the lab, bricks and colours, and gather statistics on them:

-- According to the Database Reference, statistics_level defaults to typical; setting it to all additionally collects timed operating system statistics and plan execution statistics.
alter session set statistics_level = all;

create table bricks (
  brick_id         integer not null primary key,
  colour_rgb_value varchar2(10) not null,
  shape            varchar2(10) not null,
  weight           integer not null
);

create table colours (
  colour_rgb_value varchar2(10) not null,
  colour_name      varchar2(10) not null
);

insert into colours values ( 'FF0000', 'red' );
insert into colours values ( '00FF00', 'green' );
insert into colours values ( '0000FF', 'blue' );

insert into bricks
  select rownum,
         case mod ( level, 3 )
           when 0 then 'FF0000'
           when 1 then '00FF00'
           when 2 then '0000FF'
         end,
         case mod ( level, 3 )
           when 0 then 'cylinder'
           when 1 then 'cube'
           when 2 then 'pyramid'
         end,
         floor ( 100 / rownum )
  from   dual
  connect by level <= 100;
  
insert into bricks
  select rownum + 1000,
         case mod ( level, 3 )
           when 0 then 'FF0000'
           when 1 then '00FF00'
           when 2 then '0000FF'
         end,
         case mod ( level, 3 )
           when 0 then 'cylinder'
           when 1 then 'cube'
           when 2 then 'pyramid'
         end,
         floor ( 200 / rownum )
  from   dual
  connect by level <= 200;

commit;

declare
  stats dbms_stats.statrec;
  distcnt  number; 
  density  number;
  nullcnt  number; 
  avgclen  number;
begin

  dbms_stats.gather_table_stats ( null, 'colours' );
  dbms_stats.gather_table_stats ( null, 'bricks' );
  dbms_stats.set_table_stats ( null, 'bricks', numrows => 30 );
  dbms_stats.set_table_stats ( null, 'colours', numrows => 3000 );
  dbms_stats.get_column_stats ( null, 'colours', 'colour_rgb_value', 
    distcnt => distcnt, 
    density => density,
    nullcnt => nullcnt, 
    avgclen => avgclen,
    srec => stats
  );
  stats.minval := utl_raw.cast_to_raw ( '0000FF' );
  stats.maxval := utl_raw.cast_to_raw ( 'FF0000' );
  dbms_stats.set_column_stats ( null, 'colours', 'colour_rgb_value', distcnt => 10, srec => stats );
  dbms_stats.set_column_stats ( null, 'bricks', 'colour_rgb_value', distcnt => 10, srec => stats );

end;
/

After the inserts, the colours table has only 3 rows:

select /*ansiconsole*/ * from colours
COLOUR_RGB_VALUE   COLOUR_NAME   
FF0000              red            
00FF00              green          
0000FF              blue     

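The bricks table is loaded in two batches: the first inserts 100 rows with brick_id from 1 to 100, and the second inserts 200 rows with brick_id from 1001 to 1200. There are 3 colours, which cycle with brick_id, and 3 shapes (cylinder, cube, and pyramid), which also cycle with brick_id. The weight is floor ( 100 / rownum ) in the first batch and floor ( 200 / rownum ) in the second, so the weights are distributed very unevenly.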
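To see how uneven the weights are without a database, the two inserts can be mirrored in a few lines of Python. This is only a sketch of the arithmetic in the INSERT statements; the variable names are illustrative and not part of the lab.

```python
from collections import Counter

# Mirror the two INSERT ... CONNECT BY statements:
# first batch  -> floor(100 / rownum) for rownum 1..100
# second batch -> floor(200 / rownum) for rownum 1..200
weights = [100 // r for r in range(1, 101)] + [200 // r for r in range(1, 201)]

counts = Counter(weights)            # weight value -> number of rows
total_rows = len(weights)
distinct_weights = len(counts)
rows_with_weight_1 = counts[1]

print("rows:", total_rows)
print("distinct weights:", distinct_weights)
print("rows with weight = 1:", rows_with_weight_1)
print("three most common weights:", counts.most_common(3))
```

Half of all rows end up with weight 1, while most other weights occur only a handful of times.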

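Right after the inserts, gather_table_stats gathers the table and column statistics (get_column_stats only reads the existing column statistics into a record so that they can be modified afterwards). The correct statistics are as follows: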

select ut.table_name, ut.num_rows, 
       utcs.column_name, utcs.num_distinct, 
       case utc.data_type
         when 'VARCHAR2' then
           utl_raw.cast_to_varchar2 ( utcs.low_value ) 
        when 'NUMBER' then
           to_char ( utl_raw.cast_to_number ( utcs.low_value ) )
       end low_val, 
       case utc.data_type
         when 'VARCHAR2' then
           utl_raw.cast_to_varchar2 ( utcs.high_value ) 
         when 'NUMBER' then
           to_char ( utl_raw.cast_to_number ( utcs.high_value ) )
       end high_val
from   user_tables ut 
join   user_tab_cols utc
on     ut.table_name = utc.table_name
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
and    utc.column_name = utcs.column_name
and ut.table_name in ('BRICKS', 'COLOURS')
order  by ut.table_name, utcs.column_name;

   TABLE_NAME    NUM_ROWS         COLUMN_NAME    NUM_DISTINCT    LOW_VAL    HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS                300 BRICK_ID                        300 1          1200
BRICKS                300 COLOUR_RGB_VALUE                  3 0000FF     FF0000
BRICKS                300 SHAPE                             3 cube       pyramid
BRICKS                300 WEIGHT                           27 1          200
COLOURS                 3 COLOUR_NAME                       3 blue       red
COLOURS                 3 COLOUR_RGB_VALUE                  3 0000FF     FF0000

6 rows selected.

To mislead the optimizer, the script then uses set_table_stats and set_column_stats to overwrite the table and column statistics. The modified statistics are:

   TABLE_NAME    NUM_ROWS         COLUMN_NAME    NUM_DISTINCT    LOW_VAL    HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS                 30 BRICK_ID                        300 1          1200
BRICKS                 30 COLOUR_RGB_VALUE                 10 0000FF     FF0000
BRICKS                 30 SHAPE                             3 cube       pyramid
BRICKS                 30 WEIGHT                           27 1          200
COLOURS              3000 COLOUR_NAME                       3 blue       red
COLOURS              3000 COLOUR_RGB_VALUE                 10 0000FF     FF0000

6 rows selected.

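The null argument passed to the procedures above means the current schema.

Specifically, the following were changed:

  • bricks: number of rows changed from 300 to 30
  • colours: number of rows changed from 3 to 3000
  • bricks.colour_rgb_value: num_distinct changed from 3 to 10
  • colours.colour_rgb_value: num_distinct changed from 3 to 10

At this point the statistics are badly off.

Row Statistics

Run the following SQL and fetch its execution plan: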

set pages 9999
set lines 120

select /*+ gather_plan_statistics */ c.colour_name, count (*)
from   bricks b
join   colours c
on     c.colour_rgb_value = b.colour_rgb_value
group  by c.colour_name;

select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

------------------------------------------------------------------
| Id  | Operation           | Name    | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
|   0 | SELECT STATEMENT    |         |      1 |        |      3 |
|   1 |  HASH GROUP BY      |         |      1 |      3 |      3 |
|*  2 |   HASH JOIN         |         |      1 |   9000 |    300 |
|   3 |    TABLE ACCESS FULL| BRICKS  |      1 |     30 |    300 |
|   4 |    TABLE ACCESS FULL| COLOURS |      1 |   3000 |      3 |
------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("C"."COLOUR_RGB_VALUE"="B"."COLOUR_RGB_VALUE")

Start with the first important principle:

The optimizer prefers to start a join with the table returning the fewest rows.

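The optimizer prefers to start a join with the table that returns the fewest rows. In the execution plan, A-Rows is the number of rows actually returned. This plan clearly violates the principle: it starts with bricks, which returns 300 rows, and only then processes colours, which returns 3 rows. It was misled by the skewed statistics. Note that "fewest rows" is measured after the where conditions are applied: a table with 100 million rows still counts as small here if only a few of its rows are returned. In other words, the most selective filter (the one that discards the most rows) should be applied first; the cost-based optimizer decides this from the statistics, not from the order in which conditions appear in the WHERE clause.

E-Rows is the number of rows the optimizer estimated, and Starts is the number of times the operation ran. The second principle: if E-Rows * Starts is close to A-Rows, the estimates are accurate, which usually indicates a good execution plan.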
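The 9000 in the plan can be reproduced with the textbook equi-join estimate, rows_a * rows_b / max(ndv_a, ndv_b). The sketch below assumes that formula; the article does not say which formula Oracle applied, and the function name is made up for illustration.

```python
def join_cardinality(rows_a, rows_b, ndv_a, ndv_b):
    """Textbook equi-join estimate (an assumption, not taken from the lab)."""
    return rows_a * rows_b / max(ndv_a, ndv_b)

# Tampered statistics: bricks 30 rows, colours 3000 rows, num_distinct 10 on both sides.
bad_estimate = join_cardinality(30, 3000, 10, 10)
# Correct statistics: bricks 300 rows, colours 3 rows, num_distinct 3 on both sides.
good_estimate = join_cardinality(300, 3, 3, 3)

actual_rows = 300  # A-Rows of the HASH JOIN step in the plan above
starts = 1         # Starts of the same step

for label, e_rows in (("tampered", bad_estimate), ("correct", good_estimate)):
    ratio = e_rows * starts / actual_rows
    print(f"{label}: E-Rows={e_rows:.0f}, (E-Rows * Starts) / A-Rows = {ratio:.1f}")
```

With the tampered statistics the estimate is 30 times the actual row count; with the correct ones the same formula lands on the actual 300.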

So we need to correct the statistics:

exec dbms_stats.gather_table_stats ( null, 'colours' ) ;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;

SELECT
    ut.table_name,
    ut.num_rows,
    utcs.column_name,
    utcs.num_distinct
FROM
         user_tables ut
    JOIN user_tab_col_statistics utcs ON ut.table_name = utcs.table_name
WHERE
    ut.table_name IN ( 'COLOURS', 'BRICKS' );

   TABLE_NAME    NUM_ROWS         COLUMN_NAME    NUM_DISTINCT
_____________ ___________ ___________________ _______________
BRICKS                300 BRICK_ID                        300
BRICKS                300 COLOUR_RGB_VALUE                  3
BRICKS                300 SHAPE                             3
BRICKS                300 WEIGHT                           27
COLOURS                 3 COLOUR_RGB_VALUE                  3
COLOURS                 3 COLOUR_NAME                       3

6 rows selected.

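Oracle can gather statistics automatically in two situations:

  • During the daily maintenance window
  • When the number of changed rows in a table exceeds a threshold (10% by default)

The threshold can be changed. For example, the following sets it to 1% for the colours table: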

select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;
exec dbms_stats.set_table_prefs ( null, 'colours', 'STALE_PERCENT', 1 );
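As a simplified mental model of the threshold (not Oracle's internal logic, and with a hypothetical function name), staleness can be thought of as comparing the number of changed rows with a percentage of the recorded row count:

```python
def is_stale(num_rows, changed_rows, stale_percent=10):
    """Simplified model: stale once changed rows exceed stale_percent of num_rows."""
    return changed_rows > num_rows * stale_percent / 100

# 20 changed rows out of 300: below the default 10% threshold (30 rows).
print(is_stale(300, 20))
# The same 20 changes against a 1% threshold (3 rows) count as stale.
print(is_stale(300, 20, stale_percent=1))
```

Lowering STALE_PERCENT therefore makes a table qualify for re-gathering after far fewer changes.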

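Refreshing statistics may lead the optimizer to generate a new execution plan for a SQL statement, but the database decides for itself when to invalidate the existing plans (invalidating cursors). So after statistics are updated, execution plans do not necessarily change right away. You can force immediate invalidation with: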

exec dbms_stats.gather_table_stats ( null, 'colours', no_invalidate => false ) ;

Here, cursor refers to an execution plan cached in the cursor cache; see How do I display and read the execution plans for a SQL statement:

A dictionary view that shows the execution plan for a SQL statement that has been compiled into a cursor in the cursor cache.

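At this point the row statistics (num_rows and num_distinct) are correct.

Column Histograms: Handling Data Skew

Histogram is usually rendered in Chinese as 直方图, which does not feel like the best translation.

Now let's look at data skew.

Run the following SQL: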

SELECT /*+ gather_plan_statistics */ COUNT(*)
FROM
    bricks
WHERE
    weight = 1;

SELECT
    *
FROM
    TABLE ( dbms_xplan.display_cursor(format => 'ROWSTATS LAST') );

----------------------------------------------------------------
| Id  | Operation          | Name   | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |      1 |        |      1 |
|   1 |  SORT AGGREGATE    |        |      1 |      1 |      1 |
|*  2 |   TABLE ACCESS FULL| BRICKS |      1 |     11 |    150 |
----------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("WEIGHT"=1)

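Clearly this is not a good estimate. A-Rows is 150, so why did E-Rows come out at only 11? Because the optimizer assumes the values are uniformly distributed: 300 rows spread over 27 distinct weights is about 11 rows per value. This is an example of value skew, one form of data skew: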

SQL> select floor(count(*) / count(distinct weight)) from bricks;

   FLOOR(COUNT(*)/COUNT(DISTINCTWEIGHT))
________________________________________
                                      11

Data skew has another form, related to how the values are spread across their range: range skew. For example:

select /*+ gather_plan_statistics */count (*) from bricks
where  brick_id between 0 and 100;

select * 
from   table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

select /*+ gather_plan_statistics */count (*) from bricks
where  brick_id between 400 and 500;

select * 
from   table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

select /*+ gather_plan_statistics */count (*) from bricks
where  brick_id between 1000 and 1100;

select * 
from   table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));

Their execution plans are:

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     26 |    100 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=0 AND "BRICK_ID"<=100)

...

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     27 |      0 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=400 AND "BRICK_ID"<=500)

...

--------------------------------------------------------------------
| Id  | Operation         | Name        | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |      1 |        |      1 |
|   1 |  SORT AGGREGATE   |             |      1 |      1 |      1 |
|*  2 |   INDEX RANGE SCAN| SYS_C008566 |      1 |     27 |    100 |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BRICK_ID">=1000 AND "BRICK_ID"<=1100)

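Why 26 or 27? Each predicate covers a range of 100, which is 1/12 of the 1200-wide range of brick_id, and there are 300 brick_id values in total. 300 / 12 = 25, which is close to 26 and 27.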

select min(brick_id), max(brick_id), count(distinct(brick_id)) from bricks;
MIN(BRICK_ID) MAX(BRICK_ID) COUNT(DISTINCT(BRICK_ID))
------------- ------------- -------------------------
            1          1200                       300
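The reasoning above can be checked with a short sketch. The first function is the article's own approximation. The second adds 2 / num_distinct for the two inclusive endpoints, a commonly cited rule of thumb for closed ranges that is an assumption here, not something the lab states, and it does not try to model how an out-of-range bound such as 0 is handled.

```python
brick_ids = list(range(1, 101)) + list(range(1001, 1201))
num_rows = len(brick_ids)                       # 300
low, high = min(brick_ids), max(brick_ids)      # 1 and 1200
num_distinct = len(set(brick_ids))              # 300

def article_estimate(lo, hi):
    """The article's approximation: (range width / 1200) * 300 rows."""
    return num_rows * (hi - lo) / high

def refined_estimate(lo, hi):
    """ASSUMPTION: closed-range rule of thumb, not confirmed by the lab."""
    return num_rows * ((hi - lo) / (high - low) + 2 / num_distinct)

def actual_rows(lo, hi):
    """Count the rows that really fall inside the inclusive range."""
    return sum(lo <= b <= hi for b in brick_ids)

for lo, hi in ((0, 100), (400, 500), (1000, 1100)):
    print(lo, hi, article_estimate(lo, hi), round(refined_estimate(lo, hi)), actual_rows(lo, hi))
```

Whatever the exact formula, a uniform-spread estimate returns roughly the same number for all three ranges, while the real counts are 100, 0, and 100.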

The actual distribution is as follows, and matches A-Rows:

with rws as (
  select level r from dual
  connect by level <= 15
)
  select (r-1) * 100 as lower_limit, r * 100 as upper_limit, count ( brick_id ) 
  from   rws
  left   join bricks
  on     ceil ( brick_id / 100 ) = r
  group  by r
  order  by r;

LOWER_LIMIT UPPER_LIMIT COUNT(BRICK_ID)
----------- ----------- ---------------
          0         100             100
        100         200               0
        200         300               0
        300         400               0
        400         500               0
        500         600               0
        600         700               0
        700         800               0
        800         900               0
        900        1000               0
       1000        1100             100
       1100        1200             100
       1200        1300               0
       1300        1400               0
       1400        1500               0

15 rows selected. 

To fix this, histograms need to be created on the relevant columns. First confirm that none exist yet, which shows up as HISTOGRAM being NONE and NUM_BUCKETS being 1:

select utcs.column_name, utcs.histogram, utcs.num_buckets
from   user_tables ut
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
where  ut.table_name = 'BRICKS'
and    utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );

   COLUMN_NAME    HISTOGRAM    NUM_BUCKETS
______________ ____________ ______________
BRICK_ID       NONE                      1
WEIGHT         NONE                      1

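Gather statistics again; this time histograms are created on the columns automatically. Why now? Because the data has both value skew and range skew, and the statements run earlier used these columns in their predicates, which satisfies the conditions listed below.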

exec dbms_stats.gather_table_stats ( null, 'bricks', no_invalidate => false ) ;

select utcs.column_name, utcs.histogram, utcs.num_buckets
from   user_tables ut
join   user_tab_col_statistics utcs
on     ut.table_name = utcs.table_name
where  ut.table_name = 'BRICKS'
and    utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );

   COLUMN_NAME    HISTOGRAM    NUM_BUCKETS
______________ ____________ ______________
BRICK_ID       HYBRID                  254
WEIGHT         FREQUENCY                27

The conditions are:

  1. The column has value skew and statements use the column in range (<, >=, etc.), LIKE, or equality conditions
  2. The column has range skew and the column is used in range or LIKE conditions.
  3. The column has a small number of distinct values (with some repeated values) and the column is used in range (<, >=, etc.), LIKE, or equality conditions
  4. It may also capture histograms when incremental statistics are used, even if there is no skew. These are ignored by optimizer stats and are out-of-scope for this tutorial.

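There are 4 types of histogram:

  • Frequency: used when the column's cardinality (number of distinct values) is small.
  • Height-balanced: (legacy)
  • Hybrid: used when the column's cardinality is large.
  • Top-frequency: used when the column's cardinality is large.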
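As a rough illustration of why a frequency histogram fixes the weight = 1 estimate, it can be modelled as one bucket per distinct value holding that value's row count. This is a simplified model with made-up function names, not Oracle's internal representation.

```python
from collections import Counter

# Same weights as the bricks table: floor(100/rownum) and floor(200/rownum).
weights = [100 // r for r in range(1, 101)] + [200 // r for r in range(1, 201)]

# One bucket per distinct value, storing how many rows carry that value.
frequency_histogram = Counter(weights)

def estimate_without_histogram():
    """Uniform assumption: every distinct value gets an equal share of rows."""
    return len(weights) // len(frequency_histogram)

def estimate_with_histogram(value):
    """Look the value up in its own bucket; unknown values estimate to 0 here."""
    return frequency_histogram.get(value, 0)

print("buckets:", len(frequency_histogram))
print("weight = 1 without histogram:", estimate_without_histogram())
print("weight = 1 with histogram:", estimate_with_histogram(1))
```

The bucket count matches the NUM_BUCKETS of 27 reported for WEIGHT, and the lookup returns the 150 rows seen in A-Rows instead of the uniform guess of 11.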

The following SQL disables histogram collection on columns; the default method_opt is for all columns size auto

exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt