Oracle Database Statistics Lab
Posted dingdingfish
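This lab comes from Lesson 2 of Oracle's developer performance course, Module 2: What are Database Statistics? This post replays and annotates that lab.
First, create the two tables used in the lab, bricks and colours, and generate statistics for them:
-- According to the Database Reference, statistics_level defaults to typical; setting it to all additionally collects timed operating system statistics and plan execution statistics.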
alter session set statistics_level = all;
create table bricks (
brick_id integer not null primary key,
colour_rgb_value varchar2(10) not null,
shape varchar2(10) not null,
weight integer not null
);
create table colours (
colour_rgb_value varchar2(10) not null,
colour_name varchar2(10) not null
);
insert into colours values ( 'FF0000', 'red' );
insert into colours values ( '00FF00', 'green' );
insert into colours values ( '0000FF', 'blue' );
insert into bricks
select rownum,
case mod ( level, 3 )
when 0 then 'FF0000'
when 1 then '00FF00'
when 2 then '0000FF'
end,
case mod ( level, 3 )
when 0 then 'cylinder'
when 1 then 'cube'
when 2 then 'pyramid'
end,
floor ( 100 / rownum )
from dual
connect by level <= 100;
insert into bricks
select rownum + 1000,
case mod ( level, 3 )
when 0 then 'FF0000'
when 1 then '00FF00'
when 2 then '0000FF'
end,
case mod ( level, 3 )
when 0 then 'cylinder'
when 1 then 'cube'
when 2 then 'pyramid'
end,
floor ( 200 / rownum )
from dual
connect by level <= 200;
commit;
declare
stats dbms_stats.statrec;
distcnt number;
density number;
nullcnt number;
avgclen number;
begin
dbms_stats.gather_table_stats ( null, 'colours' );
dbms_stats.gather_table_stats ( null, 'bricks' );
dbms_stats.set_table_stats ( null, 'bricks', numrows => 30 );
dbms_stats.set_table_stats ( null, 'colours', numrows => 3000 );
dbms_stats.get_column_stats ( null, 'colours', 'colour_rgb_value',
distcnt => distcnt,
density => density,
nullcnt => nullcnt,
avgclen => avgclen,
srec => stats
);
stats.minval := utl_raw.cast_to_raw ( '0000FF' );
stats.maxval := utl_raw.cast_to_raw ( 'FF0000' );
dbms_stats.set_column_stats ( null, 'colours', 'colour_rgb_value', distcnt => 10, srec => stats );
dbms_stats.set_column_stats ( null, 'bricks', 'colour_rgb_value', distcnt => 10, srec => stats );
end;
/
After the inserts, the colours table contains only 3 rows:
select /*ansiconsole*/ * from colours
COLOUR_RGB_VALUE COLOUR_NAME
FF0000 red
00FF00 green
0000FF blue
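The bricks table is loaded in two passes: the first inserts 100 rows with brick_id from 1 to 100, and the second inserts 200 rows with brick_id from 1001 to 1200. There are three colours, cycling with brick_id, and three shapes (cylinder, cube and pyramid), also cycling with brick_id. The weight is floor ( 100 / rownum ) in the first pass and floor ( 200 / rownum ) in the second, so the weights are distributed very unevenly.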
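To make the unevenness concrete, the two insert passes can be replayed outside the database. The following is a minimal Python sketch, not part of the original lab; the helper name make_bricks and the dictionary layout are made up for illustration, and n stands in for both level and rownum.

```python
from collections import Counter

# Lookup tables mirroring the two CASE mod ( level, 3 ) expressions.
COLOURS = {0: "FF0000", 1: "00FF00", 2: "0000FF"}
SHAPES = {0: "cylinder", 1: "cube", 2: "pyramid"}


def make_bricks(count, id_offset, weight_base):
    """Mimic one `insert ... connect by level <= count` pass (hypothetical helper)."""
    rows = []
    for n in range(1, count + 1):  # n plays the role of both level and rownum
        rows.append({
            "brick_id": n + id_offset,
            "colour_rgb_value": COLOURS[n % 3],
            "shape": SHAPES[n % 3],
            "weight": weight_base // n,  # floor ( weight_base / rownum )
        })
    return rows


# First pass: ids 1..100; second pass: ids 1001..1200.
bricks = make_bricks(100, 0, 100) + make_bricks(200, 1000, 200)
weights = Counter(row["weight"] for row in bricks)

print(len(bricks))                  # total number of rows
print(len(weights))                 # number of distinct weights
print(min(weights), max(weights))   # lowest and highest weight
print(weights.most_common(3))       # the few weights that dominate the table
```

Half of the rows end up with weight 1, while most of the other weights occur only a handful of times.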
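Right after the inserts, gather_table_stats gathers the table and column statistics (get_column_stats only reads the existing column statistics back so that they can be modified). The correct statistics look like this: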
select ut.table_name, ut.num_rows,
utcs.column_name, utcs.num_distinct,
case utc.data_type
when 'VARCHAR2' then
utl_raw.cast_to_varchar2 ( utcs.low_value )
when 'NUMBER' then
to_char ( utl_raw.cast_to_number ( utcs.low_value ) )
end low_val,
case utc.data_type
when 'VARCHAR2' then
utl_raw.cast_to_varchar2 ( utcs.high_value )
when 'NUMBER' then
to_char ( utl_raw.cast_to_number ( utcs.high_value ) )
end high_val
from user_tables ut
join user_tab_cols utc
on ut.table_name = utc.table_name
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
and utc.column_name = utcs.column_name
and ut.table_name in ('BRICKS', 'COLOURS')
order by ut.table_name, utcs.column_name;
TABLE_NAME NUM_ROWS COLUMN_NAME NUM_DISTINCT LOW_VAL HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS 300 BRICK_ID 300 1 1200
BRICKS 300 COLOUR_RGB_VALUE 3 0000FF FF0000
BRICKS 300 SHAPE 3 cube pyramid
BRICKS 300 WEIGHT 27 1 200
COLOURS 3 COLOUR_NAME 3 blue red
COLOURS 3 COLOUR_RGB_VALUE 3 0000FF FF0000
6 rows selected.
To mislead the optimizer, the script then uses set_table_stats and set_column_stats to overwrite the table and column statistics. The modified statistics are:
TABLE_NAME NUM_ROWS COLUMN_NAME NUM_DISTINCT LOW_VAL HIGH_VAL
_____________ ___________ ___________________ _______________ __________ ___________
BRICKS 30 BRICK_ID 300 1 1200
BRICKS 30 COLOUR_RGB_VALUE 10 0000FF FF0000
BRICKS 30 SHAPE 3 cube pyramid
BRICKS 30 WEIGHT 27 1 200
COLOURS 3000 COLOUR_NAME 3 blue red
COLOURS 3000 COLOUR_RGB_VALUE 10 0000FF FF0000
6 rows selected.
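The null passed as the first argument to the procedures above means the current schema.
The following items were changed:
- the row count of bricks, from 300 to 30
- the row count of colours, from 3 to 3000
- num_distinct of the colour_rgb_value column of bricks, from 3 to 10
- num_distinct of the colour_rgb_value column of colours, from 3 to 10
At this point the statistics are badly wrong.
Row statistics
Run the following SQL and fetch its execution plan: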
set pages 9999
set lines 120
select /*+ gather_plan_statistics */ c.colour_name, count (*)
from bricks b
join colours c
on c.colour_rgb_value = b.colour_rgb_value
group by c.colour_name;
select * from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));
------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 3 |
| 1 | HASH GROUP BY | | 1 | 3 | 3 |
|* 2 | HASH JOIN | | 1 | 9000 | 300 |
| 3 | TABLE ACCESS FULL| BRICKS | 1 | 30 | 300 |
| 4 | TABLE ACCESS FULL| COLOURS | 1 | 3000 | 3 |
------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("C"."COLOUR_RGB_VALUE"="B"."COLOUR_RGB_VALUE")
First, an important principle:
The optimizer prefers to start a join with the table returning the fewest rows.
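That is, the optimizer prefers to begin a join from the table that returns the fewest rows. In the execution plan, A-Rows is the number of rows actually returned. This plan clearly violates the principle: it starts with bricks, which returns 300 rows, and only then processes colours, which returns 3. The optimizer was misled by the distorted statistics. Note that "fewest rows" is measured after the where clause has been applied: a table with 100 million rows that returns only a handful still counts as returning few rows. Put another way, the optimizer wants the most selective conditions (those that filter out the most rows) to be applied as early as possible.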
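E-Rows is the number of rows the optimizer estimated, and Starts is the number of times the operation was executed. The second principle: if E-Rows * Starts is close to A-Rows, the plan is probably a good one.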
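The numbers in the plan above can be checked by hand. The sketch below is in Python rather than SQL and rests on an assumption: the original lab does not state how the optimizer arrived at 9000, and the formula used here is the textbook equi-join estimate (rows_a * rows_b / max(ndv_a, ndv_b)), which happens to reproduce that figure. The function names are invented for illustration.

```python
def join_cardinality(rows_a, rows_b, ndv_a, ndv_b):
    """Textbook equi-join estimate (an assumption, not taken from the lab)."""
    return rows_a * rows_b / max(ndv_a, ndv_b)


def misestimate_factor(e_rows, starts, a_rows):
    """How many times off E-Rows * Starts is from A-Rows (1.0 means spot on).

    Assumes both counts are positive.
    """
    expected = e_rows * starts
    return max(expected, a_rows) / min(expected, a_rows)


# Estimate with the deliberately wrong statistics (30 rows, 3000 rows, 10 distinct values).
print(join_cardinality(30, 3000, 10, 10))
# Estimate with the true statistics (300 rows, 3 rows, 3 distinct values).
print(join_cardinality(300, 3, 3, 3))

# (operation, E-Rows, Starts, A-Rows) copied from the plan above.
plan = [
    ("HASH JOIN", 9000, 1, 300),
    ("TABLE ACCESS FULL BRICKS", 30, 1, 300),
    ("TABLE ACCESS FULL COLOURS", 3000, 1, 3),
]
for operation, e_rows, starts, a_rows in plan:
    print(operation, misestimate_factor(e_rows, starts, a_rows))
```

Every operation that touches the tables is off by at least a factor of ten, which is the signal that the statistics, rather than the SQL, need attention.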
So we need to correct the statistics:
exec dbms_stats.gather_table_stats ( null, 'colours' ) ;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
SELECT
ut.table_name,
ut.num_rows,
utcs.column_name,
utcs.num_distinct
FROM
user_tables ut
JOIN user_tab_col_statistics utcs ON ut.table_name = utcs.table_name
WHERE
ut.table_name IN ( 'COLOURS', 'BRICKS' );
TABLE_NAME NUM_ROWS COLUMN_NAME NUM_DISTINCT
_____________ ___________ ___________________ _______________
BRICKS 300 BRICK_ID 300
BRICKS 300 COLOUR_RGB_VALUE 3
BRICKS 300 SHAPE 3
BRICKS 300 WEIGHT 27
COLOURS 3 COLOUR_RGB_VALUE 3
COLOURS 3 COLOUR_NAME 3
6 rows selected.
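Oracle can gather statistics automatically in two situations:
- during the daily maintenance window
- when the number of changed rows in a table exceeds a threshold (10% by default)
This threshold can be changed. For example, the following lowers it to 1% for the colours table: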
select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;
exec dbms_stats.set_table_prefs ( null, 'colours', 'STALE_PERCENT', 1 );
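The effect of the threshold can be pictured with a simplified model. This Python sketch is only an illustration: the function is_stale is hypothetical, and treating "changed rows" as a single counter compared with a strict greater-than is an assumption, not a description of Oracle's internal bookkeeping.

```python
def is_stale(num_rows, changed_rows, stale_percent=10):
    """Simplified staleness rule: more than stale_percent% of the rows changed."""
    return changed_rows > num_rows * stale_percent / 100


# 20 changed rows in the 300-row bricks table:
print(is_stale(300, 20))                    # with the default 10% threshold
print(is_stale(300, 20, stale_percent=1))   # with the threshold lowered to 1%
```

With the default setting 20 changed rows stay under the 30-row limit, while a 1% threshold (3 rows) already flags the table.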
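Refreshing statistics may lead the optimizer to produce a new execution plan for a SQL statement, but the optimizer decides for itself when to invalidate the previously cached plans (invalidating cursors). So the plan does not necessarily change immediately after the statistics are updated. You can force immediate invalidation with: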
exec dbms_stats.gather_table_stats ( null, 'colours', no_invalidate => false ) ;
The cursor here refers to an execution plan cached in the cursor cache; see How do I display and read the execution plans for a SQL statement:
A dictionary view that shows the execution plan for a SQL statement that has been compiled into a cursor in the cursor cache.
At this point the row statistics (num_rows and num_distinct) are correct.
Column histograms: dealing with data skew
(The usual Chinese rendering of histogram, 直方图, does not feel like the best translation.)
Now let's look at data skew.
Run the following SQL:
SELECT /*+ gather_plan_statistics */ COUNT(*)
FROM
bricks
WHERE
weight = 1;
SELECT
*
FROM
TABLE ( dbms_xplan.display_cursor(format => 'ROWSTATS LAST') );
----------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
----------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | TABLE ACCESS FULL| BRICKS | 1 | 11 | 150 |
----------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("WEIGHT"=1)
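Clearly, this is not a good execution plan. A-Rows is 150, so why does E-Rows estimate only 11? Because the optimizer assumes the values are uniformly distributed. This example demonstrates the form of data skew known as value skew: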
SQL> select floor(count(*) / count(distinct weight)) from bricks;
FLOOR(COUNT(*)/COUNT(DISTINCTWEIGHT))
________________________________________
11
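The same arithmetic, written out as a small Python sketch for readers who want to see the gap as a number. It uses only the figures already shown (300 rows, 27 distinct weights, A-Rows of 150); the function name is made up for illustration.

```python
import math


def uniform_equality_estimate(num_rows, num_distinct):
    """Rows expected for `column = value` if every value were equally common."""
    return num_rows / num_distinct


estimate = uniform_equality_estimate(300, 27)
a_rows = 150  # actual rows for weight = 1, taken from the plan above

print(math.floor(estimate))          # the E-Rows figure shown in the plan
print(round(a_rows / estimate, 1))   # how many times too low the estimate is
```

The estimate is the same for every weight, whether the value occurs 150 times or once, which is exactly what value skew breaks.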
Data skew also comes in another form, related to how the values are spread over their range: range skew. For example:
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 0 and 100;
select *
from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 400 and 500;
select *
from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 1000 and 1100;
select *
from table(dbms_xplan.display_cursor(format => 'ROWSTATS LAST'));
Their execution plans are as follows:
--------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | INDEX RANGE SCAN| SYS_C008566 | 1 | 26 | 100 |
--------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("BRICK_ID">=0 AND "BRICK_ID"<=100)
...
--------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | INDEX RANGE SCAN| SYS_C008566 | 1 | 27 | 0 |
--------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("BRICK_ID">=400 AND "BRICK_ID"<=500)
...
--------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows |
--------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | | 1 |
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |
|* 2 | INDEX RANGE SCAN| SYS_C008566 | 1 | 27 | 100 |
--------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("BRICK_ID">=1000 AND "BRICK_ID"<=1100)
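Why is the estimate 26 or 27? A range of width 100 is about 1/12 of the full range from 1 to 1200, and there are 300 brick_id values in total, so the optimizer expects roughly 300/12 = 25 rows per range, which is close to 26 and 27.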
select min(brick_id), max(brick_id), count(distinct(brick_id)) from bricks;
MIN(BRICK_ID) MAX(BRICK_ID) COUNT(DISTINCT(BRICK_ID))
------------- ------------- -------------------------
1 1200 300
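The reasoning can be replayed in Python. The first function below is just the proportional argument given above. The second adds a 2 / num_distinct correction for the two closed bounds of between; that correction is a commonly cited approximation and is an assumption here, not something stated in the original lab. All function names are invented for illustration.

```python
# brick_id values as loaded by the two insert passes.
brick_ids = list(range(1, 101)) + list(range(1001, 1201))
num_rows, num_distinct = len(brick_ids), len(set(brick_ids))
col_min, col_max = min(brick_ids), max(brick_ids)


def proportional_estimate(low, high):
    """Rows expected if ids were spread evenly between col_min and col_max."""
    return num_rows * (high - low) / (col_max - col_min)


def closed_range_estimate(low, high):
    """Proportional estimate plus a 2 / num_distinct term (assumed formula)."""
    selectivity = (high - low) / (col_max - col_min) + 2 / num_distinct
    return num_rows * selectivity


def actual_rows(low, high):
    """Rows that really satisfy `brick_id between low and high`."""
    return sum(1 for brick_id in brick_ids if low <= brick_id <= high)


for low, high in [(0, 100), (400, 500), (1000, 1100)]:
    print(low, high, round(proportional_estimate(low, high), 2), actual_rows(low, high))

# For a range lying fully inside [col_min, col_max]:
print(round(closed_range_estimate(400, 500), 2))
```

All three ranges get practically the same estimate because they have the same width, yet the real counts are 100, 0 and 100.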
The actual distribution is shown below, and it matches A-Rows:
with rws as (
select level r from dual
connect by level <= 15
)
select (r-1) * 100 as lower_limit, r * 100 as upper_limit, count ( brick_id )
from rws
left join bricks
on ceil ( brick_id / 100 ) = r
group by r
order by r;
LOWER_LIMIT UPPER_LIMIT COUNT(BRICK_ID)
----------- ----------- ---------------
0 100 100
100 200 0
200 300 0
300 400 0
400 500 0
500 600 0
600 700 0
700 800 0
800 900 0
900 1000 0
1000 1100 100
1100 1200 100
1200 1300 0
1300 1400 0
1400 1500 0
15 rows selected.
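To fix these problems, histograms must be generated for the relevant columns. First confirm that there are no histograms yet, which is indicated by NUM_BUCKETS being 1 (and HISTOGRAM being NONE):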
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS'
and utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );
COLUMN_NAME HISTOGRAM NUM_BUCKETS
______________ ____________ ______________
BRICK_ID NONE 1
WEIGHT NONE 1
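Gather statistics again; this time histograms are created for the columns automatically. Why now? Because the data itself has value skew and range skew, and the statements executed earlier against these columns satisfy certain conditions.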
exec dbms_stats.gather_table_stats ( null, 'bricks', no_invalidate => false ) ;
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS'
and utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );
COLUMN_NAME HISTOGRAM NUM_BUCKETS
______________ ____________ ______________
BRICK_ID HYBRID 254
WEIGHT FREQUENCY 27
The conditions are:
- The column has value skew and statements use the column in range (<, >=, etc.), LIKE, or equality conditions
- The column has range skew and the column is used in range or LIKE conditions.
- The column has a small number of distinct values (with some repeated values) and the column is used in range (<, >=, etc.), LIKE, or equality conditions
- The database may also capture histograms when incremental statistics are used, even if there is no skew. The optimizer ignores these, and they are out of scope for this tutorial.
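There are four histogram types:
- Frequency: used when the column has a small number of distinct values (low cardinality), no more than the number of histogram buckets (254 by default).
- Height-balanced: legacy.
- Hybrid: used when the column has a large number of distinct values (high cardinality).
- Top-frequency: used when the column has a large number of distinct values (high cardinality).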
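To show why a frequency histogram fixes the weight = 1 estimate, here is a simplified Python model of one. It is a sketch under assumptions: one bucket per distinct value with a cumulative row count as the endpoint, which is how frequency histograms are usually described, and it is not Oracle's actual storage format. The names are invented for illustration.

```python
from collections import Counter
from itertools import accumulate

# Weights as generated by the two insert passes.
weights = [100 // n for n in range(1, 101)] + [200 // n for n in range(1, 201)]

# One bucket per distinct value; the endpoint is the running total of rows.
frequency = Counter(weights)
values = sorted(frequency)
endpoints = list(accumulate(frequency[value] for value in values))


def histogram_estimate(value):
    """Rows expected for `weight = value`, read straight from the histogram."""
    return frequency.get(value, 0)


print(len(values))                       # number of buckets
print(list(zip(values, endpoints))[:3])  # first few (value, cumulative rows) pairs
print(histogram_estimate(1))             # estimate for weight = 1
```

With one bucket per value the estimate for weight = 1 equals the real row count, and the bucket count matches the NUM_BUCKETS of 27 reported for WEIGHT.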
The following SQL disables histogram collection for the columns; the default value of method_opt is for all columns size auto:
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt