When you generate statistics for a database, you are optimizing the database, which affects the speed of query processing. More complete and accurate statistics generally result in more efficient query execution strategies, which in turn result in faster system performance.
The extent of the statistics to be generated for a database can be modified by various options, including restricting the tables and columns that are used.
Note: Deleting all the rows in a table does not delete the statistics.
In VDBA, use the Optimize Database dialog to generate database statistics for the database that is currently selected in the Database Object Manager window.
For more information, see the online help topic Generating Database Statistics. For a complete description of all the options, see online help for the Optimize Database dialog.
At the command line, you can accomplish this task with the optimizedb command. For more information, see the Command Reference Guide.
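For example, the following command generates statistics for every column of the emp table in the demodb database (the database and table names here are illustrative):

optimizedb demodb -remp

The -rtablename flag restricts processing to the named tables; omitting it processes all tables in the database.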
If a database has not been optimized, the query optimizer assumes that:

1% of the rows satisfy an exact match qualification, such as:

where emp.empno = 275

Note: To override the default of 1% for exact match qualifications, use the Configuration-By-Forms opf_exact_key parameter.

10% of the rows satisfy a range qualification (for example, where emp.empno > 275). Multiple range qualifications on the same table are assumed to be independent, so their selectivities multiply; three range qualifications are estimated to return:

1/10 x 1/10 x 1/10 = 1/1000

of the rows.

Note: To override the default of 10% for range qualifications, use the CBF opf_range_key parameter.

All joins are assumed to be one-to-one, based on the smaller data set; for example, when table1 with 100 rows is joined with table2 with 1000 rows, the estimated result is 100 rows.
If these assumptions are not valid for your data, you must optimize the database by generating statistics for it.
Optimizing a database generally requires disk space, because temporary tables are created.
While the optimization process is running, the locking system takes an exclusive lock for a brief period on the user table being optimized. Whenever possible, tables on which statistics are created are not locked while statistics are gathered. This means that updates to the optimized table are not disabled for long periods. However, it is recommended that you optimize the database during off-hours.
When running optimizedb from the command line, the "-o filename" option can be used to collect the desired statistics in an external file without requiring any database locks. At a later, more convenient time, the statistics can be loaded into the catalog with the "-i filename" option, using the same external file. This approach requires no locks on the user table. For more information on the optimizedb command, see the Command Reference Guide.
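For example, a sketch of this two-step approach (file, database, and table names are illustrative):

optimizedb -o stats.out demodb -remp

optimizedb -i stats.out demodb -remp

The first command gathers statistics for the emp table and writes them to stats.out without touching the catalogs; the second reads stats.out and loads the statistics into the catalog.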
Because optimizing a database adds column statistics and histogram information to the system catalogs, you should run the system modification operation on your database after optimizing it.
Running system modification modifies the system tables in the database to optimize catalog access. You should do this on a database periodically to maintain peak performance.
To run system modification, use either of the following methods.
In VDBA, use the System Modification dialog, as described in online help topic Optimizing System Tables. For a complete description of all the options, see online help for the System Modification dialog.
At the command line, use the sysmod command. For more information, see the Command Reference Guide.
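For example, to run system modification on the demodb database:

sysmod demodb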
Example: Run System Modification in VDBA

To modify only those system tables affected by the optimization process, specify the individual tables in the System Modification dialog. The system catalogs updated by the optimization process are iistatistics and iihistogram.
When you optimize a database, information such as the number of unique values in each column, the repetition factor, the proportion of null values, and a histogram of the data distribution is collected. The query optimizer can use this information to calculate the cost of a particular QEP.
When optimizing a database, you can create several types and levels of statistics by specifying options.
First, you can specify what set of data the statistics are gathered on: all rows of the selected tables (non-sampled statistics) or a percentage of the rows (sampled statistics). Next, either full or minmax statistics can be created, based on the selected data set. This division determines how much information about the distribution of data the statistics can hold.
When generating statistics for a database, by default all rows of the selected tables are used in the generation of statistics. These non-sampled statistics represent the most accurate statistics possible, because all data is considered.
When the base table is large, consider using sampled statistics. With a sufficient sampling, the statistics created are almost identical to statistics created on the full table. The processing for sampled statistics is discussed in greater detail in Sampled Optimizer Statistics.
In VDBA, to specify a percentage of rows to be sampled, enable the Statistics on Sample Data check box, and specify the percentage using the Percentage control in the Optimize Database dialog.
When optimizing a database, full statistics are generated by default. Full statistics carry the most information about data distribution (unless the data is modified significantly after statistics are collected).
The cost of their creation (in terms of system resources used), however, is the highest of all types. For each selected column, the table is scanned once and the column values are retrieved in sorted order. Depending on the availability of indexes on the selected columns, a sort can be required, increasing the cost even further.
The process of generating such complete and accurate statistics can take considerable time, but there are several ways to shorten it.
You can shorten the process of creating full statistics by enabling the Statistics on Sample Data check box in VDBA.
This example generates full statistics with a sampling of 1% of the rows of the emp table:

In the Optimize Database dialog, click Specify Tables. The Specify Tables dialog appears. Select the emp table and click OK. You are returned to the Optimize Database dialog. Enable the Statistics on Sample Data check box, enter 1 in the Percentage control, and click OK.
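At the command line, an equivalent sketch (assuming the optimizedb sampling flag -zs, where -zs1 requests a 1% sample):

optimizedb -zs1 demodb -remp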
Minmax statistics are "cheaper" than full statistics to create. In most cases they require only one scan of the entire table. Statistics created have information only about minimum and maximum values for a column. This can be acceptable if the distribution of values in the column is reasonably even. However, if the values of a particular column are skewed, minmax statistics can mislead the query optimizer and result in poor query plan choices.
In VDBA, to specify minmax statistics, enable the Min Max Values check box in the Optimize Database dialog.
This example generates statistics with only minimum and maximum values for the employee table:
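At the command line, a sketch (assuming -zx is the optimizedb flag for minmax statistics):

optimizedb -zx demodb -remployee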
Key column statistics create full or minmax statistics on key or indexed columns only. These statistics are generated by enabling the Gen Statistics on Keys/Index check box in the Optimize Database dialog. The effect of this option is the same as specifying key and index columns for a table using the Specify Columns dialog. For more information, see Generating Database Statistics in online help. Using the Gen Statistics on Keys/Index check box saves you some work by determining from the catalogs which columns are keys and indexed.
This example generates full statistics for all key and indexed columns in the employee table:
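At the command line, a sketch of this example (assuming -zk is the optimizedb flag for key and indexed columns):

optimizedb -zk demodb -remployee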
To generate minmax statistics on a 1% sampling, enable both the Min Max Values and the Statistics on Sample Data check boxes in the Optimize Database dialog, and enter 1 in the Percentage control.
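The command line equivalent, under the same flag assumptions as above:

optimizedb -zx -zs1 demodb -remployee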
All key and indexed columns in the table are processed regardless of any column designations specified using the Specify Columns dialog. For example, assume that dno is a data column and kno is a key column in the employee table.
The following example for generating full statistics is the same as the first example in this section, except that, in addition to key and index columns, statistics are also generated for the dno column:
Designating the kno column is superfluous, because it is a key column and the Gen Statistics on Keys/Index check box is enabled.
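A command line sketch of this example (same flag assumptions as above; -acolumnname names a column):

optimizedb -zk demodb -remployee -adno -akno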
Statistics can be read in from a text file. The input file must conform to a certain format, which is identical to that produced when you direct output to a file when displaying statistics. Display Optimizer Statistics provides more information.
The file can be edited to reflect changes in data distribution as required before submitting it for use during the optimization process. However, edited statistics can potentially mislead the query optimizer into generating poor query plans. Manually edit statistics only if you have a full understanding of the data and of how the statistics are used in Ingres.
Details on creating and using text files as input when optimizing a database are provided in Statistics in Text Files.
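A sketch of the workflow (assuming statdump accepts -o to write displayed statistics to a file, the same format optimizedb reads back with -i):

statdump -o stats.txt demodb -remp

(edit stats.txt as required)

optimizedb -i stats.txt demodb -remp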
Collecting statistics is generally a time-consuming process, because it requires scanning large amounts of data. The techniques described so far (except for key column statistics) collect statistics on all columns of the indicated tables.
It is not necessary, however, to choose all columns in all tables in your database when optimizing. The query optimizer uses statistics on a column only if needed to restrict data or if it is specified in a join. Therefore, it is a good idea to limit creation of statistics only to those columns used in a where clause.
The DBA or table owner usually understands the table structure and content and is able to predict how the various columns are used in queries. Thus, someone familiar with the table can identify columns that are used in the where clause.
Given these queries:
select name, age from emp
where dept = 'Tech Support';
select e.name, e.salary, b.address
from emp e, bldg b, dept d
where e.dept = d.dname
and d.bldg = b.bldg;
Candidate columns for optimization are:
emp table: dept
dept table: dname, bldg
bldg table: bldg
Based on their use in these sample queries, there is no reason to obtain statistics on employee name, age, salary, or building address. These columns are listed in the target list only, not the where clause of the query.
Columns used in the where clause are often indexed to speed up joins and execution of constraints. If this is the case, specify the Gen Statistics on Keys/Index option to create statistics on key (that is, indexed) columns. However, it is often just as important to create statistics on nonindexed columns referenced in where clauses.
In VDBA, to create statistics on key (that is, indexed) columns, enable the Gen Statistics on Keys/Index check box in the Optimize Database dialog.
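For example, a command line sketch that gathers statistics on just the candidate columns identified above (assuming the -r table and -a column flags of optimizedb):

optimizedb demodb -remp -adept -rdept -adname -abldg -rbldg -abldg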
When you optimize a database, output is generated to show the statistics. For example, if the Print Histogram option was enabled when optimizing the database, and you chose to optimize the name and sex columns of the emp table, the following output is typical:
*** statistics for database demodb version: 00850
*** table emp rows:1536 pages:50 overflow pages:49
*** column name of type varchar (length:30, scale:0, nullable)
date:2000_02_24 15:40:38 GMT unique values:16.000
repetition factor:96.000 unique flag:N complete flag:0
domain:0 histogram cells:32 null count:0.0000 value length:8
cell: 0 count:0.0000 repf:0.0000 value:Abbot \037
cell: 1 count:0.0625 repf:96.0000 value:Abbot
cell: 2 count:0.0000 repf:0.0000 value:Beirne \037
cell: 3 count:0.0625 repf:96.0000 value:Beirne
cell: 4 count:0.0000 repf:0.0000 value:Buchanam
cell: 5 count:0.0625 repf:96.0000 value:Buchanan
cell: 6 count:0.0000 repf:0.0000 value:Cooper \037
cell: 7 count:0.0625 repf:96.0000 value:Cooper
cell: 8 count:0.0000 repf:0.0000 value:Dunham \037
cell: 9 count:0.0625 repf:96.0000 value:Dunham
cell: 10 count:0.0000 repf:0.0000 value:Ganley \037
cell: 11 count:0.0625 repf:96.0000 value:Ganley
cell: 12 count:0.0000 repf:0.0000 value:Hegner \037
cell: 13 count:0.0625 repf:96.0000 value:Hegner
cell: 14 count:0.0000 repf:0.0000 value:Jackson\037
cell: 15 count:0.0625 repf:96.0000 value:Jackson
cell: 16 count:0.0000 repf:0.0000 value:Klietz \037
cell: 17 count:0.0625 repf:96.0000 value:Klietz
cell: 18 count:0.0000 repf:0.0000 value:Myers \037
cell: 19 count:0.0625 repf:96.0000 value:Myers
cell: 20 count:0.0000 repf:0.0000 value:Petersom
cell: 21 count:0.0625 repf:96.0000 value:Peterson
cell: 22 count:0.0000 repf:0.0000 value:Rumpel \037
cell: 23 count:0.0625 repf:96.0000 value:Rumpel
cell: 24 count:0.0000 repf:0.0000 value:Singer \037
cell: 25 count:0.0625 repf:96.0000 value:Singer
cell: 26 count:0.0000 repf:0.0000 value:Stec \037
cell: 27 count:0.0625 repf:96.0000 value:Stec
cell: 28 count:0.0000 repf:0.0000 value:Washings
cell: 29 count:0.0625 repf:96.0000 value:Washingt
cell: 30 count:0.0000 repf:0.0000 value:Zywicki\037
cell: 31 count:0.0625 repf:96.0000 value:Zywicki
unique chars: 14 9 11 11 9 11 6 3
char set densities: 0.5200 0.3333 0.4762 0.6667 0.0952 0.1111 0.0633 0.0238
*** statistics for database demodb version: 00850
*** table emp rows:1536 pages:50 overflow pages:49
*** column sex of type char (length:1, scale:0, nullable)
date:23-feb-2000 10:12:00 unique values:2.000
repetition factor:768.000 unique flag:N complete flag:0
domain:0 histogram cells:4 null count:0.0000000 value length:1
cell: 0 count:0.0000000 repf:0.0000000 value:E
cell: 1 count:0.0006510 repf:1.0000000 value:F
cell: 2 count:0.0000000 repf:0.0000000 value:L
cell: 3 count:0.9993489 repf:1535.0000000 value:M
unique chars: 2
char set densities: 0.1428571
The display items are as follows:

statistics for database
Database name.

version
Version of the catalog from which statistics were derived. Shown only if the version is 00605 or later.

table
Table currently being processed.

rows
Current number of rows in the table, as stored in the iitables catalog.

pages
Number of pages (from the iitables catalog).

overflow pages
Number of overflow pages (from the iitables catalog).

column
Column currently being processed.

of type
Column data type. The length, scale, and nullable indicators are obtained from the iicolumns catalog.

date
Time and date when the statistics were created.

unique values
Number of unique values found in the table.

repetition factor
Average number of rows per unique value. The repetition factor multiplied by the number of unique values approximates the row count.

unique flag
"Y" if the column is unique or nearly unique, "N" if not.

complete flag
Indicates that all possible values for the column exist in this table. When this column is used in a join predicate with some other column, it tells the query optimizer that every value in the other column must also be a value of this column. This knowledge enables the query optimizer to build more accurate query plans for the join.

domain
Not used.

histogram cells
Number of histogram cells used (0 to 500 maximum).

null count
Proportion of column values that are NULL, expressed as a real number between 0.0 and 1.0.

cell value length
Length of the cell values.

cell
For each cell: the cell number, the count (proportion of rows whose values fall into this cell, between 0.0 and 1.0), the average number of duplicates per unique value in the cell (repf), and the upper bound value for the cell.

unique chars
Number of unique characters per character position. Shown only for character columns.

char set densities
Relative density of the character set for each character position. Shown only for character columns.
The number of unique values in the column is calculated. The count listed for each cell is the fraction of all values falling between the lower and upper boundaries of the cell. Statistics for the sex column show that there are no rows with values less than or equal to 'E', 0.06510% of rows with values equal to 'F', no rows with values in the 'G' to 'L' range, and 99.93% of rows with values equal to 'M'. The cell count includes those rows whose column values are greater than the lower cell bound but less than or equal to the upper cell bound. All cell counts sum to 1.0, representing 100% of the table rows.
Looking at the cells for the name column, you can see that 6.25% of the employees' names fall between the lower bound in cell 0, "Abbot \037", and the upper bound in cell 1, "Abbot":
cell: 0 count:0.0000000 repf:0.0000000 value:Abbot \037
cell: 1 count:0.0625000 repf:96.000000 value:Abbot
A restriction such as the following brings back about 6.25% of the rows in the table:
where emp.name = 'Abbot'
The character cell value \037 at the end of the string is octal for the ASCII character that is one less than the blank. Therefore, cell 0 in the name example represents the value immediately preceding 'Abbot' in cell 1. This indicates that the count for cell 1 includes all rows whose name column is exactly 'Abbot.'
In addition to the count and value, each cell of a histogram also contains a repetition factor (labeled "repf" in the statistics output). This is the average number of rows per unique value for each cell, or the "per-cell" repetition factor. The query optimizer uses these values to compute more accurate estimates of the number of rows that result from a join. This is distinct from the repetition factor for the whole column displayed in the header portion of the statistics output.
A histogram can have up to 500 cells. The first cell of a histogram is always an "empty" cell, with count = 0.0. It serves as the lower boundary for all values in the histogram. Thus, all values in the column are greater than the value in the first cell. This first cell is usually not included when discussing number of cells, but it is included when statistics are displayed.
A histogram in which there is a separate cell for each distinct column value is known as an "exact" histogram. If there are more distinct values in the column than cells in the histogram, some sets of contiguous values must be merged into a single cell. Histograms in which some cells represent multiple column values are known as "inexact" histograms.
You can control the number of cells used for both exact and inexact histograms. If a column has more distinct values than the maximum you requested for an exact histogram, you can set the number of cells the data is merged into to that same maximum, or to some other number that seems appropriate. If your data is unevenly distributed, the distribution may not be apparent when merged into an inexact histogram with the default 15 cells; increasing the number of cells can help.
To control the maximum histogram cells, use the Max Cells "Exact" Histogram option in the Optimize Database dialog (the maximum value accepted is 249). You can control the number of cells that your data is merged into if you go beyond the maximum number of unique values using the Max Cells "Inexact" Histogram option in the Optimize Database dialog. By default, the number of cells used when merging into an inexact histogram is 15, and the maximum value is 499.
For example, set the maximum number of unique histogram cells to 200, and if there are more than 200 unique values, merge the histogram into 200 cells. To do this, set both the Max Cells "Exact" Histogram and the Max Cells "Inexact" Histogram options in the Optimize Database dialog to 200.
Set the maximum number of unique histogram cells to 100, and if there are more than 100 unique values, merge the histogram into 50 cells. To do this, set Max Cells "Exact" Histogram to 100 and Max Cells "Inexact" Histogram to 50.
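At the command line, a sketch of this second example (assuming -zn# sets the exact-histogram cell maximum and -zu# sets the inexact-histogram cell count in optimizedb):

optimizedb -zn100 -zu50 demodb -remp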
When using these options, remember that the goal is to reflect the distribution of your data accurately, so that the number of rows resulting from queries that restrict on these columns can be estimated accurately. The query optimizer uses linear interpolation to compute row estimates from an inexact histogram, and the more cells it has to work with, the more accurate the resulting estimates. The cost of building a histogram does not depend on the number of cells it contains, so cost is not a factor when determining how many cells to request.
Because global temporary tables exist only for the duration of an Ingres session, Optimize Database cannot be used to gather statistical information about them. Without histograms, the query optimizer has no knowledge of the value distributions of the columns in a global temporary table. Ingres maintains a reasonably accurate row count for global temporary tables, which the query optimizer can use when compiling a query that accesses such a table.

The row counts alone are usually enough to permit the compilation of efficient query plans for queries that reference global temporary tables, particularly because such tables often contain relatively small data volumes. However, the lack of histograms can cause poor estimates of the number of rows resulting from restriction or join predicates, and these poor estimates can in turn produce inefficient query plans. This typically happens with large global temporary tables, or with columns having skewed value distributions that are not well handled by the query optimizer's default estimation algorithms.
To help deal with such situations, there is a mechanism available to associate "model" histograms with global temporary tables.
To associate "model" histograms with global temporary tables, follow these steps:
When the query optimizer analyzes where clause predicates with columns from a global temporary table, it looks for the catalog definition of a similarly named persistent table with a schema qualifier matching the ID of the executing user or _gtt_model. If one is found, it looks for histograms on similarly named columns whose type and length exactly match those of the global temporary table columns. If these conditions are satisfied, it uses the model histograms.
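For example, a minimal sketch of the model-table approach (the database, table, and column names are illustrative):

create table tmp_orders (
    order_no integer not null,
    amount money
);

After populating tmp_orders with representative data, generate histograms on it:

optimizedb demodb -rtmp_orders -aorder_no -aamount

Thereafter, when a session creates a global temporary table named tmp_orders with matching column names and types, the query optimizer can use these model histograms.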
Not all faulty query plans involving global temporary tables can be improved this way. The modeling technique depends on all or most instances of the global temporary table having similar value distributions in the histogrammed columns. If this is not true, a single model persistent table cannot be representative of them all; it can improve the query plans in some executions of the application but degrade them in others.
Optimization does not necessarily need to be run whenever data is changed or added to the database. Optimization collects statistics that represent percentages of data in ranges and repetition factors. For instance, the statistics collected on employee gender show that 49% of the employees are female and 51% are male. Unless this percentage shifts dramatically, there is no need to rerun optimization on this column, even if the total number of employees changes.
You must rerun optimization if modifications to the database significantly alter the percentage distributions or repetition factors on which the statistics are based.
For example, if you ran complete statistics on the empno column early in your company's history, the repetition factor remains correct, because all employees still have unique employee numbers. But if you use ranges of employee numbers in any way, the histogram information becomes less accurate as new employees are added.
If your company originally had 100 employees (numbered 1 through 100), 10% of the employees have employee numbers greater than 90. If the company then hires an additional 100 employees (numbered 101 through 200), 55% of the employees (110 of 200) have employee numbers greater than 90, but the original histogram information does not reflect this.
Columns that show this type of "receding end" growth and are used in range queries may periodically need to have optimization rerun on them. (Exact match qualifications on employee number are not affected, because the information that all employee numbers are unique is still correct.)
Even if the statistics are not up-to-date, the query results are still correct; only the efficiency of the chosen query plans can suffer.
If statistics are available on a column referenced in a where clause, the query optimizer uses the information to choose the most efficient QEP. Understanding how this information is used can be helpful in analyzing query performance. For more information, see Query Execution Plans.
Two QEPs showing the effect of optimization are presented here. The first is a QEP before optimizing; the second shows the same query after optimization. The query used is a join, where both the r and s tables use the B-tree storage structure:
select * from r, s
where s.a > 4000 and r.a = s.a;
QEP Before Optimization
Before obtaining statistics, the optimizer chooses a full sort-merge (FSM) join, because it assumes that 10% of each table satisfies the qualification "a > 4000," as shown in the QEP diagram below:
QUERY PLAN 4,1, no timeout, of main query
FSM Join(a)
Heap
Pages 1 Tups 267
D15 C44
/ \
Proj-rest Proj-rest
Sorted(a) Sorted(a)
Pages 5 Tups 1235 Pages 1 Tups 267
D11 C12 D4 C3
/ /
r s
B-Tree(a) B-Tree (a)
Pages 172 Tups 12352 Pages 37 Tups 2666
QEP After Optimization
After obtaining statistics, the optimizer chooses a Key join, because only one row satisfies the qualification "a > 4000," as shown in the QEP diagram below:
QUERY PLAN 4,1, no timeout, of main query
K Join(a)
Heap
Pages 1 Tups 1
D4 C1
/ \
Proj-rest r
Sorted(a) B-Tree(a)
Pages 1 Tups 1 Pages 172 Tups 12352
D2 C1
/
s
B-Tree(a)
Pages 37 Tups 2666
The cost of the key join is significantly less than the cost of the FSM join because the join involves far fewer rows.