When you generate statistics for a database, you are optimizing the database, which affects the speed of query processing. More complete and accurate statistics generally result in more efficient query execution strategies, which in turn result in faster system performance.
The extent of the statistics to be generated for a database can be modified by various options, including restricting the tables and columns that are used.
Note: Deleting all the rows in a table does not delete the statistics.
In VDBA, use the Optimize Database dialog to generate database statistics for the database that is currently selected in the Database Object Manager window.
For more information, see the online help topic Generating Database Statistics. For a complete description of all the options, see online help for the Optimize Database dialog.
At the command line, you can accomplish this task with the optimizedb command. For more information, see the Command Reference Guide.
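For example, the following command generates statistics for every column of the emp table in the demodb database (the database and table names here are illustrative):

optimizedb demodb -remp

The -rtablename flag restricts processing to the named tables; omitting it processes all tables in the database.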
If a database has not been optimized, the query optimizer assumes that:

1% of the rows satisfy an exact match qualification, such as:

where emp.empno = 275

Note: To override the default of 1% for exact match qualifications, use the Configuration-By-Forms opf_exact_key parameter.

10% of the rows satisfy a range qualification (for example, where emp.empno > 275). Multiple range qualifications on the same table are assumed to be independent, so their selectivities multiply; three range qualifications are estimated to return:

1/10 x 1/10 x 1/10 = 1/1000

of the rows.

Note: To override the default of 10% for range qualifications, use the CBF opf_range_key parameter.

All joins are assumed to be one-to-one, based on the smaller data set; for example, when table1 with 100 rows is joined with table2 with 1000 rows, the estimated result is 100 rows.
If these assumptions are not valid for your data, you must optimize the database by generating statistics for it.
Optimizing a database generally requires disk space, because temporary tables are created.
While the optimization process is running, the locking system takes an exclusive lock for a brief period on the user table being optimized. Whenever possible, tables on which statistics are created are not locked while statistics are gathered. This means that updates to the optimized table are not disabled for long periods. However, it is recommended that you optimize the database during off-hours.
When running optimizedb from the command line, the "-o filename" option can be used to collect the desired statistics in an external file without requiring any database locks. At a later, more convenient time, the statistics can be loaded into the catalog with the "-i filename" option, using the same external file. This approach requires no locks on the user table. For more information on the optimizedb command, see the Command Reference Guide.
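For example, a sketch of this two-step approach (file, database, and table names are illustrative):

optimizedb -o stats.out demodb -remp

optimizedb -i stats.out demodb -remp

The first command gathers statistics for the emp table and writes them to stats.out without touching the catalogs; the second reads stats.out and loads the statistics into the catalog.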
Because optimizing a database adds column statistics and histogram information to the system catalogs, you should run the system modification operation on your database after optimizing it.
Running system modification modifies the system tables in the database to optimize catalog access. You should do this on a database periodically to maintain peak performance.
To run system modification, use either of the following methods.
In VDBA, use the System Modification dialog, as described in online help topic Optimizing System Tables. For a complete description of all the options, see online help for the System Modification dialog.
At the command line, use the sysmod command. For more information, see the Command Reference Guide.
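For example, to run system modification on the demodb database:

sysmod demodb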
Example: Run System Modification in VDBA

To modify only those system tables affected by the optimization process, specify the individual tables in the System Modification dialog. The system catalogs updated by the optimization process are iistatistics and iihistogram.
When you optimize a database, information such as the number of unique values in each column, the repetition factor, the proportion of null values, and a histogram of the data distribution is collected. The query optimizer can use this information to calculate the cost of a particular QEP.
When optimizing a database, you can create several types and levels of statistics by specifying options.
First, you can specify what set of data the statistics are gathered on: all rows of the selected tables (non-sampled statistics) or a percentage of the rows (sampled statistics). Next, either full or minmax statistics can be created, based on the selected data set. This division determines how much information about the distribution of data the statistics can hold.
When generating statistics for a database, by default all rows of the selected tables are used in the generation of statistics. These non-sampled statistics represent the most accurate statistics possible, because all data is considered.
When the base table is large, consider using sampled statistics. With a sufficient sampling, the statistics created are almost identical to statistics created on the full table. The processing for sampled statistics is discussed in greater detail in Sampled Optimizer Statistics.
In VDBA, to specify a percentage of rows to be sampled, enable the Statistics on Sample Data check box, and specify the percentage using the Percentage control in the Optimize Database dialog.
When optimizing a database, full statistics are generated by default. Full statistics carry the most information about data distribution (unless the data is modified significantly after statistics are collected).
The cost of their creation (in terms of system resources used), however, is the highest of all types. For each selected column, the table is scanned once and the column values are retrieved in sorted order. Depending on the availability of indexes on the selected columns, a sort can be required, increasing the cost even further.
The process of generating such complete and accurate statistics can take considerable time, but there are several ways to shorten it.
You can shorten the process of creating full statistics by enabling the Statistics on Sample Data check box in VDBA.
This example generates full statistics with a sampling of 1% of the rows of the emp table:

In the Optimize Database dialog, click Specify Tables. The Specify Tables dialog appears. Select the emp table and click OK. You are returned to the Optimize Database dialog. Enable the Statistics on Sample Data check box, enter 1 in the Percentage control, and click OK.
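At the command line, an equivalent sketch (assuming the optimizedb sampling flag -zs, where -zs1 requests a 1% sample):

optimizedb -zs1 demodb -remp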
Minmax statistics are "cheaper" than full statistics to create. In most cases they require only one scan of the entire table. Statistics created have information only about minimum and maximum values for a column. This can be acceptable if the distribution of values in the column is reasonably even. However, if the values of a particular column are skewed, minmax statistics can mislead the query optimizer and result in poor query plan choices.
In VDBA, to specify minmax statistics, enable the Min Max Values check box in the Optimize Database dialog.
This example generates statistics with only minimum and maximum values for the employee table:
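At the command line, a sketch (assuming -zx is the optimizedb flag for minmax statistics):

optimizedb -zx demodb -remployee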
Key column statistics create full or minmax statistics on key or indexed columns only. These statistics are generated by enabling the Gen Statistics on Keys/Index check box in the Optimize Database dialog. The effect of this option is the same as specifying key and index columns for a table using the Specify Columns dialog. For more information, see Generating Database Statistics in online help. Using the Gen Statistics on Keys/Index check box saves you some work by determining from the catalogs which columns are keys and indexed.
This example generates full statistics for all key and indexed columns in the employee table:
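At the command line, a sketch of this example (assuming -zk is the optimizedb flag for key and indexed columns):

optimizedb -zk demodb -remployee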
To generate minmax statistics on a 1% sampling, enable both the Min Max Values and the Statistics on Sample Data check boxes in the Optimize Database dialog, and enter 1 in the Percentage control.
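The command line equivalent, under the same flag assumptions as above:

optimizedb -zx -zs1 demodb -remployee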
All key and indexed columns in the table are processed regardless of any column designations specified using the Specify Columns dialog. For example, assume that dno is a data column and kno is a key column in the employee table.
The following example for generating full statistics is the same as the first example in this section, except that, in addition to key and index columns, statistics are also generated for the dno column:
Designating the kno column is superfluous, because it is a key column and the Gen Statistics on Keys/Index check box is enabled.
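A command line sketch of this example (same flag assumptions as above; -acolumnname names a column):

optimizedb -zk demodb -remployee -adno -akno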
Statistics can be read in from a text file. The input file must conform to a certain format, which is identical to that produced when you direct output to a file when displaying statistics. Display Optimizer Statistics provides more information.
The file can be edited to reflect changes in data distribution as required before submitting it for use during the optimization process. However, edited statistics can potentially mislead the query optimizer into generating poor query plans. Manually edit statistics only if you have a full understanding of the data and of how the statistics are used in Ingres.
Details on creating and using text files as input when optimizing a database are provided in Statistics in Text Files.
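A sketch of the workflow (assuming statdump accepts -o to write displayed statistics to a file, the same format optimizedb reads back with -i):

statdump -o stats.txt demodb -remp

(edit stats.txt as required)

optimizedb -i stats.txt demodb -remp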
Collecting statistics is generally a time-consuming process, because it requires scanning large amounts of data. The techniques described so far (except for key column statistics) collect statistics on all columns of the indicated tables.
It is not necessary, however, to choose all columns in all tables in your database when optimizing. The query optimizer uses statistics on a column only if needed to restrict data or if it is specified in a join. Therefore, it is a good idea to limit creation of statistics only to those columns used in a where clause.
The DBA or table owner usually understands the table structure and content and is able to predict how the various columns are used in queries. Thus, someone familiar with the table can identify columns that are used in the where clause.
Given these queries:
select name, age from emp
where dept = 'Tech Support';
select e.name, e.salary, b.address
from emp e, bldg b, dept d
where e.dept = d.dname
and d.bldg = b.bldg;
Candidate columns for optimization are:
emp table: dept
dept table: dname, bldg
bldg table: bldg
Based on their use in these sample queries, there is no reason to obtain statistics on employee name, age, salary, or building address. These columns are listed in the target list only, not the where clause of the query.
Columns used in the where clause are often indexed to speed up joins and execution of constraints. If this is the case, specify the Gen Statistics on Keys/Index option to create statistics on key (that is, indexed) columns. However, it is often just as important to create statistics on nonindexed columns referenced in where clauses.
In VDBA, to create statistics on key (that is, indexed) columns, enable the Gen Statistics on Keys/Index check box in the Optimize Database dialog.
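For example, a command line sketch that gathers statistics on just the candidate columns identified above (assuming the -r table and -a column flags of optimizedb):

optimizedb demodb -remp -adept -rdept -adname -abldg -rbldg -abldg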
When you optimize a database, output is generated to show the statistics. For example, if the Print Histogram option was enabled when optimizing the database, and you chose to optimize the name and sex columns of the emp table, the following output is typical:
*** statistics for database demodb version: 00850
*** table emp rows:1536 pages:50 overflow pages:49
*** column name of type varchar (length:30, scale:0, nullable)
date:2000_02_24 15:40:38 GMT unique values:16.000
repetition factor:96.000 unique flag:N complete flag:0
domain:0 histogram cells:32 null count:0.0000 value length:8
cell: 0 count:0.0000 repf:0.0000 value:Abbot \037
cell: 1 count:0.0625 repf:96.0000 value:Abbot
cell: 2 count:0.0000 repf:0.0000 value:Beirne \037
cell: 3 count:0.0625 repf:96.0000 value:Beirne
cell: 4 count:0.0000 repf:0.0000 value:Buchanam
cell: 5 count:0.0625 repf:96.0000 value:Buchanan
cell: 6 count:0.0000 repf:0.0000 value:Cooper \037
cell: 7 count:0.0625 repf:96.0000 value:Cooper
cell: 8 count:0.0000 repf:0.0000 value:Dunham \037
cell: 9 count:0.0625 repf:96.0000 value:Dunham
cell: 10 count:0.0000 repf:0.0000 value:Ganley \037
cell: 11 count:0.0625 repf:96.0000 value:Ganley
cell: 12 count:0.0000 repf:0.0000 value:Hegner \037
cell: 13 count:0.0625 repf:96.0000 value:Hegner
cell: 14 count:0.0000 repf:0.0000 value:Jackson\037
cell: 15 count:0.0625 repf:96.0000 value:Jackson
cell: 16 count:0.0000 repf:0.0000 value:Klietz \037
cell: 17 count:0.0625 repf:96.0000 value:Klietz
cell: 18 count:0.0000 repf:0.0000 value:Myers \037
cell: 19 count:0.0625 repf:96.0000 value:Myers
cell: 20 count:0.0000 repf:0.0000 value:Petersom
cell: 21 count:0.0625 repf:96.0000 value:Peterson
cell: 22 count:0.0000 repf:0.0000 value:Rumpel \037
cell: 23 count:0.0625 repf:96.0000 value:Rumpel
cell: 24 count:0.0000 repf:0.0000 value:Singer \037
cell: 25 count:0.0625 repf:96.0000 value:Singer
cell: 26 count:0.0000 repf:0.0000 value:Stec \037
cell: 27 count:0.0625 repf:96.0000 value:Stec
cell: 28 count:0.0000 repf:0.0000 value:Washings
cell: 29 count:0.0625 repf:96.0000 value:Washingt
cell: 30 count:0.0000 repf:0.0000 value:Zywicki\037
cell: 31 count:0.0625 repf:96.0000 value:Zywicki
unique chars: 14 9 11 11 9 11 6 3
char set densities: 0.5200 0.3333 0.4762 0.6667 0.0952 0.1111 0.0633 0.0238
*** statistics for database demodb version: 00850
*** table emp rows:1536 pages:50 overflow pages:49
*** column sex of type char (length:1, scale:0, nullable)
date:23-feb-2000 10:12:00 unique values:2.000
repetition factor:768.000 unique flag:N complete flag:0
domain:0 histogram cells:4 null count:0.0000000 value length:1
cell: 0 count:0.0000000 repf:0.0000000 value:E
cell: 1 count:0.0006510 repf:1.0000000 value:F
cell: 2 count:0.0000000 repf:0.0000000 value:L
cell: 3 count:0.9993489 repf:1535.0000000 value:M
unique chars: 2
char set densities: 0.1428571
The display items are as follows:

statistics for database
Database name.

version
Version of the catalog from which statistics were derived. Shown only if the version is 00605 or later.

table
Table currently being processed.

rows
Current number of rows in the table, as stored in the iitables catalog.

pages
Number of pages (from the iitables catalog).

overflow pages
Number of overflow pages (from the iitables catalog).

column
Column currently being processed.

of type
Column data type. The length, scale, and nullable indicators are obtained from the iicolumns catalog.

date
Time and date when the statistics were created.

unique values
Number of unique values found in the table.

repetition factor
Average number of rows per unique value. The repetition factor multiplied by the number of unique values approximates the row count.

unique flag
"Y" if the column is unique or nearly unique, "N" if not.

complete flag
Indicates that all possible values for the column exist in this table. When this column is used in a join predicate with some other column, it tells the query optimizer that every value in the other column must also be a value of this column. This knowledge enables the query optimizer to build more accurate query plans for the join.

domain
Not used.

histogram cells
Number of histogram cells used (0 to 500 maximum).

null count
Proportion of column values that are NULL, expressed as a real number between 0.0 and 1.0.

cell value length
Length of the cell values.

cell
For each cell: the cell number, the count (proportion of rows whose values fall into this cell, between 0.0 and 1.0), the average number of duplicates per unique value in the cell (repf), and the upper bound value for the cell.

unique chars
Number of unique characters per character position. Shown only for character columns.

char set densities
Relative density of the character set for each character position. Shown only for character columns.
The number of unique values in the column is calculated. The count listed for each cell is the fraction of all values falling between the lower and upper boundaries of the cell. Statistics for the sex column show that there are no rows with values less than or equal to 'E', 0.06510% of rows with values equal to 'F', no rows with values in the 'G' to 'L' range, and 99.93% of rows with values equal to 'M'. The cell count includes those rows whose column values are greater than the lower cell bound but less than or equal to the upper cell bound. All cell counts sum to 1.0, representing 100% of the table rows.
Looking at the cells for the name column, you can see that 6.25% of the employees' names fall between the lower bound in cell 0, "Abbot \037", and the upper bound in cell 1, "Abbot":
cell: 0 count:0.0000000 repf:0.0000000 value:Abbot \037
cell: 1 count:0.0625000 repf:96.000000 value:Abbot
A restriction such as the following brings back about 6.25% of the rows in the table:
where emp.name = 'Abbot'
The character cell value \037 at the end of the string is octal for the ASCII character that is one less than the blank. Therefore, cell 0 in the name example represents the value immediately preceding 'Abbot' in cell 1. This indicates that the count for cell 1 includes all rows whose name column is exactly 'Abbot.'
In addition to the count and value, each cell of a histogram also contains a repetition factor (labeled "repf" in the statistics output). This is the average number of rows per unique value for each cell, or the "per-cell" repetition factor. The query optimizer uses these values to compute more accurate estimates of the number of rows that result from a join. This is distinct from the repetition factor for the whole column displayed in the header portion of the statistics output.
A histogram can have up to 500 cells. The first cell of a histogram is always an "empty" cell, with count = 0.0. It serves as the lower boundary for all values in the histogram. Thus, all values in the column are greater than the value in the first cell. This first cell is usually not included when discussing number of cells, but it is included when statistics are displayed.
A histogram in which there is a separate cell for each distinct column value is known as an "exact" histogram. If there are more distinct values in the column than cells in the histogram, some sets of contiguous values must be merged into a single cell. Histograms in which some cells represent multiple column values are known as "inexact" histograms.
You can control the number of cells used for both exact and inexact histograms. If a column has more distinct values than the maximum you requested for an exact histogram, you can set the number of cells the data is merged into to that same maximum, or to some other number that seems appropriate. If your data is unevenly distributed, the distribution may not be apparent when merged into an inexact histogram with the default 15 cells; increasing the number of cells can help.
To control the maximum histogram cells, use the Max Cells "Exact" Histogram option in the Optimize Database dialog (the maximum value accepted is 249). You can control the number of cells that your data is merged into if you go beyond the maximum number of unique values using the Max Cells "Inexact" Histogram option in the Optimize Database dialog. By default, the number of cells used when merging into an inexact histogram is 15, and the maximum value is 499.
For example, set the maximum number of unique histogram cells to 200, and if there are more than 200 unique values, merge the histogram into 200 cells. To do this, set both the Max Cells "Exact" Histogram and the Max Cells "Inexact" Histogram options in the Optimize Database dialog to 200.
Set the maximum number of unique histogram cells to 100, and if there are more than 100 unique values, merge the histogram into 50 cells. To do this, set Max Cells "Exact" Histogram to 100 and Max Cells "Inexact" Histogram to 50.
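At the command line, a sketch of this second example (assuming -zn# sets the exact-histogram cell maximum and -zu# sets the inexact-histogram cell count in optimizedb):

optimizedb -zn100 -zu50 demodb -remp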
When using these options, remember that the goal is to reflect the distribution of your data accurately, so that the number of rows resulting from queries that restrict on these columns can be estimated accurately. The query optimizer uses linear interpolation to compute row estimates from an inexact histogram, and the more cells it has to work with, the more accurate the resulting estimates. The cost of building a histogram does not depend on the number of cells it contains, so cost is not a factor when determining how many cells to request.
Because global temporary tables exist only for the duration of an Ingres session, Optimize Database cannot be used to gather statistical information about them. Without histograms, the query optimizer has no knowledge of the value distributions of the columns in a global temporary table. Ingres maintains a reasonably accurate row count for global temporary tables, which the query optimizer can use when compiling a query that accesses such a table.

The row counts alone are usually enough to permit the compilation of efficient query plans for queries that reference global temporary tables, particularly because such tables often contain relatively small data volumes. However, the lack of histograms can cause poor estimates of the number of rows resulting from restriction or join predicates, and these poor estimates can in turn produce inefficient query plans. This typically happens with large global temporary tables, or with columns having skewed value distributions that are not well handled by the query optimizer's default estimation algorithms.
To help deal with such situations, there is a mechanism available to associate "model" histograms with global temporary tables.
To associate "model" histograms with global temporary tables, follow these steps:
When the query optimizer analyzes where clause predicates with columns from a global temporary table, it looks for the catalog definition of a similarly named persistent table with a schema qualifier matching the ID of the executing user or _gtt_model. If one is found, it looks for histograms on similarly named columns whose type and length exactly match those of the global temporary table columns. If these conditions are satisfied, it uses the model histograms.
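For example, a minimal sketch of the model-table approach (the database, table, and column names are illustrative):

create table tmp_orders (
    order_no integer not null,
    amount money
);

After populating tmp_orders with representative data, generate histograms on it:

optimizedb demodb -rtmp_orders -aorder_no -aamount

Thereafter, when a session creates a global temporary table named tmp_orders with matching column names and types, the query optimizer can use these model histograms.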
Not all faulty query plans involving global temporary tables can be improved this way. The modeling technique depends on all or most instances of the global temporary table having similar value distributions in the histogrammed columns. If this is not true, a single model persistent table cannot be representative of them all; it can improve the query plans in some executions of the application but degrade them in others.
Optimization does not necessarily need to be run whenever data is changed or added to the database. Optimization collects statistics that represent percentages of data in ranges and repetition factors. For instance, the statistics collected on employee gender show that 49% of the employees are female and 51% are male. Unless this percentage shifts dramatically, there is no need to rerun optimization on this column, even if the total number of employees changes.
You must rerun optimization if modifications to the database significantly alter the percentage distributions or repetition factors on which the statistics are based.
For example, if you ran complete statistics on the empno column early in your company's history, the repetition factor remains correct, because all employees still have unique employee numbers. But if you use ranges of employee numbers in any way, the histogram information becomes less accurate as new employees are added.
If your company originally had 100 employees (numbered 1 through 100), 10% of the employees have employee numbers greater than 90. If the company then hires an additional 100 employees (numbered 101 through 200), 55% of the employees (110 of 200) have employee numbers greater than 90, but the original histogram information does not reflect this.
Columns that show this type of "receding end" growth and are used in range queries may periodically need to have optimization rerun on them. (Exact match qualifications on employee number are not affected, because the information that all employee numbers are unique is still correct.)
Even if the statistics are not up-to-date, the query results are still correct; only the efficiency of the chosen query plans can suffer.
If statistics are available on a column referenced in a where clause, the query optimizer uses the information to choose the most efficient QEP. Understanding how this information is used can be helpful in analyzing query performance. For more information, see Query Execution Plans.
Two QEPs showing the effect of optimization are presented here. The first is a QEP before optimizing; the second shows the same query after optimization. The query used is a join, where both the r and s tables use the B-tree storage structure:
select * from r, s
where s.a > 4000 and r.a = s.a;
QEP Before Optimization
Before obtaining statistics, the optimizer chooses a full sort-merge (FSM) join, because it assumes that 10% of each table satisfies the qualification "a > 4000," as shown in the QEP diagram below:
QUERY PLAN 4,1, no timeout, of main query
FSM Join(a)
Heap
Pages 1 Tups 267
D15 C44
/ \
Proj-rest Proj-rest
Sorted(a) Sorted(a)
Pages 5 Tups 1235 Pages 1 Tups 267
D11 C12 D4 C3
/ /
r s
B-Tree(a) B-Tree (a)
Pages 172 Tups 12352 Pages 37 Tups 2666
QEP After Optimization
After obtaining statistics, the optimizer chooses a Key join, because only one row satisfies the qualification "a > 4000," as shown in the QEP diagram below:
QUERY PLAN 4,1, no timeout, of main query
K Join(a)
Heap
Pages 1 Tups 1
D4 C1
/ \
Proj-rest r
Sorted(a) B-Tree(a)
Pages 1 Tups 1 Pages 172 Tups 12352
D2 C1
/
s
B-Tree(a)
Pages 37 Tups 2666
The cost of the key join is significantly less than the cost of the FSM join because the join involves far fewer rows.