ANALYZE

Collects statistics about a database.

Synopsis

ANALYZE [VERBOSE] [ROOTPARTITION [ALL] ] 
   [table [ (column [, ...] ) ]]

Description

ANALYZE collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. Subsequently, Greenplum Database uses these statistics to help determine the most efficient execution plans for queries.

With no parameter, ANALYZE collects statistics for every table in the current database. You can specify a table name to collect statistics for a single table. You can specify a set of column names, in which case the statistics only for those columns are collected.

ANALYZE does not collect statistics on external tables.

Important: If you intend to execute queries on partitioned tables with the Pivotal Query Optimizer enabled, you must collect statistics on the root partition of the partitioned table with the ANALYZE ROOTPARTITION command. For information about the Pivotal Query Optimizer, see "Querying Data" in the Greenplum Database Administrator Guide.
Note: You can also use the Greenplum Database utility analyzedb to update table statistics. The analyzedb utility can update statistics for multiple tables concurrently. The utility can also check table statistics and update statistics only if the statistics are not current or do not exist. For information about the utility, see the Greenplum Database Utility Guide.

Parameters

ROOTPARTITION [ALL]
Collect statistics only on the root partition of partitioned tables. When you specify ROOTPARTITION, you must specify either ALL or the name of a partitioned table.
If you specify ALL with ROOTPARTITION, Greenplum Database collects statistics for the root partition of all partitioned tables in the database. If there are no partitioned tables in the database, a message stating that there are no partitioned tables is returned. For tables that are not partitioned tables, statistics are not collected. For tables that are not partitioned tables, statistics are not collected.
If you specify a table name with ROOTPARTITION and the table is not partitioned table, no statistics are collected for the table and a warning message is returned.
The ROOTPARTITION clause is not valid with VACUUM ANALYZE. The command VACUUM ANALYZE ROOTPARTITION returns an error.
For the partitioned table sales_curr_yr, this example command collects statistics only on the root partition of the partitioned table. ANALYZE ROOTPARTITION sales_curr_yr;
This example ANALYZE command collects statistics on the root partition of all the partitioned tables in the database.
ANALYZE ROOTPARTITION ALL;
VERBOSE
Enables display of progress messages. Enables display of progress messages. When specified, ANALYZE emits this information
  • The table that is being processed.
  • The query that is executed to generate the sample table.
  • The column for which statistics is being computed.
  • The queries that are issued to collect the different statistics for a single column.
  • The statistics that are generated.
table
The name (possibly schema-qualified) of a specific table to analyze. Defaults to all tables in the current database.
column
The name of a specific column to analyze. Defaults to all columns.

Notes

It is a good idea to run ANALYZE periodically, or just after making major changes in the contents of a table. Accurate statistics helps Greenplum Database choose the most appropriate query plan, and thereby improve the speed of query processing. A common strategy is to run VACUUM and ANALYZE once a day during a low-usage time of day.

ANALYZE requires only a read lock on the target table, so it can run in parallel with other activity on the table.

For a partitioned table, specifying which portion of the table to analyze, the root partition or subpartitions (leaf child tables) can be useful if the partitioned table has large number of partitions that have been analyzed and only a few leaf child tables have changed.

  • When you run ANALYZE on the root partitioned table, statistics are collected for all the leaf child tables (the lowest-level tables in the hierarchy of child tables created by Greenplum Database for use by the partitioned table).
  • When you run ANALYZE on a leaf child table, statistics are collected only for that leaf child table. When you run ANALYZE on a child table that is not a leaf child table, statistics are not collected.

    For example, you can create a partitioned table with partitions for the years 2000 to 2010 and subpartitions for each month in each year. If you run ANALYZE on the child table for the year 2005 no statistics are collected. If you run ANALYZE on the leaf child table for March of 2005, statistics are collected only for that leaf child table.

    Note: When you create a partitioned table with the CREATE TABLE command, Greenplum Database creates the table that you specify (the root partition or parent table), and also creates a hierarchy of tables based on the partition hierarchy that you specified (the child tables). Partitioned tables, child tables and their inheritance level relationships are tracked in the system view pg_partitions.

For a partitioned table that contains a leaf child partition that has been exchanged to use an external table, ANALYZE does not collect statistics for the external table partition:

  • If ANALYZE [ROOTPARTITION] is run, external table partitions are not analyzed and root table statistics do not include external table partition.
  • If ANALYZE is run on an external table partition, the partition is not analyzed.
  • If the VERBOSE clause is specified, an informational message is displayed: skipping external table.

The statistics collected by ANALYZE usually include a list of some of the most common values in each column and a histogram showing the approximate data distribution in each column. One or both of these may be omitted if ANALYZE deems them uninteresting (for example, in a unique-key column, there are no common values) or if the column data type does not support the appropriate operators.

For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This may result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the query optimizer to choose a different query plan between runs of ANALYZE. To avoid this, raise the amount of statistics collected by ANALYZE by adjusting the default_statistics_target configuration parameter, or on a column-by-column basis by setting the per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS (see ALTER TABLE). The target value sets the maximum number of entries in the most-common-value list and the maximum number of bins in the histogram. The default target value is 10, but this can be adjusted up or down to trade off accuracy of planner estimates against the time taken for ANALYZE and the amount of space occupied in pg_statistic. In particular, setting the statistics target to zero disables collection of statistics for that column. It may be useful to do that for columns that are never used as part of the WHERE, GROUP BY, or ORDER BY clauses of queries, since the planner will have no use for statistics on such columns.

The largest statistics target among the columns being analyzed determines the number of table rows sampled to prepare the statistics. Increasing the target causes a proportional increase in the time and space needed to do ANALYZE.

Examples

Collect statistics for the table mytable:

ANALYZE mytable;

Compatibility

There is no ANALYZE statement in the SQL standard.

See Also

ALTER TABLE, EXPLAIN, VACUUM, analyzedb utility in the Greenplum Database Utility Guide.