Checking for Uneven Data Distribution
All tables in Greenplum Database are distributed, meaning their data is divided across all of the segments in the system. If the data is not distributed evenly, then query processing performance may suffer. The following views can help diagnose if a table has uneven data distribution:
gp_skew_coefficients
This view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access
| Column | Description |
|---|---|
| skcoid | The object id of the table. |
| skcnamespace | The namespace where the table is defined. |
| skcrelname | The table name. |
| skccoeff | The coefficient of variation (CV) is calculated as the standard deviation divided by the average. It takes into account both the average and variability around the average of a data series. The lower the value, the better. Higher values indicate greater data skew. |
gp_skew_idle_fractions
This view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of processing data skew. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access
| Column | Description |
|---|---|
| sifoid | The object id of the table. |
| sifnamespace | The namespace where the table is defined. |
| sifrelname | The table name. |
| siffraction | The percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated. |