CREATE EXTERNAL TABLE

Defines a new external table.

Synopsis

CREATE [READABLE] EXTERNAL TABLE table_name     
    ( column_name data_type [, ...] | LIKE other_table )
      LOCATION ('file://seghost[:port]/path/file' [, ...])
        | ('gpfdist://filehost[:port]/file_pattern[#transform]' [, ...])
        | ('gpfdists://filehost[:port]/file_pattern[#transform]' [, ...])
        | ('gphdfs://hdfs_host[:port]/path/file')
      FORMAT 'TEXT' 
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'AVRO' 
           | 'PARQUET'

           | 'CUSTOM' (Formatter=<formatter specifications>)
     [ ENCODING 'encoding' ]
     [ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]

CREATE [READABLE] EXTERNAL WEB TABLE table_name     
   ( column_name data_type [, ...] | LIKE other_table )
      LOCATION ('http://webhost[:port]/path/file' [, ...])
    | EXECUTE 'command' [ON ALL 
                          | MASTER
                          | number_of_segments
                          | HOST ['segment_hostname'] 
                          | SEGMENT segment_id ]
      FORMAT 'TEXT' 
            [( [HEADER]
               [DELIMITER [AS] 'delimiter' | 'OFF']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CSV'
            [( [HEADER]
               [QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [FORCE NOT NULL column [, ...]]
               [ESCAPE [AS] 'escape']
               [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
               [FILL MISSING FIELDS] )]
           | 'CUSTOM' (Formatter=<formatter specifications>)
     [ ENCODING 'encoding' ]
     [ [LOG ERRORS [INTO error_table]] SEGMENT REJECT LIMIT count
       [ROWS | PERCENT] ]

CREATE WRITABLE EXTERNAL TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
     LOCATION ('gpfdist://outputhost[:port]/filename[#transform]' [, ...])
       | ('gpfdists://outputhost[:port]/file_pattern[#transform]' [, ...])
       | ('gphdfs://hdfs_host[:port]/path')
      FORMAT 'TEXT' 
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
                [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]
           | 'AVRO' 
           | 'PARQUET'

           | 'CUSTOM' (Formatter=<formatter specifications>)
    [ ENCODING 'write_encoding' ]
    [ DISTRIBUTED BY (column [, ...] ) | DISTRIBUTED RANDOMLY ]

CREATE WRITABLE EXTERNAL WEB TABLE table_name
    ( column_name data_type [, ...] | LIKE other_table )
    EXECUTE 'command' [ON ALL]
    FORMAT 'TEXT' 
               [( [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
               [ESCAPE [AS] 'escape' | 'OFF'] )]
          | 'CSV'
               [([QUOTE [AS] 'quote'] 
               [DELIMITER [AS] 'delimiter']
               [NULL [AS] 'null string']
                [FORCE QUOTE column [, ...]]
               [ESCAPE [AS] 'escape'] )]
           | 'CUSTOM' (Formatter=<formatter specifications>)
    [ ENCODING 'write_encoding' ]
    [ DISTRIBUTED BY (column [, ...] ) | DISTRIBUTED RANDOMLY ]

Description

See "Loading and Unloading Data" in the Greenplum Database Administrator Guide for detailed information about external tables.

CREATE EXTERNAL TABLE or CREATE EXTERNAL WEB TABLE creates a new readable external table definition in Greenplum Database. Readable external tables are typically used for fast, parallel data loading. Once an external table is defined, you can query its data directly (and in parallel) using SQL commands. For example, you can select, join, or sort external table data. You can also create views for external tables. DML operations (UPDATE, INSERT, DELETE, or TRUNCATE) are not allowed on readable external tables, and you cannot create indexes on readable external tables.

CREATE WRITABLE EXTERNAL TABLE or CREATE WRITABLE EXTERNAL WEB TABLE creates a new writable external table definition in Greenplum Database. Writable external tables are typically used for unloading data from the database into a set of files or named pipes. Writable external web tables can also be used to output data to an executable program. Writable external tables can also be used as output targets for Greenplum parallel MapReduce calculations. Once a writable external table is defined, data can be selected from database tables and inserted into the writable external table. Writable external tables only allow INSERT operations – SELECT, UPDATE, DELETE or TRUNCATE are not allowed.

The main difference between regular external tables and web external tables is their data sources. Regular readable external tables access static flat files, whereas web external tables access dynamic data sources – either on a web server or by executing OS commands or scripts.

The FORMAT clause is used to describe how the external table files are formatted. Valid file formats are delimited text (TEXT) and comma separated values (CSV) format, similar to the formatting options available with the PostgreSQL COPY command. If the data in the file does not use the default column delimiter, escape character, null string and so on, you must specify the additional formatting options so that the data in the external file is read correctly by Greenplum Database. For information about using a custom format, see "Loading and Unloading Data" in the Greenplum Database Administrator Guide.

For the gphdfs protocol, you can also specify AVRO or PARQUET in the FORMAT clause to read or write Avro or Parquet format files. For information about Avro and Parquet files, see HDFS File Format Support for the gphdfs Protocol.

Parameters

READABLE | WRITABLE
Specifies the type of external table; READABLE is the default. Readable external tables are used for loading data into Greenplum Database. Writable external tables are used for unloading data.
WEB
Creates a readable or writable web external table definition in Greenplum Database. There are two forms of readable web external tables – those that access files via the http:// protocol or those that access data by executing OS commands. Writable web external tables output data to an executable program that can accept an input stream of data. Web external tables are not rescannable during query execution.
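For example, a readable web external table that reads a CSV file served over HTTP might be defined as follows (the host, path, and column names are placeholders for illustration):
-- placeholder host, path, and column names
CREATE EXTERNAL WEB TABLE ext_expenses_web (name text, amount float4)
   LOCATION ('http://intranet.example.com/finance/expenses.csv')
   FORMAT 'CSV';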
table_name
The name of the new external table.
column_name
The name of a column to create in the external table definition. Unlike regular tables, external tables do not have column constraints or default values, so do not specify those.
LIKE other_table
The LIKE clause specifies a table from which the new external table automatically copies all column names, data types and Greenplum distribution policy. If the original table specifies any column constraints or default column values, those will not be copied over to the new external table definition.
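For example, assuming an existing table named expenses, this sketch copies its column definitions into a new readable external table (the gpfdist host and file pattern are placeholders):
-- expenses, etlhost, and the file pattern are placeholders
CREATE EXTERNAL TABLE ext_expenses_like (LIKE expenses)
   LOCATION ('gpfdist://etlhost:8081/expenses*.txt')
   FORMAT 'TEXT' (DELIMITER '|');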
data_type
The data type of the column.
LOCATION ('protocol://host[:port]/path/file' [, ...])
For readable external tables, specifies the URI of the external data source(s) to be used to populate the external table or web table. Regular readable external tables allow the gpfdist or file protocols. Web external tables allow the http protocol. If port is omitted, port 8080 is assumed for http and gpfdist protocols, and port 9000 for the gphdfs protocol. If using the gpfdist protocol, the path is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). Also, gpfdist can use wildcards (or other C-style pattern matching) to denote multiple files in a directory. For example:
'gpfdist://filehost:8081/*'
'gpfdist://masterhost/my_load_file'
'file://seghost1/dbfast1/external/myfile.txt'
'http://intranet.mycompany.com/finance/expenses.csv'
If you are using MapR clusters with the gphdfs protocol, you specify a specific cluster and the file:
  • To specify the default cluster (the first entry in the MapR configuration file /opt/mapr/conf/mapr-clusters.conf), specify the location of your table with this syntax:
     LOCATION ('gphdfs:///file_path')
    The file_path is the path to the file.
  • To specify another MapR cluster listed in the configuration file, specify the file with this syntax:
     LOCATION ('gphdfs:///mapr/cluster_name/file_path')
    The cluster_name is the name of the cluster specified in the configuration file and file_path is the path to the file.
For writable external tables, specifies the URI location of the gpfdist process that will collect data output from the Greenplum segments and write it to the named file. The path is relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program). If multiple gpfdist locations are listed, the segments sending data will be evenly divided across the available output locations. For example:
'gpfdist://outputhost:8081/data1.out',
'gpfdist://outputhost:8081/data2.out'
With two gpfdist locations listed as in the above example, half of the segments would send their output data to the data1.out file and the other half to the data2.out file.
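For example, a writable external table that spreads its output across the two locations above might be defined as follows (the source table sales and the host are placeholders for illustration):
-- sales and outputhost are placeholders
CREATE WRITABLE EXTERNAL TABLE wex_sales (LIKE sales)
   LOCATION ('gpfdist://outputhost:8081/data1.out',
             'gpfdist://outputhost:8081/data2.out')
   FORMAT 'TEXT' (DELIMITER '|');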
If you specify the gphdfs protocol to read files from or write files to a Hadoop file system (HDFS), you can read or write an Avro or Parquet format file by specifying either AVRO or PARQUET in the FORMAT clause.
For information about the options when specifying a location for an Avro or Parquet file, see HDFS File Format Support for the gphdfs Protocol.
EXECUTE 'command' [ON ...]
Allowed for readable web external tables or writable external tables only. For readable web external tables, specifies the OS command to be executed by the segment instances. The command can be a single OS command or a script. The ON clause is used to specify which segment instances will execute the given command.
  • ON ALL is the default. The command will be executed by every active (primary) segment instance on all segment hosts in the Greenplum Database system. If the command executes a script, that script must reside in the same location on all of the segment hosts and be executable by the Greenplum superuser (gpadmin).
  • ON MASTER runs the command on the master host only.
    Note: Logging is not supported for web external tables when the ON MASTER clause is specified.
  • ON number means the command will be executed by the specified number of segments. The particular segments are chosen randomly at runtime by the Greenplum Database system. If the command executes a script, that script must reside in the same location on all of the segment hosts and be executable by the Greenplum superuser (gpadmin).
  • HOST means the command will be executed by one segment on each segment host (once per segment host), regardless of the number of active segment instances per host.
  • HOST segment_hostname means the command will be executed by all active (primary) segment instances on the specified segment host.
  • SEGMENT segment_id means the command will be executed only once by the specified segment. You can determine a segment instance's ID by looking at the content number in the system catalog table gp_segment_configuration. The content ID of the Greenplum Database master is always -1.
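For example, these sketches run a command on the master host only and on a single segment, respectively (the table names, columns, commands, and paths are placeholders for illustration):
-- placeholder table name; runs the command on the master host only
CREATE EXTERNAL WEB TABLE ext_master_uptime (output text)
   EXECUTE 'uptime' ON MASTER
   FORMAT 'TEXT';

-- placeholder table name and path; runs the command only on segment 0
CREATE EXTERNAL WEB TABLE ext_seg0_listing (file_listing text)
   EXECUTE 'ls -l /data/primary' ON SEGMENT 0
   FORMAT 'TEXT';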

For writable external tables, the command specified in the EXECUTE clause must be prepared to have data piped into it. Since all segments that have data to send will write their output to the specified command or program, the only available option for the ON clause is ON ALL.

FORMAT 'TEXT | CSV | AVRO | PARQUET' (options)
Specifies the format of the external or web table data - either plain text (TEXT) or comma separated values (CSV) format.
The AVRO and PARQUET formats are supported only with the gphdfs protocol.
For information about the options when specifying the AVRO and PARQUET file format, see HDFS File Format Support for the gphdfs Protocol.
DELIMITER
Specifies a single ASCII character that separates columns within each row (line) of data. The default is a tab character in TEXT mode, a comma in CSV mode. In TEXT mode for readable external tables, the delimiter can be set to OFF for special use cases in which unstructured data is loaded into a single-column table.
NULL
Specifies the string that represents a NULL value. The default is \N (backslash-N) in TEXT mode, and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish NULL values from empty strings. When using external and web tables, any data item that matches this string will be considered a NULL value.
As an example for the text format, this FORMAT clause can be used to specify that the string of two single quotes ('') is a NULL value.
FORMAT 'text' (delimiter ',' null '\'\'\'\'' )
ESCAPE
Specifies the single character that is used for C escape sequences (such as \n,\t,\100, and so on) and for escaping data characters that might otherwise be taken as row or column delimiters. Make sure to choose an escape character that is not used anywhere in your actual column data. The default escape character is a \ (backslash) for text-formatted files and a " (double quote) for csv-formatted files, however it is possible to specify another character to represent an escape. It is also possible to disable escaping in text-formatted files by specifying the value 'OFF' as the escape value. This is very useful for data such as text-formatted web log data that has many embedded backslashes that are not intended to be escapes.
NEWLINE
Specifies the newline used in your data files – LF (Line feed, 0x0A), CR (Carriage return, 0x0D), or CRLF (Carriage return plus line feed, 0x0D 0x0A). If not specified, a Greenplum Database segment will detect the newline type by looking at the first row of data it receives and using the first newline type encountered.
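For example, this FORMAT clause reads pipe-delimited text data that uses Windows-style line endings (the delimiter choice is illustrative):
FORMAT 'TEXT' (DELIMITER '|' NEWLINE 'CRLF')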
HEADER
For readable external tables, specifies that the first line in the data file(s) is a header row (contains the names of the table columns) and should not be included as data for the table. If using multiple data source files, all files must have a header row.
QUOTE
Specifies the quotation character for CSV mode. The default is double-quote (").
FORCE NOT NULL
In CSV mode, processes each specified column as though it were quoted and hence not a NULL value. For the default null string in CSV mode (nothing between two delimiters), this causes missing values to be evaluated as zero-length strings.
FORCE QUOTE
In CSV mode for writable external tables, forces quoting to be used for all non-NULL values in each specified column. NULL output is never quoted.
FILL MISSING FIELDS
In both TEXT and CSV mode for readable external tables, specifying FILL MISSING FIELDS will set missing trailing field values to NULL (instead of reporting an error) when a row of data has missing data fields at the end of a line or row. Blank rows, fields with a NOT NULL constraint, and trailing delimiters on a line will still report an error.
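For example, this FORMAT clause loads pipe-delimited text and sets missing trailing fields to NULL instead of rejecting the row (the delimiter choice is illustrative):
FORMAT 'TEXT' (DELIMITER '|' FILL MISSING FIELDS)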
ENCODING 'encoding'
Character set encoding to use for the external table. Specify a string constant (such as 'SQL_ASCII'), an integer encoding number, or DEFAULT to use the default client encoding. See Character Set Support.
LOG ERRORS [INTO error_table]
This is an optional clause that can precede a SEGMENT REJECT LIMIT clause to log information about rows with formatting errors. The optional INTO error_table clause specifies an error table where rows with formatting errors will be logged when running in single row error isolation mode.
If the INTO error_table clause is not specified, the error log information is stored internally (not in an error table). Error log information that is stored internally is accessed with the Greenplum Database built-in SQL function gp_read_error_log().
If the error_table specified already exists, it is used. If it does not exist, it is created. If error_table exists and does not have a random distribution (the DISTRIBUTED RANDOMLY clause was not specified when creating the table), an error is returned.
See Notes for information about the error log information and built-in functions for viewing and managing error log information.
The error_table cannot be modified with the ALTER TABLE command.
Note: The optional INTO error_table clause is deprecated and will not be supported in the next major release. Only internal error logs will be supported.
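For example, this sketch logs formatting errors internally (no INTO clause) and then views them with gp_read_error_log(); the table name and gpfdist location are placeholders for illustration:
-- placeholder table name and gpfdist location
CREATE EXTERNAL TABLE ext_orders (id int, amount numeric)
   LOCATION ('gpfdist://filehost:8081/orders*.txt')
   FORMAT 'TEXT' (DELIMITER '|')
   LOG ERRORS SEGMENT REJECT LIMIT 10 ROWS;

SELECT * FROM gp_read_error_log('ext_orders');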
SEGMENT REJECT LIMIT count [ROWS | PERCENT]
Runs a COPY FROM operation in single row error isolation mode. If the input rows have format errors they will be discarded provided that the reject limit count is not reached on any Greenplum segment instance during the load operation. The reject limit count can be specified as number of rows (the default) or percentage of total rows (1-100). If PERCENT is used, each segment starts calculating the bad row percentage only after the number of rows specified by the parameter gp_reject_percent_threshold has been processed. The default for gp_reject_percent_threshold is 300 rows. Constraint errors such as violation of a NOT NULL, CHECK, or UNIQUE constraint will still be handled in "all-or-nothing" input mode. If the limit is not reached, all good rows will be loaded and any error rows discarded.
Note: When reading an external table, Greenplum Database limits the initial number of rows that can contain formatting errors if the SEGMENT REJECT LIMIT is not triggered first or is not specified. If the first 1000 rows are rejected, the COPY operation is stopped and rolled back.

The limit for the number of initial rejected rows can be changed with the Greenplum Database server configuration parameter gp_initial_bad_row_limit. See Server Configuration Parameters for information about the parameter.
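
For example, assuming the parameter can be changed at the session level, a load session might raise the limit before querying the external table (the value shown is illustrative):

-- assumes session-level changes to this parameter are allowed
SET gp_initial_bad_row_limit = 10000;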

DISTRIBUTED BY (column [, ...] )
DISTRIBUTED RANDOMLY
Used to declare the Greenplum Database distribution policy for a writable external table. By default, writable external tables are distributed randomly. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table will improve unload performance by eliminating the need to move rows over the interconnect. When you issue an unload command such as INSERT INTO wex_table SELECT * FROM source_table, the rows that are unloaded can be sent directly from the segments to the output location if the two tables have the same hash distribution policy.

Examples

Start the gpfdist file server program in the background on port 8081 serving files from directory /var/data/staging:

gpfdist -p 8081 -d /var/data/staging -l /home/gpadmin/log &

Create a readable external table named ext_customer using the gpfdist protocol and any text formatted files (*.txt) found in the gpfdist directory. The files are formatted with a pipe (|) as the column delimiter and an empty space as NULL. Also access the external table in single row error isolation mode:

CREATE EXTERNAL TABLE ext_customer
   (id int, name text, sponsor text) 
   LOCATION ( 'gpfdist://filehost:8081/*.txt' ) 
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   LOG ERRORS INTO err_customer SEGMENT REJECT LIMIT 5;

Create the same readable external table definition as above, but with CSV formatted files:

CREATE EXTERNAL TABLE ext_customer 
   (id int, name text, sponsor text) 
   LOCATION ( 'gpfdist://filehost:8081/*.csv' ) 
   FORMAT 'CSV' ( DELIMITER ',' );

Create a readable external table named ext_expenses using the file protocol and several CSV formatted files that have a header row:

CREATE EXTERNAL TABLE ext_expenses (name text, date date, 
amount float4, category text, description text) 
LOCATION ( 
'file://seghost1/dbfast/external/expenses1.csv',
'file://seghost1/dbfast/external/expenses2.csv',
'file://seghost2/dbfast/external/expenses3.csv',
'file://seghost2/dbfast/external/expenses4.csv',
'file://seghost3/dbfast/external/expenses5.csv',
'file://seghost3/dbfast/external/expenses6.csv' 
)
FORMAT 'CSV' ( HEADER );

Create a readable web external table that executes a script once per segment host:

CREATE EXTERNAL WEB TABLE log_output (linenum int, message 
text)  EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST 
 FORMAT 'TEXT' (DELIMITER '|');

Create a writable external table named sales_out that uses gpfdist to write output data to a file named sales.out. The files are formatted with a pipe (|) as the column delimiter and an empty space as NULL.

CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales) 
   LOCATION ('gpfdist://etl1:8081/sales.out')
   FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
   DISTRIBUTED BY (txn_id);

Create a writable external web table that pipes output data received by the segments to an executable script named to_adreport_etl.sh:

CREATE WRITABLE EXTERNAL WEB TABLE campaign_out 
(LIKE campaign) 
 EXECUTE '/var/unload_scripts/to_adreport_etl.sh'
 FORMAT 'TEXT' (DELIMITER '|');

Use the writable external table defined above to unload selected data:

INSERT INTO campaign_out SELECT * FROM campaign WHERE 
customer_id=123;

Notes

Greenplum Database can log information about rows with formatting errors in an error table or internally. When you specify LOG ERRORS INTO error_table, Greenplum Database creates the table error_table that contains errors that occur while reading the external table. The table is defined as follows:

CREATE TABLE error_table_name ( cmdtime timestamptz, relname text, 
    filename text, linenum int, bytenum int, errmsg text, 
    rawdata text, rawbytes bytea ) DISTRIBUTED RANDOMLY;

You can view the information in the table with SQL commands.

For error log data that is stored internally when the INTO error_table is not specified:
  • Use the built-in SQL function gp_read_error_log('table_name'). It requires SELECT privilege on table_name.

    This example displays the error log information for data copied into the table ext_expenses:

    SELECT * from gp_read_error_log('ext_expenses');

    The error log contains the same columns as the error table.

    The function returns FALSE if table_name does not exist.

  • If error log data exists for the specified table, the new error log data is appended to existing error log data. The error log information is not replicated to mirror segments.
  • Use the built-in SQL function gp_truncate_error_log('table_name') to delete the error log data for table_name. It requires the table owner privilege. This example deletes the error log information captured when moving data into the table ext_expenses:
    SELECT gp_truncate_error_log('ext_expenses'); 

    The function returns FALSE if table_name does not exist.

    Specify the * wildcard character to delete error log information for existing tables in the current database. Specify the string *.* to delete all database error log information, including error log information that was not deleted due to previous database issues. If * is specified, database owner privilege is required. If *.* is specified, database super-user privilege is required.
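
    For example, these commands delete error log information for all tables in the current database and for all databases, respectively:

    SELECT gp_truncate_error_log('*');
    SELECT gp_truncate_error_log('*.*');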

HDFS File Format Support for the gphdfs Protocol

If you use the gphdfs protocol to read files from or write files to a Hadoop file system (HDFS), you can read or write an Avro or Parquet format file by specifying the file format with the FORMAT clause.

To read data from or write data to an Avro or Parquet file, you create an external table with the CREATE EXTERNAL TABLE command and specify the location of the Avro file in the LOCATION clause and 'AVRO' in the FORMAT clause. This example is for a readable external table that reads from an Avro file.
CREATE EXTERNAL TABLE tablename (column_spec) LOCATION ( 'gphdfs://location') FORMAT 'AVRO' 
The location can be a file name or a directory containing a set of files. For the file name you can specify the wildcard character * to match any number of characters. When reading, if the location specifies multiple files, Greenplum Database uses the schema in the first file that is read as the schema for the other files.

As part of the location parameter you can specify options for reading or writing the file. After the file name, you can specify parameters with the HTTP query string syntax that starts with ? and uses & between field=value pairs.

In this example location parameter, the URI sets compression parameters for an Avro format writable external table.
'gphdfs://myhdfs:8081/avro/singleAvro/array2.avro?compress=true&compression_type=block&codec=snappy' FORMAT 'AVRO'

See "Loading and Unloading Data" in the Greenplum Database Administrator Guide for information about reading and writing the Avro and Parquet format files with external tables.

Avro Files

For readable external tables, the only valid parameter is schema. When reading multiple Avro files, you can specify a file that contains an Avro schema. See "Avro Schema Overrides" in the Greenplum Database Administrator Guide.

For writable external tables, you can specify schema, namespace, and parameters for compression.

Table 1. Avro Format External Table location Parameters

schema
  Value: URL_to_schema_file. Readable and writable. Default: none.
  For a readable external table:
    • The specified schema overrides the schema in the Avro file. See "Avro Schema Overrides".
    • If not specified, Greenplum Database uses the Avro file schema.
  For a writable external table:
    • Uses the specified schema when creating the Avro file.
    • If not specified, Greenplum Database creates a schema according to the external table definition.
namespace
  Value: avro_namespace. Write only. Default: public.avro.
  If specified, must be a valid Avro namespace.
compress
  Value: true or false. Write only. Default: false.
compression_type
  Value: block. Write only. Optional; for the Avro format, compression_type must be block if compress is true.
codec
  Value: deflate or snappy. Write only. Default: deflate.
codec_level (deflate codec only)
  Value: an integer between 1 and 9. Write only. Default: 6.
  The level controls the trade-off between speed and compression. Valid values are 1 to 9, where 1 is the fastest and 9 is the most compressed.

This set of parameters specifies snappy compression:
 'compress=true&codec=snappy'

These two sets of parameters specify deflate compression and are equivalent:

 'compress=true&codec=deflate&codec_level=1'
 'compress=true&codec_level=1'
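
Putting these pieces together, a writable external table that writes snappy-compressed Avro data might be defined as follows (the HDFS host, path, and column names are placeholders for illustration):

-- placeholder HDFS host, path, and column names
CREATE WRITABLE EXTERNAL TABLE wex_avro_out (id int, name text)
   LOCATION ('gphdfs://hdfshost:8081/data/out.avro?compress=true&compression_type=block&codec=snappy')
   FORMAT 'AVRO';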

Parquet Files

For external tables, you can add parameters after the file specified in the location. This table lists the valid parameters and values.

Table 2. Parquet Format External Table location Parameters

schema
  Value: URL_to_schema. Write only. Default: none.
  If not specified, Greenplum Database creates a schema according to the external table definition.
pagesize
  Value: greater than 1024 bytes. Write only. Default: 1 MB.
rowgroupsize
  Value: greater than 1024 bytes. Write only. Default: 8 MB.
version
  Value: v1 or v2. Write only. Default: v1.
codec
  Value: UNCOMPRESSED, GZIP, LZO, or snappy. Write only. Default: UNCOMPRESSED.
dictionaryenable (see Note)
  Value: true or false. Write only. Default: false.
dictionarypagesize (see Note)
  Value: greater than 1024 bytes. Write only. Default: 512 KB.
Note: Enabling a dictionary creates an internal dictionary. A dictionary might improve Parquet file compression if text columns contain similar or duplicate data.
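
As a sketch, a writable external table that writes GZIP-compressed Parquet data with a larger row group size might look like this (the HDFS host, path, column names, and parameter values are placeholders for illustration):

-- placeholder HDFS host, path, and column names; rowgroupsize is in bytes (32 MB here)
CREATE WRITABLE EXTERNAL TABLE wex_parquet_out (id int, name text)
   LOCATION ('gphdfs://hdfshost:8081/data/out.parquet?codec=GZIP&rowgroupsize=33554432')
   FORMAT 'PARQUET';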

Compatibility

CREATE EXTERNAL TABLE is a Greenplum Database extension. The SQL standard makes no provisions for external tables.