Initializing a Greenplum Database System
This chapter describes how to initialize a Greenplum Database system. The instructions assume you have already installed the Greenplum Database software on all of the hosts in the system according to the instructions in Configuring Your Systems and Installing Greenplum.
Overview
Because Greenplum Database is distributed, the process for initializing a Greenplum Database management system (DBMS) involves initializing several individual PostgreSQL database instances (called segment instances in Greenplum).
Each database instance (the master and all segments) must be initialized across all of the hosts in the system in such a way that they can all work together as a unified DBMS. Greenplum provides its own version of initdb called gpinitsystem, which takes care of initializing the database on the master and on each segment instance, and starting each instance in the correct order.
After the Greenplum Database system has been initialized and started, you can create and manage databases as you would in a regular PostgreSQL DBMS by connecting to the Greenplum master.
Initializing Greenplum Database
These are the high-level tasks for initializing Greenplum Database:
- Make sure you have completed all of the installation tasks described in Configuring Your Systems and Installing Greenplum.
- Create a host file that contains the host addresses of your segments. See Creating the Initialization Host File.
- Create your Greenplum Database system configuration file. See Creating the Greenplum Database Configuration File.
- By default, Greenplum Database will be initialized using the locale of the master host system. Make sure this is the correct locale you want to use, as some locale options cannot be changed after initialization. See Configuring Localization Settings for more information.
- Run the Greenplum Database initialization utility on the master host. See Running the Initialization Utility.
Creating the Initialization Host File
The gpinitsystem utility requires a host file that contains the list of addresses for each segment host. The initialization utility determines the number of segment instances per host by the number of host addresses listed per host times the number of data directory locations specified in the gpinitsystem_config file.
This file should only contain segment host addresses (not the master or standby master). For segment machines with more than one network interface, this file should list the host address names for each interface — one per line.
To create the initialization host file
- Log in as gpadmin.
$ su - gpadmin
- Create a file named hostfile_gpinitsystem. In this file add the host address name(s) of your segment host interfaces, one name per line, no extra lines or spaces. For example, if you have four segment hosts with two network interfaces each:
sdw1-1
sdw1-2
sdw2-1
sdw2-2
sdw3-1
sdw3-2
sdw4-1
sdw4-2
- Save and close the file.
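For larger clusters, the host file can be generated rather than typed. The following is a minimal sketch, assuming your interface names follow the sdw<host>-<interface> pattern used in the example above; the function names write_hostfile and hostfile_is_clean are illustrative and not part of Greenplum. The demo writes to a temporary directory so it does not touch any real file.

```shell
# Sketch: generate a host file for N segment hosts with M interfaces each.
# Assumption: interface names follow the sdw<host>-<interface> pattern.
write_hostfile() {
    local hosts ifaces out h i
    hosts=$1; ifaces=$2; out=$3
    : > "$out"                          # create or truncate the file
    h=1
    while [ "$h" -le "$hosts" ]; do
        i=1
        while [ "$i" -le "$ifaces" ]; do
            printf 'sdw%d-%d\n' "$h" "$i" >> "$out"
            i=$((i + 1))
        done
        h=$((h + 1))
    done
}

# Succeeds only if the file has no empty lines and no whitespace on any line.
hostfile_is_clean() {
    ! grep -q -E '^$|[[:space:]]' "$1"
}

demo_dir=$(mktemp -d)
write_hostfile 4 2 "$demo_dir/hostfile_gpinitsystem"
hostfile_is_clean "$demo_dir/hostfile_gpinitsystem" && echo "host file looks clean"
cat "$demo_dir/hostfile_gpinitsystem"
```

The cleanliness check mirrors the "no extra lines or spaces" requirement in the step above and can be run against a hand-edited file as well.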
Creating the Greenplum Database Configuration File
Your Greenplum Database configuration file tells the gpinitsystem utility how you want to configure your Greenplum Database system. An example configuration file can be found in $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_config.
To create a gpinitsystem_config file
- Log in as gpadmin.
$ su - gpadmin
- Make a copy of the gpinitsystem_config file to use as a starting point. For example:
$ cp $GPHOME/docs/cli_help/gpconfigs/gpinitsystem_config /home/gpadmin/gpconfigs/gpinitsystem_config
- Open the file you just copied in a text editor.
Set all of the required parameters according to your environment. See gpinitsystem for more information. A Greenplum Database system must contain a master instance and at least two segment instances (even if setting up a single node system).
The DATA_DIRECTORY parameter is what determines how many segments per host will be created. If your segment hosts have multiple network interfaces, and you used their interface address names in your host file, the number of segments will be evenly spread over the number of available interfaces.
Here is an example of the required parameters in the gpinitsystem_config file:
ARRAY_NAME="EMC Greenplum DW"
SEG_PREFIX=gpseg
PORT_BASE=40000
declare -a DATA_DIRECTORY=(/data1/primary /data1/primary /data1/primary /data2/primary /data2/primary /data2/primary)
MASTER_HOSTNAME=mdw
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENT=8
ENCODING=UNICODE
- (Optional) If you want to deploy mirror segments, uncomment and set the mirroring parameters according to your environment. Here is an example of the optional mirror parameters in the gpinitsystem_config file:
MIRROR_PORT_BASE=50000
REPLICATION_PORT_BASE=41000
MIRROR_REPLICATION_PORT_BASE=51000
declare -a MIRROR_DATA_DIRECTORY=(/data1/mirror /data1/mirror /data1/mirror /data2/mirror /data2/mirror /data2/mirror)
Note: You can initialize your Greenplum system with primary segments only and deploy mirrors later using the gpaddmirrors utility.
- Save and close the file.
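Before running the initialization utility, it can help to confirm that the edited file still sets every parameter shown in the required-parameters example. The sketch below is one way to do that with grep and sed; the function names check_gpinit_config and count_data_dirs are hypothetical helpers, and the check assumes each parameter is written at the start of a line as in the example. The sample file is written to a temporary location for demonstration only.

```shell
# Sketch: verify that a gpinitsystem_config file sets the required parameters.
# Assumption: each parameter starts its own line, as in the example above.
check_gpinit_config() {
    local cfg name missing
    cfg=$1
    missing=0
    for name in ARRAY_NAME SEG_PREFIX PORT_BASE DATA_DIRECTORY MASTER_HOSTNAME \
                MASTER_DIRECTORY MASTER_PORT TRUSTED_SHELL CHECK_POINT_SEGMENT ENCODING; do
        # Accept "NAME=value" or "declare -a NAME=(...)"; commented lines do not match.
        if ! grep -q -E "^(declare -a )?${name}=." "$cfg"; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Print the number of whitespace-separated entries inside DATA_DIRECTORY=( ... ).
count_data_dirs() {
    sed -n 's/^declare -a DATA_DIRECTORY=(\(.*\)).*$/\1/p' "$1" | wc -w | tr -d ' '
}

# Demo with the example values from this section.
sample_cfg=$(mktemp)
cat > "$sample_cfg" <<'EOF'
ARRAY_NAME="EMC Greenplum DW"
SEG_PREFIX=gpseg
PORT_BASE=40000
declare -a DATA_DIRECTORY=(/data1/primary /data1/primary /data1/primary /data2/primary /data2/primary /data2/primary)
MASTER_HOSTNAME=mdw
MASTER_DIRECTORY=/data/master
MASTER_PORT=5432
TRUSTED_SHELL=ssh
CHECK_POINT_SEGMENT=8
ENCODING=UNICODE
EOF
check_gpinit_config "$sample_cfg" && echo "primary data directories: $(count_data_dirs "$sample_cfg")"
```

The check only confirms that each parameter is present. It does not validate the values, which gpinitsystem verifies itself during its pre-checks.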
Running the Initialization Utility
The gpinitsystem utility will create a Greenplum Database system using the values defined in the configuration file.
To run the initialization utility
- Run the following command referencing the path and file name of your initialization configuration file (gpinitsystem_config) and host file (hostfile_gpinitsystem). For example:
$ cd ~
$ gpinitsystem -c gpconfigs/gpinitsystem_config -h gpconfigs/hostfile_gpinitsystem
For a fully redundant system (with a standby master and a spread mirror configuration) include the -s and -S options. For example:
$ gpinitsystem -c gpconfigs/gpinitsystem_config -h gpconfigs/hostfile_gpinitsystem -s standby_master_hostname -S
- The utility will verify your setup information and make sure it can connect to each host and access the data directories specified in your configuration. If all of the pre-checks are successful, the utility will prompt you to confirm your configuration. For example:
=> Continue with Greenplum creation? Yy/Nn
- Press y to start the initialization.
- The utility will then begin setup and initialization of the master instance and each segment instance in the system. Each segment instance is set up in parallel. Depending on the number of segments, this process can take a while.
- At the end of a successful setup, the utility will start your Greenplum Database system. You should see:
=> Greenplum Database instance successfully created.
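If you script the invocation, a small dry-run helper can catch a mistyped path before the utility starts. The sketch below only assembles and prints the command using the -c, -h, -s and -S options described above; it does not run gpinitsystem. The function name build_gpinit_cmd and the standby host name smdw are hypothetical, and paths are assumed to contain no spaces.

```shell
# Sketch: assemble the gpinitsystem command line and print it for review.
# Nothing is executed; run the printed command yourself once it looks right.
build_gpinit_cmd() {
    local cfg hostfile standby f cmd
    cfg=$1; hostfile=$2; standby=${3:-}
    for f in "$cfg" "$hostfile"; do
        [ -r "$f" ] || { echo "cannot read: $f" >&2; return 1; }
    done
    cmd="gpinitsystem -c $cfg -h $hostfile"
    # -s names the standby master and -S requests spread mirrors, as described above.
    if [ -n "$standby" ]; then
        cmd="$cmd -s $standby -S"
    fi
    printf '%s\n' "$cmd"
}

# Demo with empty placeholder files and a hypothetical standby host name.
demo=$(mktemp -d)
touch "$demo/gpinitsystem_config" "$demo/hostfile_gpinitsystem"
build_gpinit_cmd "$demo/gpinitsystem_config" "$demo/hostfile_gpinitsystem" smdw
```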
Troubleshooting Initialization Problems
If the utility encounters any errors while setting up an instance, the entire process will fail, and could possibly leave you with a partially created system. Refer to the error messages and logs to determine the cause of the failure and where in the process the failure occurred. Log files are created in ~/gpAdminLogs.
Depending on when the error occurred in the process, you may need to clean up and then try the gpinitsystem utility again. For example, if some segment instances were created and some failed, you may need to stop postgres processes and remove any utility-created data directories from your data storage area(s). A backout script is created to help with this cleanup if necessary.
Using the Backout Script
If the gpinitsystem utility fails, it will create the following backout script if it has left your system in a partially installed state:
~/gpAdminLogs/backout_gpinitsystem_<user>_<timestamp>
You can use this script to clean up a partially created Greenplum Database system. This backout script will remove any utility-created data directories, postgres processes, and log files. After correcting the error that caused gpinitsystem to fail and running the backout script, you should be ready to retry initializing your Greenplum Database array.
The following example shows how to run the backout script:
$ sh backout_gpinitsystem_gpadmin_20071031_121053
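After several failed attempts the log directory may hold more than one backout script. The following sketch, assuming the ~/gpAdminLogs location and backout_gpinitsystem_ file name prefix described above, prints the most recently modified one so you can review it before running it. The function name latest_backout is illustrative.

```shell
# Sketch: print the most recently modified backout script in a log directory.
# The directory defaults to ~/gpAdminLogs, as described above.
latest_backout() {
    local logdir
    logdir=${1:-$HOME/gpAdminLogs}
    # ls -t sorts newest first; errors are silenced when no script exists.
    ls -t "$logdir"/backout_gpinitsystem_* 2>/dev/null | head -n 1
}

script=$(latest_backout)
if [ -n "$script" ]; then
    echo "Review, then run: sh $script"
else
    echo "No backout script found"
fi
```

The sketch deliberately prints the command instead of executing it, since the backout script removes data directories and stops postgres processes.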
Setting Greenplum Environment Variables
You must configure your environment on the Greenplum Database master (and standby master). A greenplum_path.sh file is provided in your $GPHOME directory with environment variable settings for Greenplum Database. You can source this file in the gpadmin user's startup shell profile (such as .bashrc).
The Greenplum Database management utilities also require that the MASTER_DATA_DIRECTORY environment variable be set. This should point to the directory created by the gpinitsystem utility in the master data directory location.
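The example values used in this chapter suggest how the directory name is formed: the MASTER_DIRECTORY location followed by the SEG_PREFIX value and a -1 suffix. The sketch below reproduces that composition with the sample values from the configuration file; treat it as an illustration and confirm the actual directory on your master host.

```shell
# Sketch: compose the master data directory path from the example config values.
# Assumption: the directory is named <MASTER_DIRECTORY>/<SEG_PREFIX>-1,
# which matches the sample values used in this chapter.
MASTER_DIRECTORY=/data/master
SEG_PREFIX=gpseg
MASTER_DATA_DIRECTORY="$MASTER_DIRECTORY/${SEG_PREFIX}-1"
echo "$MASTER_DATA_DIRECTORY"
```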
To set up your user environment for Greenplum
- Make sure you are logged in as gpadmin:
$ su - gpadmin
- Open your profile file (such as .bashrc) in a text editor. For example:
$ vi ~/.bashrc
- Add lines to this file to source the greenplum_path.sh file and set the MASTER_DATA_DIRECTORY environment variable. For example:
source /usr/local/greenplum-db/greenplum_path.sh
export MASTER_DATA_DIRECTORY=/data/master/gpseg-1
- (Optional) You may also want to set some client session environment variables such as PGPORT, PGUSER and PGDATABASE for convenience. For example:
export PGPORT=5432
export PGUSER=gpadmin
export PGDATABASE=default_login_database_name
- Save and close the file.
- After editing the profile file, source it to make the changes active. For example:
$ source ~/.bashrc
- If you have a standby master host, copy your environment file to the standby master as well. For example:
$ cd ~
$ scp .bashrc standby_hostname:`pwd`
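Once the profile has been sourced, a quick check can confirm that the variables are visible in the current shell. This is a minimal sketch: the function name check_gp_env is hypothetical, and it assumes that sourcing greenplum_path.sh sets GPHOME, which this chapter references as $GPHOME.

```shell
# Sketch: report whether the Greenplum environment variables are set.
# Assumption: sourcing greenplum_path.sh sets GPHOME.
check_gp_env() {
    local status
    status=0
    if [ -z "${GPHOME:-}" ]; then
        echo "GPHOME is not set; was greenplum_path.sh sourced?" >&2
        status=1
    fi
    if [ -z "${MASTER_DATA_DIRECTORY:-}" ]; then
        echo "MASTER_DATA_DIRECTORY is not set" >&2
        status=1
    elif [ ! -d "$MASTER_DATA_DIRECTORY" ]; then
        echo "MASTER_DATA_DIRECTORY does not exist: $MASTER_DATA_DIRECTORY" >&2
        status=1
    fi
    return "$status"
}

if check_gp_env; then
    echo "environment looks ready"
else
    echo "environment incomplete"
fi
```

Running the same check on the standby master after copying the profile file helps confirm that both hosts are configured consistently.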
Next Steps
After your system is up and running, the next steps are:
Allowing Client Connections
After a Greenplum Database system is first initialized, it allows only local connections to the database from the gpadmin role (or whatever system user ran gpinitsystem). If you would like other users or client machines to be able to connect to Greenplum Database, you must give them access. See the Greenplum Database Administrator Guide for more information.
Creating Databases and Loading Data
After verifying your installation, you may want to begin creating databases and loading data. See the Greenplum Database Administrator Guide for more information about creating databases, schemas, tables, and other database objects in Greenplum Database and loading your data.