Migrating Data from Greenplum 4.3 or 5 to Greenplum 6
You can migrate data from Greenplum Database 4.3 or 5 to Greenplum 6 using the standard backup and restore procedures,
gprestore, or by using
gpcopy for Tanzu Greenplum.
Note: Open source Greenplum Database is available only for Greenplum Database 5 and later.
Note: You can upgrade a Greenplum Database 5.28 system directly to Greenplum 6.9 or later using gpupgrade. You cannot upgrade a Greenplum Database 4.3 system directly to Greenplum 6.
This topic identifies known issues you may encounter when moving data from Greenplum 4.3 to Greenplum 6. You can work around these problems by making needed changes to your Greenplum 4.3 databases so that you can create backups that can be restored successfully to Greenplum 6.
- Preparing the Greenplum 6 Cluster
- Preparing Greenplum 4.3 and 5 Databases for Backup
- Backing Up and Restoring a Database
- Completing the Migration
Parent topic:Upgrading to Greenplum 6
Install and initialize a new Greenplum Database 6 cluster using the version 6
gprestoreonly supports restoring data to a cluster that has an identical number of hosts and an identical number of segments per host, with each segment having the same
content_idas the segment in the original cluster. Use
gpcopy(Tanzu Greenplum) if you need to migrate data to a different-sized Greenplum 6 cluster.
Note: Set the Greenplum Database 6 timezone to a value that is compatible with your host systems. Setting the Greenplum Database timezone prevents Greenplum Database from selecting a timezone each time the cluster is restarted. See Configuring Timezone and Localization Settings for more information.
If you intend to install Greenplum Database 6 on the same hardware as your 4.3 system, you will need enough disk space to accommodate over five times the original data set (two full copies of the primary and mirror data sets, plus the original backup data in ASCII format) in order to migrate data with
gprestore. Keep in mind that the ASCII backup data will require more disk space than the original data, which may be stored in compressed binary format. Offline backup solutions such as Dell EMC Data Domain can reduce the required disk space on each host.
If you want to migrate your data on the same hardware but do not have enough free disk space on your host systems,
gpcopyfor Tanzu Greenplum provides the
--truncate-source-afteroption to truncate each source table after copying the table to the destination cluster and validating that the copy succeeded. This reduces the amount of free space needed to migrate clusters that reside on the same hardware. See Migrating Data with gpcopy for more information.
Install any external modules used in your Greenplum 4.3 system in the Greenplum 6 system before you restore the backup, for example MADlib or PostGIS. If versions of the external modules are not compatible, you may need to exclude tables that reference them when restoring the Greenplum 4.3 backup to Greenplum 6.
The Greenplum 4.3 Oracle Compatibility Functions module is not compatible with Greenplum 6. Uninstall the module from Greenplum 4.3 by running the
uninstall_orafunc.sqlscript before you back up your databases:
You will also need to drop any dependent database objects that reference compatibility functions.
Install the Oracle Compatibility Functions in Greenplum 6 by creating the
orafcemodule in each database where you require the functions:
$ psql -d <dbname> -c 'CREATE EXTENSION orafce'
See Installing Additional Supplied Modules for information about installing
orafceand other modules.
When restoring language-based user-defined functions, the shared object file must be in the location specified in the
CREATE FUNCTIONSQL command and must have been recompiled on the Greenplum 6 system. This applies to user-defined functions, user-defined types, and any other objects that use custom functions, such as aggregates created with the
Greenplum 6 provides resource groups, an alternative to managing resources using resource queues. Setting the
gp_resource_managerserver configuration parameter to
groupselects the resource management scheme the Greenplum Database system will use. The default is
queue, so no action is required when you move from Greenplum version 4.3 to version 6. To more easily transition from resource queues to resource groups, you can set resource groups to allocate and manage memory in a way that is similar to resource queue memory management. To select this feature, set the
MEMORY_SPILL_RATIOattributes of your resource groups to 0. See Using Resource Groups for information about enabling and configuring resource groups.
Filespaces are removed in Greenplum 6. When creating a tablespace, note that different tablespace locations for a primary-mirror pair is no longer supported in Greenplum 6. See Creating and Managing Tablespaces for information about creating and configuring tablespaces.
Note: A Greenplum 4 system must be at least version 4.3.22 to use the
gprestore utilities. A Greenplum 5 system must be at least version 5.5. Be sure to use the latest release of the backup and restore utilities, available for download from VMware Tanzu Network or github.
Important: Make sure that you have a complete backup of all data in the Greenplum Database 4.3 or 5 cluster, and that you can successfully restore the Greenplum Database cluster if necessary.
Following are some issues that are known to cause errors when restoring a Greenplum 4.3 or 5 backup to Greenplum 6. Keep a list of any changes you make to the Greenplum 4.3 or 5 database to enable migration so that you can fix them in Greenplum 6 after restoring the backup.
- If you have configured PXF in your Greenplum Database 5 installation, review Migrating PXF from Greenplum 5 to plan for the PXF migration.
- Greenplum Database version 6 removes support for the
gphdfsprotocol. If you have created external tables that use gphdfs, remove the external table definitions and (optionally) recreate them to use Greenplum Platform Extension Framework (PXF) before you migrate the data to Greenplum 6. Refer to Migrating gphdfs External Tables to PXF in the PXF documentation for the migration procedure.
- References to catalog tables or their attributes can cause a restore to fail due to catalog changes from Greenplum 4.3 or 5 to Greenplum 6. Here are some catalog issues to be aware of when migrating to Greenplum 6:
- In the
pg_classsystem table, the
reltoastidxidcolumn has been removed.
- In the
pg_stat_replicationsystem table, the
procpidcolumn is renamed to
- In the
pg_stat_activitysystem table, the
procpidcolumn is renamed to
current_querycolumn is replaced by two columns:
state(the state of the backend), and
query(the last run query, or currently running query if
- In the
gp_distribution_policysystem table, the
attrnumscolumn is renamed to
distkeyand its data type is changed to
int2vector. A backend function
pg_get_table_distributedby()was added to get the distribution policy for a table as a string.
__gp_masteridcolumns are removed from the
session_level_memory_consumptionsystem view in Greenplum 6. The underlying external tables and functions are removed from the
- Filespaces are removed in Greenplum 6. The
pg_filespace_entrysystem tables are removed. Any reference to
pg_filespace_entrywill fail in Greenplum 6.
- Restoring a Greenplum 4 backup can fail due to lack of dependency checking in Greenplum 4 catalog tables. For example, restoring a UDF can fail if it references a custom data type that is created later in the backup file.
- In the
INTO error_tableclause of the
CREATE EXTERNAL TABLEand
COPYcommands was deprecated in Greenplum 4.3 and is unsupported in Greenplum 5 and 6. Remove this clause from any external table definitions before you create a backup of your Greenplum 4.3 system. The
ERROR_TABLEparameter of the
gploadutility load control YAML file must also be removed from any
gploadYAML files before you run
int4_avg_accum()function signature changed in Greenplum 6 from
int4_avg_accum(bigint, integer). This function is the state transition function (sfunc) called when calculating the average of a series of 4-byte integers. If you have created a custom aggregate in a previous Greenplum release that called the built-in
int4_avg_accum()function, you will need to revise your aggregate for the new signature.
string_agg(expression)function has been removed from Greenplum 6. The function concatenates text values into a string. You can replace the single argument function with the function
string_agg(expression, delimiter)and specify an empty string as the
delimiter, for example
offsetargument of the
lag(expr, offset[, default])window function has changed from
int8in Greenplum 4.3 to
int4in Greenplum 5 and 6.
gpbackupsaves the distribution policy and distribution key for each table in the backup so that data can be restored to the same segment. If a table’s distribution key in the Greenplum 4.3 or 5 database is incompatible with Greenplum 6,
gprestorecannot restore the table to the correct segment in the Greenplum 6 database. This can happen if the distribution key in the older Greenplum release has columns with data types not allowed in Greenplum 6 distribution keys, or if the data representation for data types has changed or is insufficient for Greenplum 6 to generate the same hash value for a distribution key. You should correct these kinds of problems by altering distribution keys in the tables before you back up the Greenplum database.
- Greenplum 6 requires primary keys and unique index keys to match a table’s distribution key. The leaf partitions of partitioned tables must have the same distribution policy as the root partition. These known issues should be corrected in the source Greenplum database before you back up the database:
- If the primary key is different than the distribution key for a table, alter the table to either remove the primary key or change the primary key to match the distribution key.
- If the key columns for a unique index are not a subset of the distribution key columns, before you back up the source database, drop the index and, optionally, recreate it with a compatible key.
- If a partitioned table in the source database has a
DISTRIBUTED BYdistribution policy, but has leaf partitions that are
DISTRIBUTED RANDOMLY, alter the leaf tables to match the root table distribution policy before you back up the source database.
- In Greenplum 4.3, the name provided for a constraint in a
CREATE TABLEcommand was the name of the index created to enforce the constraint, which could lead to indexes having the same name. In Greenplum 6, duplicate index names are not allowed; restoring from a Greenplum 4.3 backup that has duplicate index names will generate errors.
Columns of type
anyarrayare not supported as distribution keys in Greenplum 6.
If you have tables distributed on columns of type
anyarray, use the
ALTER TABLEcommand to set the distribution to
RANDOMbefore you back up the database. After the data is restored, you can set a new distribution policy.
In Greenplum 4.3 and 5, it was possible to
ALTERa table that has a primary key or unique index to be
DISTRIBUTED RANDOMLY. Greenplum 6 does not permit tables
DISTRIBUTED RANDOMLYto have primary keys or unique indexes. Restoring such a table from a Greenplum 4.3 or 5 backup will cause an error.
Greenplum 6 no longer automatically converts from the deprecated timestamp format
YYYYMMDDHH24MISS. The format could not be parsed unambiguously in previous Greenplum Database releases. You can still specify the
YYYYMMDDHH24MISSformat in conversion functions such as
to_charfor compatibility with other database systems. You can use input formats for converting text to date or timestamp values to avoid unexpected results or query execution failures. For example, this
SELECTcommand returns a timestamp in Greenplum Database 5 and fails in 6.
To convert the string to a timestamp in Greenplum Database 6, you must use a valid format. Both of these commands return a timestamp in Greenplum Database 6. The first example explicitly specifies a timestamp format. The second example uses the string in a format that Greenplum Database recognizes.
SELECT to_timestamp('20190905140000','YYYYMMDDHH24MISS'); SELECT to_timestamp('201909051 40000');
The timestamp issue also applies when you use the
::syntax. In Greenplum Database 6, the first command returns an error. The second command returns a timestamp.
SELECT '20190905140000'::timestamp ; SELECT '20190905 140000'::timestamp ;
Creating a table using the
CREATE TABLE AScommand in Greenplum 4.3 or 5 could create a table with a duplicate distribution key. The
gpbackuputility saves the table to the backup using a
CREATE TABLEcommand that lists the duplicate keys in the
DISTRIBUTED BYclause. Restoring this backup will cause a duplicate distribution key error. The
CREATE TABLE AScommand was fixed in Greenplum 5.10 to disallow duplicate distribution keys.
Greenplum 4.3 supports foreign key constraints on columns of different types, for example,
bigint, with implicit type conversion. Greenplum 5 and 6 do not support implicit type conversion. Restoring a table with a foreign key on columns with different data types causes an error.
Only Boolean operators can use Boolean negators. In Greenplum Database 4.3 and 5 it was possible to create a non-Boolean operator that specifies a Boolean negator function. For example, this
CREATE OPERATORcommand creates an integer
@@operator with a Boolean negator:
CREATE OPERATOR public.@@ ( PROCEDURE = int4pl, LEFTARG = integer, RIGHTARG = integer, NEGATOR = OPERATOR(public.!!) );
If you restore a backup containing an operator like this to a Greenplum 6 system,
gprestoreproduces an error:
ERROR: only boolean operators can have negators (SQLSTATE 42P13).
In Greenplum Database 4.3 and 5, the undocumented server configuration parameter
allow_system_table_modscould have a value of
all. In Greenplum 6, this parameter has changed to a Boolean value, with a default value of
false. If there are any references to this parameter in the source database, remove them to prevent errors during the restore.
gpbackup to create a
--metadata-only backup from the source Greenplum database and restore it to the Greenplum 6 system. This helps find any additional problems that are not identified in Preparing Greenplum 4.3 and 5 Databases for Backup. Refer to the Greenplum Backup and Restore documentation for syntax and examples for the
gprestore log file for error messages and correct any remaining problems in the source Greenplum database.
When you are able to restore a metadata backup successfully, create the full backup and then restore it to the Greenplum 6 system, or use
gpcopy (Tanzu Greenplum) to transfer the data. If needed, use the
gprestore filter options to omit schemas or tables that cannot be restored without error.
If you use
gpcopy to migrate Tanzu Greenplum data, initiate the
gpcopy operation from the Greenplum 4.3.26 (or later) or the 5.9 (or later) cluster. See Migrating Data with gpcopy for more information.
Important: When you restore a backup taken from a Greenplum Database 4.3 or 5 system,
gprestore warns that the restore will use legacy hash operators when loading the data. This is because Greenplum 6 has new hash algorithms that map distribution keys to segments, but the data in the backup set must be restored to the same segments as the cluster from which the backup was taken. The
gprestore utility sets the
gp_use_legacy_hashops server configuration parameter to
on when restoring to Greenplum 6 from an earlier version so that the restored tables are created using the legacy operator classes instead of the new default operator classes.
After restoring, you can redistribute these tables with the
gp_use_legacy_hashops parameter set to
off so that the tables use the new Greenplum 6 hash operators. See Working With Hash Operator Classes in Greenplum 6 for more information and examples.
Migrate any tables you skipped during the restore using other methods, for example using the
COPY TO command to create an external file and then loading the data from the external file into Greenplum 6 with the
COPY FROM command.
Recreate any objects you dropped in the Greenplum 4.3 or 5 database to enable migration, such as external tables, indexes, user-defined functions, or user-defined aggregates.
Here are some additional items to consider to complete your migration to Greenplum 6.
- Greenplum Database 5 and 6 remove automatic casts between the text type and other data types. After you migrate from Greenplum Database version 4.3 to version 6, this changed behavior may impact existing applications and queries. Refer to About Implicit Text Casting in Greenplum Database for information, including a discussion about VMware supported and unsupported workarounds.
- After migrating data you may need to modify SQL scripts, administration scripts, and user-defined functions as necessary to account for changes in Greenplum Database version 6. Review the Tanzu Greenplum 6.0.0 Release Notes for features and changes that may necessitate post-migration tasks.
- To use the new Greenplum 6 default hash operator classes, use the command
ALTER TABLE <table> SET DISTRIBUTED BY (<key>)to redistribute tables restored from Greenplum 4.3 or 5 backups. The
gp_use_legacy_hashopsparameter must be set to
offwhen you run the command. See Working With Hash Operator Classes in Greenplum 6 for more information about hash operator classes.
Greenplum 6 has new jump consistent hash operators that map distribution keys for distributed tables to the segments. The new hash operators enable faster database expansion because they don’t require redistributing rows unless they map to a different segment. The hash operators used in Greenplum 4.3 and 5 are present in Greenplum 6 as non-default legacy hash operator classes. For example, for integer columns, the new hash operator class is named
int_ops and the legacy operator class is named
This example creates a table using the legacy hash operator class
test=# SET gp_use_legacy_hashops=on; SET test=# CREATE TABLE t1 ( c1 integer, c2 integer, p integer ) DISTRIBUTED BY (c1); CREATE TABLE test=# \d+ t1 Table "public.t1" Column | Type | Modifiers | Storage | Stats target | Description --------+---------+-----------+---------+--------------+------------- c1 | integer | | plain | | c2 | integer | | plain | | p | integer | | plain | | Distributed by: (c1)
Notice that the distribution key is
c1. If the
gp_use_legacy_hashops parameter is
on and the operator class is a legacy operator class, the operator class name is not shown. However, if
off, the legacy operator class name is reported with the distribution key.
test=# SET gp_use_legacy_hashops=off; SET test=# \d+ t1 Table "public.t1" Column | Type | Modifiers | Storage | Stats target | Description --------+---------+-----------+---------+--------------+------------- c1 | integer | | plain | | c2 | integer | | plain | | p | integer | | plain | | Distributed by: (c1 cdbhash_int4_ops)
The operator class name is reported only when it does not match the setting of the
To change the table to use the new jump consistent operator class, use the
ALTER TABLE command to redistribute the table with the
gp_use_legacy_hashops parameter set to
Note: Redistributing tables with a large amount of data can take a long time.
test=# SHOW gp_use_legacy_hashops; gp_use_legacy_hashops ----------------------- off (1 row) test=# ALTER TABLE t1 SET DISTRIBUTED BY (c1); ALTER TABLE test=# \d+ t1 Table "public.t1" Column | Type | Modifiers | Storage | Stats target | Description --------+---------+-----------+---------+--------------+------------- c1 | integer | | plain | | c2 | integer | | plain | | p | integer | | plain | | Distributed by: (c1)
To verify the default jump consistent operator class has been used, set
gp_use_legacy_hashops to on before you show the table definition.
test=# SET gp_use_legacy_hashops=on; SET test=# \d+ t1 Table "public.t1" Column | Type | Modifiers | Storage | Stats target | Description --------+---------+-----------+---------+--------------+------------- c1 | integer | | plain | | c2 | integer | | plain | | p | integer | | plain | | Distributed by: (c1 int4_ops)