Creating and Managing Tables
Greenplum Database tables are similar to tables in any relational database, except that table rows are distributed across the different segments in the system. When you create a table, you specify the table’s distribution policy.
CREATE TABLE command creates a table and defines its structure. When you create a table, you define:
- The columns of the table and their associated data types. See Choosing Column Data Types.
- Any table or column constraints to limit the data that a column or table can contain. See Setting Table and Column Constraints.
- The distribution policy of the table, which determines how Greenplum Database divides data across the segments. See Choosing the Table Distribution Policy.
- The way the table is stored on disk. See Choosing the Table Storage Model.
- The table partitioning strategy for large tables. See Creating and Managing Databases.
The data type of a column determines the types of data values the column can contain. Choose the data type that uses the least possible space but can still accommodate your data and that best constrains the data. For example, use character data types for strings, date or timestamp data types for dates, and numeric data types for numbers.
For table columns that contain textual data, specify the data type
TEXT. Specifying the data type
CHAR is not recommended. In Greenplum Database, the data types
TEXT handle padding added to the data (space characters added after the last non-space character) as significant characters, the data type
CHAR does not. For information on the character data types, see the
CREATE TABLE command in the Greenplum Database Reference Guide.
Use the smallest numeric data type that will accommodate your numeric data and allow for future expansion. For example, using
BIGINT for data that fits in
SMALLINT wastes storage space. If you expect that your data values will expand over time, consider that changing from a smaller datatype to a larger datatype after loading large amounts of data is costly. For example, if your current data values fit in a
SMALLINT but it is likely that the values will expand,
INT is the better long-term choice.
Use the same data types for columns that you plan to use in cross-table joins. Cross-table joins usually use the primary key in one table and a foreign key in the other table. When the data types are different, the database must convert one of them so that the data values can be compared correctly, which adds unnecessary overhead.
Greenplum Database has a rich set of native data types available to users. See the Greenplum Database Reference Guide for information about the built-in data types.
You can define constraints on columns and tables to restrict the data in your tables. Greenplum Database support for constraints is the same as PostgreSQL with some limitations, including:
CHECKconstraints can refer only to the table on which they are defined.
PRIMARY KEYconstraints must be compatible with their tableʼs distribution key and partitioning key, if any.
PRIMARY KEYconstraints are not allowed on append-optimized tables because the
UNIQUEindexes that are created by the constraints are not allowed on append-optimized tables.
FOREIGN KEYconstraints are allowed, but not enforced.
Constraints that you define on partitioned tables apply to the partitioned table as a whole. You cannot define constraints on the individual parts of the table.
Check constraints allow you to specify that the value in a certain column must satisfy a Boolean (truth-value) expression. For example, to require positive product prices:
=> CREATE TABLE products ( product_no integer, name text, price numeric CHECK (price > 0) );
Not-null constraints specify that a column must not assume the null value. A not-null constraint is always written as a column constraint. For example:
=> CREATE TABLE products ( product_no integer NOT NULL, name text NOT NULL, price numeric );
Unique constraints ensure that the data contained in a column or a group of columns is unique with respect to all the rows in the table. The table must be hash-distributed or replicated (not
DISTRIBUTED RANDOMLY). If the table is hash-distributed, the constraint columns must be the same as (or a superset of) the table’s distribution key columns. For example:
=> CREATE TABLE products ( `product_no` integer `UNIQUE`, name text, price numeric) ` DISTRIBUTED BY (``product_no``)`;
A primary key constraint is a combination of a
UNIQUE constraint and a
NOT NULL constraint. The table must be hash-distributed (not
DISTRIBUTED RANDOMLY), and the primary key columns must be the same as (or a superset of) the table’s distribution key columns. If a table has a primary key, this column (or group of columns) is chosen as the distribution key for the table by default. For example:
=> CREATE TABLE products ( `product_no` integer `PRIMARY KEY`, name text, price numeric) ` DISTRIBUTED BY (``product_no``)`;
Foreign keys are not supported. You can declare them, but referential integrity is not enforced.
Foreign key constraints specify that the values in a column or a group of columns must match the values appearing in some row of another table to maintain referential integrity between two related tables. Referential integrity checks cannot be enforced between the distributed table segments of a Greenplum database.
All Greenplum Database tables are distributed. When you create or alter a table, you optionally specify
DISTRIBUTED BY (hash distribution),
DISTRIBUTED RANDOMLY (round-robin distribution), or
DISTRIBUTED REPLICATED (fully distributed) to determine the table row distribution.
Note: The Greenplum Database server configuration parameter
gp_create_table_random_default_distribution controls the table distribution policy if the DISTRIBUTED BY clause is not specified when you create a table.
For information about the parameter, see “Server Configuration Parameters” of the Greenplum Database Reference Guide.
Consider the following points when deciding on a table distribution policy.
- Even Data Distribution — For the best possible performance, all segments should contain equal portions of data. If the data is unbalanced or skewed, the segments with more data must work harder to perform their portion of the query processing. Choose a distribution key that is unique for each record, such as the primary key.
- Local and Distributed Operations — Local operations are faster than distributed operations. Query processing is fastest if the work associated with join, sort, or aggregation operations is done locally, at the segment level. Work done at the system level requires distributing tuples across the segments, which is less efficient. When tables share a common distribution key, the work of joining or sorting on their shared distribution key columns is done locally. With a random distribution policy, local join operations are not an option.
- Even Query Processing — For best performance, all segments should handle an equal share of the query workload. Query workload can be skewed if a table’s data distribution policy and the query predicates are not well matched. For example, suppose that a sales transactions table is distributed on the customer ID column (the distribution key). If a predicate in a query references a single customer ID, the query processing work is concentrated on just one segment.
The replicated table distribution policy (
DISTRIBUTED REPLICATED) should be used only for small tables. Replicating data to every segment is costly in both storage and maintenance, and prohibitive for large fact tables. The primary use cases for replicated tables are to:
- remove restrictions on operations that user-defined functions can perform on segments, and
- improve query performance by making it unnecessary to broadcast frequently used tables to all segments.
Note: The hidden system columns (
gp_segment_id) cannot be referenced in user queries on replicated tables because they have no single, unambiguous value. Greenplum Database returns a
column does not exist error for the query.
CREATE TABLE’s optional clauses
DISTRIBUTED RANDOMLY, and
DISTRIBUTED REPLICATED specify the distribution policy for a table. The default is a hash distribution policy that uses either the
PRIMARY KEY (if the table has one) or the first column of the table as the distribution key. Columns with geometric or user-defined data types are not eligible as Greenplum Database distribution key columns. If a table does not have an eligible column, Greenplum Database distributes the rows randomly or in round-robin fashion.
Replicated tables have no distribution key because every row is distributed to every Greenplum Database segment instance.
To ensure even distribution of hash-distributed data, choose a distribution key that is unique for each record. If that is not possible, choose
DISTRIBUTED RANDOMLY. For example:
=> CREATE TABLE products ` (name varchar(40), prod_id integer, supplier_id integer) DISTRIBUTED BY (prod_id); `
=> CREATE TABLE random_stuff ` (things text, doodads text, etc text) DISTRIBUTED RANDOMLY; `
Important: If a primary key exists, it is the default distribution key for the table. If no primary key exists, but a unique key exists, this is the default distribution key for the table.
The hash function used for hash distribution policy is defined by the hash operator class for the column’s data type. As the default Greenplum Database uses the data type’s default hash operator class, the same operator class used for hash joins and hash aggregates, which is suitable for most use cases. However, you can declare a non-default hash operator class in the
DISTRIBUTED BY clause.
Using a custom hash operator class can be useful to support co-located joins on a different operator than the default equality operator (
This example creates a custom hash operator class for the integer data type that is used to improve query performance. The operator class compares the absolute values of integers.
Create a function and an equality operator that returns true if the absolute values of two integers are equal.
CREATE FUNCTION abseq(int, int) RETURNS BOOL AS $$ begin return abs($1) = abs($2); end; $$ LANGUAGE plpgsql STRICT IMMUTABLE; CREATE OPERATOR |=| ( PROCEDURE = abseq, LEFTARG = int, RIGHTARG = int, COMMUTATOR = |=|, hashes, merges);
Now, create a hash function and operator class that uses the operator.
CREATE FUNCTION abshashfunc(int) RETURNS int AS $$ begin return hashint4(abs($1)); end; $$ LANGUAGE plpgsql STRICT IMMUTABLE; CREATE OPERATOR CLASS abs_int_hash_ops FOR TYPE int4 USING hash AS OPERATOR 1 |=|, FUNCTION 1 abshashfunc(int);
Also, create less than and greater than operators, and a btree operator class for them. We don’t need them for our queries, but the Postgres Planner will not consider co-location of joins without them.
CREATE FUNCTION abslt(int, int) RETURNS BOOL AS $$ begin return abs($1) < abs($2); end; $$ LANGUAGE plpgsql STRICT IMMUTABLE; CREATE OPERATOR |<| ( PROCEDURE = abslt, LEFTARG = int, RIGHTARG = int); CREATE FUNCTION absgt(int, int) RETURNS BOOL AS $$ begin return abs($1) > abs($2); end; $$ LANGUAGE plpgsql STRICT IMMUTABLE; CREATE OPERATOR |>| ( PROCEDURE = absgt, LEFTARG = int, RIGHTARG = int); CREATE FUNCTION abscmp(int, int) RETURNS int AS $$ begin return btint4cmp(abs($1),abs($2)); end; $$ LANGUAGE plpgsql STRICT IMMUTABLE; CREATE OPERATOR CLASS abs_int_btree_ops FOR TYPE int4 USING btree AS OPERATOR 1 |<|, OPERATOR 3 |=|, OPERATOR 5 |>|, FUNCTION 1 abscmp(int, int);
Now, you can use the custom hash operator class in tables.
CREATE TABLE atab (a int) DISTRIBUTED BY (a abs_int_hash_ops); CREATE TABLE btab (b int) DISTRIBUTED BY (b abs_int_hash_ops); INSERT INTO atab VALUES (-1), (0), (1); INSERT INTO btab VALUES (-1), (0), (1), (2);
Queries that perform a join that use the custom equality operator
|=| can take advantage of the co-location.
With the default integer opclass, this query requires Redistribute Motion nodes.
EXPLAIN (COSTS OFF) SELECT a, b FROM atab, btab WHERE a = b; QUERY PLAN ------------------------------------------------------------------ Gather Motion 4:1 (slice3; segments: 4) -> Hash Join Hash Cond: (atab.a = btab.b) -> Redistribute Motion 4:4 (slice1; segments: 4) Hash Key: atab.a -> Seq Scan on atab -> Hash -> Redistribute Motion 4:4 (slice2; segments: 4) Hash Key: btab.b -> Seq Scan on btab Optimizer: Postgres query optimizer (11 rows)
With the custom opclass, a more efficient plan is possible.
EXPLAIN (COSTS OFF) SELECT a, b FROM atab, btab WHERE a |=| b; QUERY PLAN ------------------------------------------------------------------ Gather Motion 4:1 (slice1; segments: 4) -> Hash Join Hash Cond: (atab.a |=| btab.b) -> Seq Scan on atab -> Hash -> Seq Scan on btab Optimizer: Postgres query optimizer (7 rows)