Loading File Data into Greenplum

Loading File Data into Greenplum

You can use the gpsscli utility to load data from a file into Greenplum Database.

Load Procedure

You will perform the following tasks when you use the Greenplum Streaming Server to load file data into a Greenplum Database table:

  1. Ensure that you meet the Prerequisites.
  2. Register the Greenplum Streaming Server extension.
  3. Identify the format of the data.
  4. Construct the load configuration file.
  5. Create the target Greenplum Database table.
  6. Assign Greenplum Database role permissions to the table, if required.
  7. Run the gpsscli Client Commands to load the data into Greenplum Database.
  8. Check for load errors.

Prerequisites

Before using the gpsscli utilities to load file data to Greenplum Database, ensure that:

  • Your systems meet the Prerequisites documented for the Greenplum Streaming Server.
  • The file is accessible on the ETL server host, and the operating system user running the gpss server process has the appropriate permissions to access the file.

About Supported File Formats

To write file data into a Greenplum Database table, you must identify the format of the file in the load configuration file.

The Greenplum Streaming Server supports loading files of the following formats:

Format Description
avro Avro-format data. Specify the avro format when you want to load a single-object encoded Avro file. GPSS reads Avro data from the file and loads it into a single JSON-type column. You must define a mapping if you want GPSS to write the data into specific columns in the target Greenplum Database table.

GPSS supports libz-, lzma-, and snappy-compressed Avro data.

binary Binary format data. Specify the binary format when your file is binary format. GPSS reads binary data from the file and loads it into a single bytea-type column.
csv Comma-delimited text format data. Specify the csv format when your file data is comma-delimited text and conforms to RFC 4180. The file may not contain line ending characters (CR and LF).
json JSON- or JSONB-format data. Specify the json format when the file is in JSON or JSONB format. GPSS reads JSON data from the file only as a single column. You must define a mapping if you want GPSS to write the data into specific columns in the target Greenplum Database table.
Note: GPSS supports JSONB-format data only when loading to Greenplum 6.

Constructing the filesource.yaml Configuration File

You configure a data load operation from a file to Greenplum Database via a YAML-formatted configuration file. This configuration file includes parameters that identify the source file and information about the Greenplum Database connection and target table, as well as error thresholds for the operation.

The Greenplum Streaming Server supports only VERSION: 2 of the YAML configuration file when you load data into Greenplum from a file. Version 2 of the configuration file format supports loading both file and meta data to Greenplum.

Refer to the filesource.yaml reference page for configuration file format and the configuration properties supported.

A sample file load YAML configuration file named loadfromfile2.yaml follows:

DATABASE: ops
USER: gpadmin
PASSWORD: changeme
HOST: mdw-1
PORT: 5432
VERSION: 2
FILE:
   INPUT:
      SOURCE:
         URL: file:///tmp/file.csv
      VALUE:
         COLUMNS:
           - NAME: id
             TYPE: int
           - NAME: cname
             TYPE: text
           - NAME: oname
             TYPE: text
         FORMAT: csv
      META:
         COLUMNS:
           - NAME: meta
             TYPE: json
         FORMAT: json
      ERROR_LIMIT: 25
   OUTPUT:
      SCHEMA: gpschema
      TABLE: gptable
      MODE: INSERT
      MAPPING:
        - NAME: id
          EXPRESSION: id
        - NAME: cname
          EXPRESSION: cname
        - NAME: fname
          EXPRESSION: (meta->>'filename')::text
SCHEDULE:
  RETRY_INTERVAL: 500ms
  MAX_RETRIES: 2

Greenplum Database Options

You identify the Greenplum Database connection options via the DATABASE, USER, PASSWORD, HOST, and PORT parameters.

The VERSION parameter identifies the version of the GPSS YAML configuration file.
Note: You must specify version 2 when you load from a file data source into Greenplum.

FILE:INPUT Options

Specify the file location using the SOURCE:URL property. GPSS supports wildcards in the file path. If you want to read all files in a directory, specify dirname/*.

The default source-to-target data mapping behaviour of GPSS is to match a column name as defined in VALUE:COLUMNS:NAME with a column name in the target Greenplum Database OUTPUT:TABLE:

  • You must identify the data elements in the order in which they appear in the file.
  • You must provide the same name for each data element and its associated Greenplum Database table column.
  • You must specify a compatible data type for each data element and its associated Greenplum Database table column.

The VALUE block must specify a FORMAT. The VALUE:FORMAT keyword identifies the format of the file. GPSS supports comma-delimited text (csv), binary (binary), JSON/JSONB (json), and Avro (avro) format files.

When you provide a META block, you must specify a single JSON-type COLUMNS and the FORMAT: json. Meta data for a file is a single text property named filename.

The FILTER parameter identifies a filter to apply to the file data before it is loaded into Greenplum Database. If the filter evaluates to true, GPSS loads the data. The data is dropped if the filter evaluates to false. The filter string must be a valid SQL conditional expression and may reference one or more VALUE column names.

The ERROR_LIMIT parameter identifies the number of errors or the error percentage threshold after which GPSS should exit the load operation.

FILE:OUTPUT Options

You identify the target Greenplum Database schema name and table name via the FILE:OUTPUT: SCHEMA and TABLE parameters. You must pre-create the Greenplum Database table before you attempt to load file data.

The default load mode is to insert file data into the Greenplum Database table. GPSS also supports updating and merging file data into a Greenplum table. You specify the load MODE, the MATCH_COLUMNS and UPDATE_COLUMNS, and any UPDATE_CONDITIONs that must be met to merge or update the data. In MERGE MODE, you can also specify ORDER_COLUMNS to filter out duplicates and a DELETE_CONDITION.

You can override the default mapping of the INPUT:VALUE:COLUMNS by specifying a MAPPING block in which you identify the association between a specific column in the target Greenplum Database table and a file data value element. You can also map the META data column, and map a Greenplum Database table column to a value expression.
Note: When you specify a MAPPING block, ensure that you provide entries for all data elements of interest - GPSS does not automatically match column names when you provide a MAPPING.

About the Merge Load Mode

MERGE mode is similar to an UPSERT operation; GPSS may insert new rows in the database, or may update an existing database row that satisfies match and update conditions. GPSS deletes rows in MERGE mode when the data satisfies an optional DELETE_CONDITION that you specify.

GPSS stages a merge operation in a temporary table, generating the SQL to populate the temp table from the set of OUTPUT configuration properties that you provide.

GPSS uses the following algorithm for MERGE mode processing:

  1. Create a temporary table like the target table.
  2. Generate the SQL to insert the source data into the temporary table.
    1. Add the MAPPINGS.
    2. Add the FILTER.
    3. Use MATCH_COLUMNS and ORDER_COLUMNS to filter out duplicates.
  3. Update the target table from rows in the temporary table that satisfy MATCH_COLUMNS, UPDATE_COLUMNS, and UPDATE_CONDITION.
  4. Insert non-matching rows into the target table.
  5. Delete rows in the target table that satisfy MATCH_COLUMNS and the DELETE_CONDITION.
  6. Truncate the temporary table.

About META, VALUEs, and FORMATs

You can specify the avro, binary, csv, and json data format in the Version 2 configuration file INPUT:VALUE:FORMAT, with some restrictions. You cannot specify a META block when INPUT:VALUE:FORMAT is csv.

About Transforming and Mapping Input Data

You can define a MAPPING between the input data (VALUE:COLUMNS and META:COLUMNS) and the columns in the target Greenplum Database table. Defining a mapping may be useful when you have a multi-field input column (such as a JSON-type column), and you want to assign individual components of the input field to specific columns in the target table.

You might also use a MAPPING to assign a value expression to a target table column. The expression must be one that you could specify in the SELECT list of a query, and can include a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so forth.

If you choose to map more than one input column in an expression, you can can create a user-defined function to parse and transform the input column and return the columns of interest.

For example, suppose you are loading a JSON file with the following contents:

{ "customer_id": 1313131, "some_intfield": 12 }
{ "customer_id": 77, "some_intfield": 7 }
{ "customer_id": 1234, "some_intfield": 56 }

You could define a user-defined function, udf_parse_json(), to parse the data as follows:

=> CREATE OR REPLACE FUNCTION udf_parse_json(value json)
     RETURNS TABLE (x int, y text)
   LANGUAGE plpgsql AS $$
     BEGIN
        RETURN query
        SELECT ((value->>'customer_id')::int), ((value->>'some_intfield')::text);
     END $$;

This function returns the two fields in each JSON record, casting the fields to integer and text, respectively.

An example MAPPING for file data in a JSON-format FILE:INPUT:COLUMNS named jdata follows:

MAPPING: 
  cust_id: (jdata->>'customer_id')
  field2: ((jdata->>'some_intfield') * .075)::decimal
  j1, j2: (udf_parse_json(jdata)).*

The Greenplum Database table definition for this example scenario is:

=> CREATE TABLE t1map( cust_id int, field2 decimal(7,2), j1 int, j2 text );

Creating the Greenplum Table

You must pre-create the Greenplum table before you load data into Greenplum Database. You use the FILE:OUTPUT: SCHEMA and TABLE load configuration file parameters to identify the schema and table names.

The target Greenplum table definition must include each column that GPSS will load into the table. The table definition may include additional columns; GPSS ignores these columns, and loads no data into them.

The name and data type that you specify for a column of the target Greenplum Database table must match the name and data type of the related, file data element. If you have defined a column mapping, the name of the Greenplum Database column must match the target column name that you specified for the mapping, and the type must match the target column type or expression that you define.

The CREATE TABLE command for the target Greenplum Database table receiving the file data defined in the loadfromfile2.yaml file presented in the Constructing the filesource.yaml Configuration File section follows:

 testdb=# CREATE TABLE payables.expenses2( id int8, cname text, fname text );