filesource.yaml
Load configuration file for a GPSS file data source.
Synopsis
DATABASE: db_name
USER: user_name
PASSWORD: password
HOST: master_host
PORT: greenplum_port
VERSION: 2
FILE:
   INPUT:
      SOURCE:
         URL: file_path
      VALUE:
         [COLUMNS:
            - NAME: column_name
              TYPE: column_data_type
            [ ... ]]
         FORMAT: value_data_format
         [AVRO_OPTION:
            BYTES_TO_BASE64: boolean]
      [META:
         COLUMNS:
            - NAME: meta_column_name
              TYPE: { json | jsonb }
         FORMAT: json]
      [FILTER: filter_string]
      [ERROR_LIMIT: { num_errors | percentage_errors }]
   OUTPUT:
      [SCHEMA: output_schema_name]
      TABLE: table_name
      [MODE: mode]
      [MATCH_COLUMNS:
         - match_column_name
         [ ... ]]
      [ORDER_COLUMNS:
         - order_column_name
         [ ... ]]
      [UPDATE_COLUMNS:
         - update_column_name
         [ ... ]]
      [UPDATE_CONDITION: update_condition]
      [DELETE_CONDITION: delete_condition]
      [MAPPING:
         - NAME: target_column_name
           EXPRESSION: { source_column_name | expression }
         [ ... ]
           |
         target_column_name: { source_column_name | expression }
         [ ... ]]
[SCHEDULE:
   RETRY_INTERVAL: retry_time
   MAX_RETRIES: num_retries]
You may specify a template variable for any property value in the file using the syntax PROPERTY: {{template_var}}; GPSS substitutes the variable at runtime. See Template Variables below.
Description
You specify the configuration parameters for a Greenplum Streaming Server (GPSS) file load job in a YAML-formatted configuration file that you provide to the gpsscli submit or gpsscli load commands. There are two types of configuration parameters in this file - those that identify the Greenplum Database connection and target table, and parameters specific to the file data source that you will load into Greenplum.
This reference page uses the name filesource.yaml to refer to this file; you may choose your own name for the file.
The gpsscli utility processes the YAML configuration file keywords in order, using indentation (spaces) to determine the document hierarchy and the relationships between the sections. The use of white space in the file is significant, and keywords are case-sensitive.
Keywords and Values
- DATABASE: db_name
- The name of the Greenplum database.
- USER: user_name
- The name of the Greenplum Database user/role. This user_name must have permissions as described in Configuring Greenplum Database Role Privileges.
- PASSWORD: password
- The password for the Greenplum Database user/role.
- HOST: master_host
- The host name or IP address of the Greenplum Database master host.
- PORT: greenplum_port
- The port number of the Greenplum Database server on the master host.
- VERSION: 2
- The version of the GPSS load configuration file. GPSS supports version 2 of this format for a file data source.
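- A minimal sketch of the Greenplum Database connection block for a version 2 file load job, mirroring the example configuration at the end of this page (the database, role, password, host, and port values are placeholders that you replace with your own):
  DATABASE: ops
  USER: gpadmin
  PASSWORD: changeme
  HOST: mdw-1
  PORT: 15432
  VERSION: 2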
- SOURCE:
- The file input configuration parameters.
- URL: file_path
- The URL identifying the file or files to be loaded. You can specify wildcards in any element of the path. To load all files in a directory, specify dirname/*.
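- For example, assuming CSV files reside in a directory named /data/orders (a hypothetical path), a SOURCE block that loads every file in that directory might look like the following:
  SOURCE:
     URL: file:///data/orders/*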
- VALUE:
- The field names, types, and format of the file data. You must specify all data elements in the order in which they appear in the file.
- COLUMNS:NAME: column_name
- The name of a data value column. column_name must match the column name of the target Greenplum Database table.
- The default source-to-target data mapping behaviour of GPSS is to match a column name as defined in COLUMNS:NAME with a column name in the target Greenplum Database TABLE. You can override the default mapping by specifying a MAPPING block.
- COLUMNS:TYPE: data_type
- The data type of the column. You must specify a compatible data type for each data element and the associated Greenplum Database table column.
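- For example, a sketch of a VALUE block that describes two csv data columns; the column names id and description and their types are hypothetical and must match the target Greenplum Database table definition:
  VALUE:
     COLUMNS:
        - NAME: id
          TYPE: int
        - NAME: description
          TYPE: text
     FORMAT: csv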
- FORMAT: data_format
- The format of the value data. You may specify a FORMAT of avro, binary, csv, or json for the value data, with some restrictions.
- avro
- When you specify the avro data format, you must define only a single json type column in COLUMNS. If the schema is registered in a Confluent Schema Registry, you must also provide the AVRO_OPTION.
- binary
- When you specify the binary data format, you must define only a single bytea type column in COLUMNS.
- csv
- When you specify the csv data format, the data content cannot contain line ending characters (CR and LF).
- When you specify FORMAT: csv, you must not provide a META block.
- json
- When you specify the json data format, you must define only a single json type column in COLUMNS.
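- For example, because the json format requires a single json type column, a minimal VALUE block for JSON data might look like the following (the column name jdata is hypothetical):
  VALUE:
     COLUMNS:
        - NAME: jdata
          TYPE: json
     FORMAT: json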
- AVRO_OPTION:BYTES_TO_BASE64: boolean
- When true, GPSS converts Avro bytes fields into base64-encoded strings. The default value is false; GPSS does not perform the conversion.
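- For example, a sketch of a VALUE block for Avro data that converts bytes fields into base64-encoded strings; the single json type column named value follows the example configuration at the end of this page:
  VALUE:
     COLUMNS:
        - NAME: value
          TYPE: json
     FORMAT: avro
     AVRO_OPTION:
        BYTES_TO_BASE64: true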
- META:
- The field name, type, and format of the file meta data. META must specify a single json or jsonb (Greenplum 6 only) type column and FORMAT: json. The available meta data for a file is a single text-type property named filename. You can load this property into the target table with a MAPPING, or use the property in the update or merge criteria for a load operation.
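- For example, a META block that exposes the filename property in a json column named meta, as in the example configuration at the end of this page; you could then reference (meta->>'filename')::text in a MAPPING or FILTER:
  META:
     COLUMNS:
        - NAME: meta
          TYPE: json
     FORMAT: json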
- FILTER: filter_string
- The filter to apply to the input data before GPSS loads the data into Greenplum Database. If the filter evaluates to true, GPSS loads the row of data. If the filter evaluates to false, the row is dropped. filter_string must be a valid SQL conditional expression and may reference one or more META or VALUE column names.
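- For example, the filter from the example configuration at the end of this page loads only those rows whose x property in the json value column is less than 10:
  FILTER: (value->>'x')::int < 10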
- ERROR_LIMIT: { num_errors | percentage_errors }
- The error threshold, specified as either an absolute number or a percentage. GPSS stops the load operation when this limit is reached. The default ERROR_LIMIT is zero; GPSS disables error logging and stops the load operation when it encounters the first error. Due to a limitation of the Greenplum Database external table framework, GPSS does not accept ERROR_LIMIT: 1.
- SCHEMA: output_schema_name
- The name of the Greenplum Database schema in which table_name resides. Optional; the default schema is the public schema.
- TABLE: table_name
- The name of the Greenplum Database table into which GPSS loads the file data.
- MODE: mode
- The table load mode. Valid mode values are INSERT, MERGE, or UPDATE. The default value is INSERT.
- UPDATE - Updates the target table columns that are listed in UPDATE_COLUMNS when the input columns identified in MATCH_COLUMNS match the named target table columns and the optional UPDATE_CONDITION is true.
- UPDATE is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
- MERGE - Inserts new rows and updates existing rows when:
  - columns are listed in UPDATE_COLUMNS,
  - the MATCH_COLUMNS target table column values are equal to the input data, and
  - an optional UPDATE_CONDITION is specified and met.
- MERGE deletes rows when:
  - the MATCH_COLUMNS target table column values are equal to the input data, and
  - an optional DELETE_CONDITION is specified and met.
- MERGE is not supported if the target table column name is a reserved keyword, has capital letters, or includes any character that requires quotes (" ") to identify the column.
- MATCH_COLUMNS:
- Required if MODE is MERGE or UPDATE.
- match_column_name
- Specifies the column(s) to use as the join condition for the update. The attribute value in the specified target column(s) must be equal to that of the corresponding source data column(s) in order for the row to be updated in the target table.
- ORDER_COLUMNS:
- Optional. May be specified in MERGE MODE to sort the input data rows.
- order_column_name
- Specify the column(s) by which GPSS sorts the rows. When multiple matching rows exist in a batch, ORDER_COLUMNS is used with MATCH_COLUMNS to determine the input row with the largest value; GPSS uses that row to write/update the target.
- UPDATE_COLUMNS:
- Required if MODE is MERGE or UPDATE.
- update_column_name
- Specifies the column(s) to update for the rows that meet the MATCH_COLUMNS criteria and the optional UPDATE_CONDITION.
- UPDATE_CONDITION: update_condition
- Optional. Specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met in order for a row in the target table to be updated (or inserted, in the case of a MERGE).
- DELETE_CONDITION: delete_condition
- Optional. In MERGE MODE, specifies a boolean condition, similar to that which you would declare in a WHERE clause, that must be met for GPSS to delete rows in the target table that meet the MATCH_COLUMNS criteria.
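- For example, a sketch of an OUTPUT block for a merge load; the schema, table, column names, and conditions shown here are hypothetical and would be adjusted to your target table definition:
  OUTPUT:
     SCHEMA: gpschema
     TABLE: orders
     MODE: MERGE
     MATCH_COLUMNS:
        - order_id
     ORDER_COLUMNS:
        - updated_at
     UPDATE_COLUMNS:
        - status
     UPDATE_CONDITION: status <> 'cancelled'
     DELETE_CONDITION: status = 'purged'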
- MAPPING:
- Optional. Overrides the default source-to-target column mapping. GPSS supports two mapping syntaxes.
- Note: When you specify a MAPPING, ensure that you provide a mapping for all data value elements of interest. GPSS does not automatically match column names when you provide a MAPPING.
- NAME: target_column_name
- Specifies the target Greenplum Database table column name.
- EXPRESSION: { source_column_name | expression }
- Specifies a value or meta COLUMNS:NAME (source_column_name) or an expression. When you specify an expression, you may provide a value expression that you would specify in the SELECT list of a query, such as a constant value, a column reference, an operator invocation, a built-in or user-defined function call, and so on.
- target_column_name: { source_column_name | expression }
- When you use this MAPPING syntax, specify the target_column_name and {source_column_name | expression} as described above.
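- For example, the two mapping syntaxes side by side; the target column names a and b and the source expressions mirror the example configuration at the end of this page, and the shorthand form is assumed to list target column names directly as keys under MAPPING:
  MAPPING:
     - NAME: a
       EXPRESSION: (value->>'x')::int
     - NAME: b
       EXPRESSION: (value->>'y')::text
  or, equivalently:
  MAPPING:
     a: (value->>'x')::int
     b: (value->>'y')::text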
- SCHEDULE:
- Controls the frequency and interval of restarting failed jobs.
- RETRY_INTERVAL: retry_time
- The period of time that GPSS waits before retrying the job. You can specify the time interval in day (d), hour (h), minute (m), second (s), or millisecond (ms) integer units; do not mix units. The default retry interval is 5m (5 minutes).
- MAX_RETRIES: num_retries
- The maximum number of times that GPSS attempts to retry the job. The default is 0 (do not retry). If you specify a negative value, GPSS retries the job indefinitely.
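- For example, the SCHEDULE block from the example configuration at the end of this page retries a failed job at most twice, waiting 500 milliseconds between attempts:
  SCHEDULE:
     RETRY_INTERVAL: 500ms
     MAX_RETRIES: 2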
Template Variables
GPSS supports using template variables to specify property values in the load configuration file. You specify a template variable for a property by enclosing the variable name in double braces:

PROPERTY: {{template_var}}

For example:

MAX_RETRIES: {{numretries}}
GPSS substitutes the template variable with a value that you specify via the -p | --property template_var=value option to the gpsscli submit, gpsscli load, or gpkafka load command.
For example, if you specify --property numretries=10 on the command line, GPSS substitutes occurrences of {{numretries}} in the load configuration file with the value 10 before submitting the job, and uses that value during job execution.
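For example, a load configuration file might parameterize the retry count, and you might supply the value when you submit the job; the configuration file name below is hypothetical, and a possible invocation might be:

SCHEDULE:
   RETRY_INTERVAL: 5m
   MAX_RETRIES: {{numretries}}

$ gpsscli submit --property numretries=10 loadfromfile.yaml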
Notes
If you created a database object name using a double-quoted identifier (delimited identifier), you must specify the delimited name within single quotes in the filesource.yaml configuration file. For example, if you create a table as follows:
CREATE TABLE "MyTable" ("MyColumn" text);
Your filesource.yaml YAML configuration file would refer to the above table and column names as:
COLUMNS:
   - name: '"MyColumn"'
     type: text
OUTPUT:
   TABLE: '"MyTable"'
Examples
Submit a job to load data from an Avro file as defined in the version 2 load configuration file named loadfromfile.yaml:
$ gpsscli submit loadfromfile.yaml
Example loadfromfile.yaml configuration file:
DATABASE: ops
USER: gpadmin
PASSWORD: changeme
HOST: mdw-1
PORT: 15432
VERSION: 2
FILE:
   INPUT:
      SOURCE:
         URL: file:///tmp/file.avro
      VALUE:
         COLUMNS:
            - NAME: value
              TYPE: json
         FORMAT: avro
      META:
         COLUMNS:
            - NAME: meta
              TYPE: json
         FORMAT: json
      FILTER: (value->>'x')::int < 10
      ERROR_LIMIT: 25
   OUTPUT:
      SCHEMA: gpschema
      TABLE: gptable
      MODE: INSERT
      MAPPING:
         - NAME: a
           EXPRESSION: (value->>'x')::int
         - NAME: b
           EXPRESSION: (value->>'y')::text
         - NAME: c
           EXPRESSION: (meta->>'filename')::text
SCHEDULE:
   RETRY_INTERVAL: 500ms
   MAX_RETRIES: 2