Loading Data with the Connector

You use the Apache NiFi user interface to configure a dataflow that uses the Greenplum Connector for Apache NiFi PutGreenplumRecord processor to load record-oriented data from any source into Greenplum Database. You will perform the following tasks when you use the Connector to load data into a Greenplum table:

  1. Ensure that you meet the prerequisites.
  2. Configure the Greenplum adapter controller service.
  3. Identify the format and schema of the input data.
  4. Configure the record reader controller service.
  5. Create the target Greenplum Database table.
  6. Build a dataflow that uses the PutGreenplumRecord processor.
  7. Start the dataflow.
  8. Check the load operation results.

Prerequisites

Before you set up a dataflow using the Connector, ensure that:

  • You have access to a running Greenplum Database cluster, and you can identify the host name or IP address of the master host and the port number on which the master server is running if it is not the default port (5432). Note the Greenplum master host and port number.
  • You can identify the host name or IP address and port number of a running Greenplum Streaming Server instance to which the Connector will direct load requests. Or, you configure and start a new streaming server instance as described in Configuring and Managing the Streaming Server in the Greenplum Streaming Server documentation.

    If you are running the Connector against Greenplum Streaming Server version 1.4.x or older, you must specify "ReuseTables": false in the gpss.json server configuration file that you use to start the Greenplum Streaming Server.

    Be sure to note the host and port number of the streaming server instance.

  • You can identify:

    • The name of the Greenplum database.
    • The name of the Greenplum Database table you want to load data into, and the name of the schema in which it resides.
    • The user/role name and password that you will use to access Greenplum Database. This role must be assigned certain privileges to the Greenplum database, schema, and table as described in Configuring Greenplum Database Role Privileges in the Greenplum Streaming Server documentation.
  • You have registered the Greenplum Streaming Server extension in the Greenplum database as described in Registering the GPSS Extension in the Greenplum Streaming Server documentation.

  • You can identify the Apache NiFi server host and port number.

  • Network connectivity exists between the Apache NiFi host(s) and the Greenplum Streaming Server host.

Configuring the Greenplum Adapter Controller Service

The PutGreenplumRecord processor uses a controller service to manage the connection to Greenplum Database.

In the Prerequisites section above, you identified the Greenplum Database master host and port number, Greenplum Streaming Server host and port number, the Greenplum database, and the Greenplum user/role name and password. You create and configure an instance of the GreenplumGPSSAdapter controller service for each unique combination of these property settings as described in About the Greenplum Adapter.

Identifying the Input Data Format and Schema

The PutGreenplumRecord processor can accept record-oriented data in the Avro, CSV, Json, Parquet, and XML formats, among others. The processor uses a Record Reader controller service to parse and deserialize data in the incoming FlowFiles. The format of the data inside the FlowFile informs both your choice of Record Reader, and the definition of the Greenplum Database table that the Connector loads data into.

The Record Reader requires the schema of the input data in order to parse and deserialize it.

About the Data Schema

NiFi data records are described by a schema. The schema defines the names and types of the fields in the input data records.

A NiFi schema definition is specified in Avro format. The example Avro schema below specifies a record with three fields, and identifies the names and data types of the fields:

{
  "name": "datatypes_record",
  "type": "record",
  "fields": [
    { "name": "lastname",  "type": ["string", "null"] },
    { "name": "age",       "type": ["int", "null"] },
    { "name": "birthdate", "type": {"type":"int", "logicalType":"date"} }
  ]
}

When you configure a Record Reader, you must specify the origin of the schema. A schema may be:

  • Inferred from the input data (auto-discovered).
  • Embedded in the input data, such as with Parquet and Avro files.
  • Explicitly specified when you configure the reader.
  • Retrieved from a schema registry.
Apache NiFi must read the whole FlowFile to infer the schema from the input data; this is often inefficient.

Once you identify and specify the schema of the incoming FlowFiles, you have the information that you need to select and configure the Record Reader and create the Greenplum Database table.

Configuring the Record Reader Controller Service

The PutGreenplumRecord processor uses a controller service to parse and deserialize incoming data in FlowFiles

You choose and configure a Record Reader type and instance that corresponds to the format of the data in the incoming FlowFiles as described in About the Record Reader. You must specify the origin of the schema when you configure the reader controller service.

Creating the Target Greenplum Table

You specify the name of the target Greenplum table when you configure the PutGreenplumRecord processor. You must create this table before you initiate a NiFi dataflow; the Connector does not create the table for you.

The data types that you specify for the target Greenplum Database table columns must match the data types of the input FlowFile record fields. You can reference the schema of the input records, or the data itself, to identify its type and definition.

The column names that you specify when you create the Greenplum table must match the input FlowFile field names with a few caveats. When you configure the PutGreenplumRecord processor, you specify if and how you want the Connector to translate the field names to Greenplum table column names. You can:

  • Turn case-sensitivity of the translation on/off.
  • Specify the behaviour of the Connector when there is no matching table column for one or more FlowFile record fields.
  • Specify the behaviour of the Connector when there is no matching FlowFile record field for one or more table columns.

Specifying Field and Column Name Mappings describes the available field-to-column translation configuration options for the Connector.

The Greenplum table may be the target of INSERT, UPDATE, or MERGE operations. To update or merge data in a Greenplum table, you must be able to identify a set of table columns that uniquely identifies a row in the table. The Connector uses these Match Columns to locate existing table rows. About the Insert, Merge, and Update Properties further describes these operations and configuration properties.

Building the DataFlow

When you build a dataflow, you drag components from the NiFi toolbar to the canvas, configure the components, and then connect them together.

When you build a dataflow using the PutGreenplumRecord processor, you configure it as described above, and in Configuring the Connector.

Remember to define the success, failure, and retry relationships for the FlowFiles processed by the PutGreenplumRecord, or auto-terminate them.

Starting the DataFlow

A component that you add to a dataflow is in the stopped state. You must start all linked components to initiate the flow.

Starting a component may also trigger a dataflow. In other cases, some external event triggers a flow, such as a new file added to a directory, or a new message emitted by some external data source.

Checking the Load Operation Results

When a load operation succeeds, the PutGreenplumRecord processor emits a SEND provenance event. These events are displayed in the global or processor-specific Data Provenance dialog.

You can detect load operation errors in one or more of the following ways:

  • View messages written to the $NIFI_HOME/logs/nifi-app.log file.
  • When the upper right corner of the processor component displays a red square, the processor has encountered one or more warnings or errors. Hovering over the red square pops up a dialog that displays the recent warnings and errors emitted by the processor.
  • If you have configured FlowFile routing to a failure relationship, examine the connection and downstream processor components.

You can also query the target Greenplum table to verify the load.