Transforming External Data with gpfdist and gpload
gpfdist parallel file server allows you to set up transformations that enable Greenplum Database external tables to read and write files in formats that are not supported with the
CREATE EXTERNAL TABLE command’s
FORMAT clause. An input transformation reads a file in the foreign data format and outputs rows to
gpfdist in the CSV or other text format specified in the external table’s
FORMAT clause. An output transformation receives rows from
gpfdist in text format and converts them to the foreign data format.
gpload are compatible only with the Greenplum Database major version in which they are shipped. For example, a
gpfdist utility that is installed with Greenplum Database 4.x cannot be used with Greenplum Database 5.x or 6.x.
This topic describes the tasks to set up data transformations that work with
gpfdist to read or write external data files with formats that Greenplum Database does not support.
- About gpfdist Transformations
- Determine the Transformation Schema
- Write a Transformation
- Write the gpfdist Configuration File
- Transfer the Data
- Configuration File Format
- XML Transformation Examples
Parent topic:Loading and Unloading Data
To set up a transformation for a data format, you provide an executable command that
gpfdist can call with the name of the file containing data. For example, you could write a shell script that runs an XSLT transformation on an XML file to output rows with columns delimited with a vertical bar (
|) character and rows delimited with linefeeds.
Transformations are configured in a YAML-formatted configuration file passed to
gpfdist on the command line.
If you want to load the external data into a table in the Greenplum database, you can use the
gpload utility to automate the tasks to create an external table, run
gpfdist, and load the transformed data into the database table.
Accessing data in external XML files from within the database is a common example requiring transformation. The following diagram shows gpfdist performing a transformation on XML files on an ETL server.
Following are the high-level steps to set up a
gpfdist transformation for external data files. The process is illustrated with an XML example.
- Determine the transformation schema.
- Write a transformation.
- Write the gpfdist configuration file.
- Transfer the data.
To prepare for the transformation project:
- Determine the goal of the project, such as indexing data, analyzing data, combining data, and so on.
- Examine the source files and note the file structure and element names.
- Choose the elements to import and decide if any other limits are appropriate.
For example, the following XML file, prices.xml, is a simple XML file that contains price records. Each price record contains two fields: an item number and a price.
<?xml version="1.0" encoding="ISO-8859-1" ?> <prices> <pricerecord> <itemnumber>708421</itemnumber> <price>19.99</price> </pricerecord> <pricerecord> <itemnumber>708466</itemnumber> <price>59.25</price> </pricerecord> <pricerecord> <itemnumber>711121</itemnumber> <price>24.99</price> </pricerecord> </prices>
The goal of this transformation is to import all the data into a Greenplum Database readable external table with an integer
itemnumber column and a decimal
The transformation specifies what to extract from the data. You can use any authoring environment and language appropriate for your project. For XML transformations choose from technologies such as XSLT, Joost (STX), Java, Python, or Perl, based on the goals and scope of the project.
In the price example, the next step is to transform the XML data into a two-column delimited text format.
708421|19.99 708466|59.25 711121|24.99
The following STX transform, called input_transform.stx, performs the data transformation.
<?xml version="1.0"?> <stx:transform version="1.0" xmlns:stx="http://stx.sourceforge.net/2002/ns" pass-through="none"> <!-- declare variables --> <stx:variable name="itemnumber"/> <stx:variable name="price"/> <!-- match and output prices as columns delimited by | --> <stx:template match="/prices/pricerecord"> <stx:process-children/> <stx:value-of select="$itemnumber"/> <stx:text>|</stx:text> <stx:value-of select="$price"/> <stx:text> </stx:text> </stx:template> <stx:template match="itemnumber"> <stx:assign name="itemnumber" select="."/> </stx:template> <stx:template match="price"> <stx:assign name="price" select="."/> </stx:template> </stx:transform>
This STX transform declares two temporary variables,
price, and the following rules.
- When an element that satisfies the XPath expression
/prices/pricerecordis found, examine the child elements and generate output that contains the value of the
|character, the value of the price variable, and a newline.
- When an
<itemnumber>element is found, store the content of that element in the variable
- When a
element is found, store the content of that element in the variable
The gpfdist configuration is specified as a YAML 1.1 document. It contains rules that gpfdist uses to select a transformation to apply when loading or extracting data.
This example gpfdist configuration contains the following items that are required for the prices.xml transformation scenario:
input_transform.shwrapper script, referenced in the
input_transform.stxjoost transformation, called from
Aside from the ordinary YAML rules, such as starting the document with three dashes (
---), a gpfdist configuration must conform to the following restrictions:
VERSIONsetting must be present with the value
TRANSFORMATIONSsetting must be present and contain one or more mappings.
- Each mapping in the
TYPEwith the value ‘input’ or 'output’
COMMANDindicating how the transformation is run.
- Each mapping in the
TRANSFORMATIONcan contain optional
The following gpfdist configuration, called
config.yaml, applies to the prices example. The initial indentation on each line is significant and reflects the hierarchical nature of the specification. The transformation name
prices_input in the following example will be referenced later when creating the table in SQL.
--- VERSION: 188.8.131.52 TRANSFORMATIONS: prices_input: TYPE: input COMMAND: /bin/bash input_transform.sh %filename%
COMMAND setting uses a wrapper script called
input_transform.sh with a
%filename% placeholder. When gpfdist runs the
prices_input transform, it invokes
/bin/bash and replaces the
%filename% placeholder with the path to the input file to transform. The wrapper script called
input_transform.sh contains the logic to invoke the STX transformation and return the output.
If Joost is used, the Joost STX engine must be installed.
#!/bin/bash # input_transform.sh - sample input transformation, # demonstrating use of Java and Joost STX to convert XML into # text to load into Greenplum Database. # java arguments: # -jar joost.jar joost STX engine # -nodecl don't generate a <?xml?> declaration # $1 filename to process # input_transform.stx the STX transformation # # the AWK step eliminates a blank line joost emits at the end java \ -jar joost.jar \ -nodecl \ $1 \ input_transform.stx \ | awk 'NF>0'
input_transform.sh file uses the Joost STX engine with the AWK interpreter. The following diagram shows the process flow as gpfdist runs the transformation.
Create the target database tables with SQL statements based on the appropriate schema.
There are no special requirements for Greenplum Database tables that hold loaded data. In the prices example, the following command creates the
prices table, where the data is to be loaded.
CREATE TABLE prices ( itemnumber integer, price decimal ) DISTRIBUTED BY (itemnumber);
Next, use one of the following approaches to transform the data with gpfdist.
gploadsupports only input transformations, but in many cases is easier to implement.
INSERT INTO SELECT FROMsupports both input and output transformations, but exposes details that
gploadautomates for you.
The Greenplum Database
gpload utility orchestrates a data load operation using the
gpfdist parallel file server and a YAML-formatted configuration file.
gpload automates these tasks:
- Creates a readable external table in the database.
gpfdistinstances with the configuration file that contains the transformation.
INSERT INTO table\_name SELECT FROM external\_tableto load the data.
- Removes the external table definition.
Transforming data with
gpload requires that the settings
TRANSFORM_CONFIG appear in the
INPUT section of the
gpload control file.
For more information about the syntax and placement of these settings in the
gpload control file, see the Greenplum Database Reference Guide.
TRANSFORM_CONFIGspecifies the name of the gpfdist configuration file.
TRANSFORMsetting indicates the name of the transformation that is described in the file named in
--- VERSION: 184.108.40.206 DATABASE: ops USER: gpadmin GPLOAD: INPUT: - TRANSFORM_CONFIG: config.yaml - TRANSFORM: prices_input - SOURCE: FILE: prices.xml
The transformation name must appear in two places: in the
TRANSFORM setting of the gpfdist configuration file and in the
TRANSFORMATIONS section of the file named in the
gpload control file, the optional parameter
MAX_LINE_LENGTH specifies the maximum length of a line in the XML transformation data that is passed to
The following diagram shows the relationships between the
gpload control file, the gpfdist configuration file, and the XML data file.
With this load method, you perform each of the tasks that
gpload automates. You start
gpfdist, create an external table, load the data, and clean up by dropping the table and stopping
Specify the transformation in the
CREATE EXTERNAL TABLE definition’s
LOCATION clause. For example, the transform is shown in bold in the following command. (Run gpfdist first, using the command
gpfdist -c config.yaml).
CREATE READABLE EXTERNAL TABLE prices_readable (LIKE prices) LOCATION ('gpfdist://hostname:8080/prices.xml#transform=prices_input') FORMAT 'TEXT' (DELIMITER '|') LOG ERRORS SEGMENT REJECT LIMIT 10;
In the command above, change
hostname to your hostname.
prices_input comes from the
gpfdist configuration file.
The following query then loads the data into the
INSERT INTO prices SELECT * FROM prices_readable;
The gpfdist configuration file uses the YAML 1.1 document format and implements a schema for defining the transformation parameters. The configuration file must be a valid YAML document.
The gpfdist program processes the document in order and uses indentation (spaces) to determine the document hierarchy and relationships of the sections to one another. The use of white space is significant. Do not use white space for formatting and do not use tabs.
The following is the basic structure of a configuration file.
--- VERSION: 220.127.116.11 TRANSFORMATIONS: transformation_name1: TYPE: input | output COMMAND: command CONTENT: data | paths SAFE: posix-regex STDERR: server | console transformation_name2: TYPE: input | output COMMAND: command ...
VERSION : Required. The version of the gpfdist configuration file schema. The current version is 18.104.22.168.
TRANSFORMATIONS : Required. Begins the transformation specification section. A configuration file must have at least one transformation. When gpfdist receives a transformation request, it looks in this section for an entry with the matching transformation name.
: Required. Specifies the direction of transformation. Values are
input: gpfdist treats the standard output of the transformation process as a stream of records to load into Greenplum Database.
output: gpfdist treats the standard input of the transformation process as a stream of records from Greenplum Database to transform and write to the appropriate output.
COMMAND : Required. Specifies the command gpfdist will run to perform the transformation.
For input transformations, gpfdist invokes the command specified in the
CONTENT setting. The command is expected to open the underlying file(s) as appropriate and produce one line of
TEXT for each row to load into Greenplum Database. The input transform determines whether the entire content should be converted to one row or to multiple rows.
For output transformations, gpfdist invokes this command as specified in the
CONTENT setting. The output command is expected to open and write to the underlying file(s) as appropriate. The output transformation determines the final placement of the converted output.
: Optional. The values are
paths. The default value is
data, the text
COMMANDsection is replaced by the path to the file to read or write.
paths, the text
COMMANDsection is replaced by the path to the temporary file that contains the list of files to read or write.
The following is an example of a
COMMAND section showing the text
%filename% that is replaced.
COMMAND: /bin/bash input_transform.sh %filename%
: Optional. A
POSIX regular expression that the paths must match to be passed to the transformation. Specify
SAFE when there is a concern about injection or improper interpretation of paths passed to the command. The default is no restriction on paths.
: Optional. The values are
This setting specifies how to handle standard error output from the transformation. The default,
server, specifies that gpfdist will capture the standard error output from the transformation in a temporary file and send the first 8k of that file to Greenplum Database as an error message. The error message will appear as an SQL error.
Console specifies that gpfdist does not redirect or transmit the standard error output from the transformation.
The following examples demonstrate the complete process for different types of XML data and STX transformations. Files and detailed instructions associated with these examples are in the GitHub repo
github.com://greenplum-db/gpdb in the gpMgmt/demo/gpfdist_transform directory. Read the README file in the Before You Begin section before you run the examples. The README file explains how to download the example data file used in the examples.
The output of a shell command or script defines command-based web table data. Specify the command in the
EXECUTE clause of
CREATE EXTERNAL WEB TABLE. The data is current as of the time the command runs. The
EXECUTE clause runs the shell command or script on the specified master, and/or segment host or hosts. The command or script must reside on the hosts corresponding to the host(s) defined in the
By default, the command is run on segment hosts when active segments have output rows to process. For example, if each segment host runs four primary segment instances that have output rows to process, the command runs four times per segment host. You can optionally limit the number of segment instances that run the web table command. All segments included in the web table definition in the
ON clause run the command in parallel.
The command that you specify in the external table definition runs from the database and cannot access environment variables from
.profile. Set environment variables in the
EXECUTE clause. For example:
=# CREATE EXTERNAL WEB TABLE output (output text) EXECUTE 'PATH=/home/gpadmin/programs; export PATH; myprogram.sh' FORMAT 'TEXT';
Scripts must be executable by the
gpadmin user and reside in the same location on the master or segment hosts.
The following command defines a web table that runs a script. The script runs on each segment host where a segment has output rows to process.
=# CREATE EXTERNAL WEB TABLE log_output (linenum int, message text) EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST FORMAT 'TEXT' (DELIMITER '|');
This example demonstrates loading a sample IRS Modernized eFile tax return using a Joost STX transformation. The data is in the form of a complex XML file.
The U.S. Internal Revenue Service (IRS) made a significant commitment to XML and specifies its use in its Modernized e-File (MeF) system. In MeF, each tax return is an XML document with a deep hierarchical structure that closely reflects the particular form of the underlying tax code.
XML, XML Schema and stylesheets play a role in their data representation and business workflow. The actual XML data is extracted from a ZIP file attached to a MIME “transmission file” message. For more information about MeF, see Modernized e-File (Overview) on the IRS web site.
The sample XML document, RET990EZ_2006.xml, is about 350KB in size with two elements:
The following is an abridged sample of the XML file.
<?xml version="1.0" encoding="UTF-8"?> <Return returnVersion="2006v2.0" xmlns="https://www.irs.gov/efile" xmlns:efile="https://www.irs.gov/efile" xsi:schemaLocation="https://www.irs.gov/efile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> <ReturnHeader binaryAttachmentCount="1"> <ReturnId>AAAAAAAAAAAAAAAAAAAA</ReturnId> <Timestamp>1999-05-30T12:01:01+05:01</Timestamp> <ReturnType>990EZ</ReturnType> <TaxPeriodBeginDate>2005-01-01</TaxPeriodBeginDate> <TaxPeriodEndDate>2005-12-31</TaxPeriodEndDate> <Filer> <EIN>011248772</EIN> ... more data ... </Filer> <Preparer> <Name>Percy Polar</Name> ... more data ... </Preparer> <TaxYear>2005</TaxYear> </ReturnHeader> ... more data ..
The goal is to import all the data into a Greenplum database. First, convert the XML document into text with newlines “escaped”, with two columns:
ReturnId and a single column on the end for the entire MeF tax return. For example:
Load the data into Greenplum Database.
This example demonstrates loading sample data describing an oil rig using a Joost STX transformation. The data is in the form of a complex XML file downloaded from energistics.org.
The Wellsite Information Transfer Standard Markup Language (WITSML™) is an oil industry initiative to provide open, non-proprietary, standard interfaces for technology and software to share information among oil companies, service companies, drilling contractors, application vendors, and regulatory agencies. For more information about WITSML™, see http://www.energistics.org/.
The oil rig information consists of a top level
<rigs> element with multiple child elements such as
<documentInfo>, <rig>, and so on. The following excerpt from the file shows the type of information in the
<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet href="../stylesheets/rig.xsl" type="text/xsl" media="screen"?> <rigs xmlns="http://www.energistics.org/schemas/131" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.energistics.org/schemas/131 ../obj_rig.xsd" version="22.214.171.124"> <documentInfo> ... misc data ... </documentInfo> <rig uidWell="W-12" uidWellbore="B-01" uid="xr31"> <nameWell>6507/7-A-42</nameWell> <nameWellbore>A-42</nameWellbore> <name>Deep Drill #5</name> <owner>Deep Drilling Co.</owner> <typeRig>floater</typeRig> <manufacturer>Fitsui Engineering</manufacturer> <yearEntService>1980</yearEntService> <classRig>ABS Class A1 M CSDU AMS ACCU</classRig> <approvals>DNV</approvals> ... more data ...
The goal is to import the information for this rig into Greenplum Database.
The sample document, rig.xml, is about 11KB in size. The input does not contain tabs so the relevant information can be converted into records delimited with a pipe (|).
W-12|6507/7-A-42|xr31|Deep Drill #5|Deep Drilling Co.|John Doe|John.Doe@example.com|
With the columns:
well_uid text, – e.g. W-12
well_name text, – e.g. 6507/7-A-42
rig_uid text, – e.g. xr31
rig_name text, – e.g. Deep Drill #5
rig_owner text, – e.g. Deep Drilling Co.
rig_contact text, – e.g. John Doe
rig_email text, – e.g. John.Doe@example.com
Then, load the data into Greenplum Database.