Reading JSON Data from HDFS
Use the PXF HDFS Connector to read JSON-format data. This section describes how to use PXF to access JSON data in HDFS, including how to create and query an external table that references a JSON file in the HDFS data store.
Prerequisites
Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from HDFS.
Working with JSON Data
JSON is a text-based data-interchange format. JSON data is typically stored in a file with a .json suffix.

A .json file contains a collection of objects. A JSON object is an unordered collection of name/value pairs. A value can be a string, a number, true, false, null, or an object or an array. You can define nested JSON objects and arrays.
Sample JSON data file content:
{
    "created_at":"MonSep3004:04:53+00002013",
    "id_str":"384529256681725952",
    "user": {
        "id":31424214,
        "location":"COLUMBUS"
    },
    "coordinates":{
        "type":"Point",
        "values":[
            13,
            99
        ]
    }
}
In the sample above, user is an object composed of fields named id and location. To specify the nested fields in the user object as Greenplum Database external table columns, use dot (.) projection:

user.id
user.location
coordinates is an object composed of a text field named type and an array of integers named values.

To retrieve all the values of the values array within a single column, define the corresponding Greenplum Database external table column with one of the following types: TEXT, VARCHAR, or BPCHAR.
coordinates.values
Alternatively, use [n] to identify specific elements of the values array as separate Greenplum Database external table columns:
coordinates.values[0]
coordinates.values[1]
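These projection rules can be sketched in Python. The helper below is purely illustrative (PXF is a Java service and exposes no such function); it shows how a column name like user.id or coordinates.values[1] would resolve against the sample record:

```python
import json

def project(record, column):
    """Resolve a dotted column name, with optional [n] array indexes,
    against a parsed JSON record. Illustrative only -- not a PXF API."""
    value = record
    for part in column.split("."):
        if part.endswith("]"):
            # Split "values[1]" into the field name and the array index.
            name, index = part[:-1].split("[")
            value = value[name][int(index)]
        else:
            value = value[part]
    return value

record = json.loads('''
{
  "created_at": "MonSep3004:04:53+00002013",
  "id_str": "384529256681725952",
  "user": {"id": 31424214, "location": "COLUMBUS"},
  "coordinates": {"type": "Point", "values": [13, 99]}
}
''')

print(project(record, "user.location"))          # COLUMBUS
print(project(record, "coordinates.values[1]"))  # 99
```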
Refer to Introducing JSON for detailed information on JSON syntax.
JSON to Greenplum Database Data Type Mapping
To represent JSON data in Greenplum Database, map data values that use a primitive data type to Greenplum Database columns of the same type. JSON supports complex data types including objects and arrays. Use N-level projection to map members of nested objects and arrays to primitive data types.
The following table summarizes external mapping rules for JSON data.
| JSON Data Type | PXF/Greenplum Data Type |
|---|---|
| Primitive type (integer, float, string, boolean, null) | Use the corresponding Greenplum Database built-in data type; see Greenplum Database Data Types. |
| Array | Use TEXT, VARCHAR, or BPCHAR to retrieve the contents of the array in one column. Use [n] brackets to identify a specific array index to a member of primitive type. |
| Object | Use dot (.) notation to specify each level of projection (nesting) to a member of a primitive type. |
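As a rough illustration of the first two rows of this table, the Python sketch below suggests a Greenplum column type for a JSON value. This is an assumption for illustration only; PXF's real type handling lives in its Java codebase:

```python
def suggest_gp_type(value):
    """Suggest a Greenplum column type for a parsed JSON value, following
    the mapping table above. Illustrative helper, not a PXF function."""
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "REAL"
    if isinstance(value, list):
        # Whole array in one column; use [n] projection for per-element columns.
        return "TEXT"
    return "TEXT"                 # strings (and anything else) map to TEXT

print(suggest_gp_type(31424214))  # INTEGER
print(suggest_gp_type([13, 99]))  # TEXT
```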
JSON Data Read Modes
PXF supports two JSON data read modes. The default mode expects one full JSON record per line. PXF also supports a read mode that operates on JSON records spanning multiple lines.
In upcoming examples, you will use both read modes to operate on a sample data set. The schema of the sample data set defines objects with the following member names and value data types:
- "created_at" - text
- "id_str" - text
- "user" - object
  - "id" - integer
  - "location" - text
- "coordinates" - object (optional)
  - "type" - text
  - "values" - array
    - [0] - integer
    - [1] - integer
The single-JSON-record-per-line data set follows:
{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values": [ 6, 50 ]}}
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{"id":287819058,"location":""}, "coordinates": null}
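Before loading a file for the default read mode, it can be worth confirming that every line really is a standalone JSON record. The following Python check is an optional sanity check, not a required step:

```python
import json

# Two of the single-line records from the sample data set above.
sample = '''{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values":[6,50]}}
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{"id":26643566,"location":"Austin,Texas"},"coordinates":null}'''

# In the default mode, each non-empty line must parse as a complete JSON object.
records = [json.loads(line) for line in sample.splitlines() if line.strip()]
print(len(records))                    # 2
print(records[0]["user"]["location"])  # NearCornwall
```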
The multi-line JSON record data set follows:
{
    "root":[
        {
            "record_obj":{
                "created_at":"MonSep3004:04:53+00002013",
                "id_str":"384529256681725952",
                "user":{
                    "id":31424214,
                    "location":"COLUMBUS"
                },
                "coordinates":null
            },
            "record_obj":{
                "created_at":"MonSep3004:04:54+00002013",
                "id_str":"384529260872228864",
                "user":{
                    "id":67600981,
                    "location":"KryberWorld"
                },
                "coordinates":{
                    "type":"Point",
                    "values":[
                        8,
                        52
                    ]
                }
            }
        }
    ]
}
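Conceptually, multi-line reading finds each occurrence of the identifier field and returns its parent object as one tuple. The sketch below is a simplified illustration of that idea only (PXF's actual implementation is in Java and far more robust); it assumes the identifier is the first field of its record and that no string value contains braces:

```python
import json

def extract_records(text, identifier):
    """Return the parent object of each occurrence of `identifier`.
    Simplified sketch: assumes the identifier is the first field in its
    record and that no string value contains '{' or '}'."""
    records, pos = [], 0
    needle = '"%s"' % identifier
    while (hit := text.find(needle, pos)) != -1:
        start = text.rfind("{", 0, hit)   # opening brace of the parent object
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:            # parent object is complete
                    records.append(json.loads(text[start:i + 1]))
                    pos = i + 1
                    break
    return records

# A condensed version of the multi-line sample data set above.
multiline = '''{"root":[{"record_obj":{"created_at":"MonSep3004:04:53+00002013",
"user":{"id":31424214,"location":"COLUMBUS"},"coordinates":null},
"record_obj":{"created_at":"MonSep3004:04:54+00002013",
"user":{"id":67600981,"location":"KryberWorld"},
"coordinates":{"type":"Point","values":[8,52]}}}]}'''

recs = extract_records(multiline, "created_at")
print(len(recs))  # 2
```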
You will create JSON files for the sample data sets and add them to HDFS in the next section.
Loading the Sample JSON Data to HDFS
The PXF HDFS connector reads native JSON stored in HDFS. Before you can use Greenplum Database to query JSON format data, the data must reside in your HDFS data store.
Copy and paste the single-line JSON record sample data set above to a file named singleline.json. Similarly, copy and paste the multi-line JSON record data set to a file named multiline.json.
Note: Ensure that there are no blank lines in your JSON files.
Copy the JSON data files that you just created to your HDFS data store. Create the /data/pxf_examples directory if you did not do so in a previous exercise. For example:
$ hdfs dfs -mkdir /data/pxf_examples
$ hdfs dfs -put singleline.json /data/pxf_examples
$ hdfs dfs -put multiline.json /data/pxf_examples
Once the data is loaded to HDFS, you can use Greenplum Database and PXF to query and analyze the JSON data.
Creating the External Table
Use the hdfs:json profile to read JSON-format files from HDFS. The following syntax creates a Greenplum Database readable external table that references such a file:
CREATE EXTERNAL TABLE <table_name>
( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-file>?PROFILE=hdfs:json[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
The specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described in the table below.
| Keyword | Value |
|---|---|
| \<path-to-hdfs-file> | The path to the directory or file in the HDFS data store. When the \<server_name> configuration includes a pxf.fs.basePath property setting, PXF considers \<path-to-hdfs-file> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path-to-hdfs-file> must not specify a relative path nor include the dollar sign ($) character. |
| PROFILE | The PROFILE keyword must specify hdfs:json. |
| SERVER=\<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified. |
| \<custom-option> | \<custom-option>s are discussed below. |
| FORMAT 'CUSTOM' | Use FORMAT 'CUSTOM' with the hdfs:json profile. The CUSTOM FORMAT requires that you specify (FORMATTER='pxfwritable_import'). |
PXF supports single- and multi-line JSON records. When you want to read multi-line JSON records, you must provide an IDENTIFIER \<custom-option> and value. Use this \<custom-option> to identify the name of a field whose parent JSON object you want to be returned as individual tuples.

The hdfs:json profile supports the following \<custom-option>s:
| Option Keyword | Syntax, Example(s) | Description |
|---|---|---|
| IDENTIFIER | &IDENTIFIER=\<value> &IDENTIFIER=created_at | You must include the IDENTIFIER keyword and \<value> in the LOCATION string only when you are accessing JSON data comprised of multi-line records. Use the \<value> to identify the name of the field whose parent JSON object you want to be returned as individual tuples. |
| IGNORE_MISSING_PATH | &IGNORE_MISSING_PATH=\<boolean> | Specify the action to take when \<path-to-hdfs-file> is missing or invalid. The default value is false; PXF returns an error in this situation. When the value is true, PXF ignores missing path errors and returns an empty fragment. |
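The LOCATION string is a URI assembled from these parts. The hypothetical helper below (not part of PXF; an illustration of how the pieces combine) builds one:

```python
def pxf_location(path, profile="hdfs:json", server=None, **options):
    """Assemble a pxf:// LOCATION URI from a path, profile, optional
    server name, and custom options. Illustrative helper only."""
    params = ["PROFILE=%s" % profile]
    if server:
        params.append("SERVER=%s" % server)
    # Custom options such as IDENTIFIER or IGNORE_MISSING_PATH:
    params += ["%s=%s" % (k, v) for k, v in options.items()]
    return "pxf://%s?%s" % (path, "&".join(params))

print(pxf_location("data/pxf_examples/multiline.json", IDENTIFIER="created_at"))
# pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at
```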
Note: In certain edge cases when reading multi-line JSON records with an IDENTIFIER, there is a possibility that PXF could return incorrect results. Should you need to, you can work around this edge case by compressing the JSON file and having PXF read the compressed file.

Example: Reading a JSON File with Single Line Records
Use the following CREATE EXTERNAL TABLE SQL command to create a readable external table that references the single-line-per-record JSON data file and uses the PXF default server.
CREATE EXTERNAL TABLE singleline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values" TEXT
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
If you need to identify specific elements of the coordinates.values array in different columns, use [n] brackets to identify the JSON array elements:
CREATE EXTERNAL TABLE singleline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values[0]" INTEGER,
"coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/singleline.json?PROFILE=hdfs:json')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
Notice the use of dot (.) projection to access the nested fields in the user and coordinates objects, and the use of [n] brackets to access specific elements of the coordinates.values array.
To query the JSON data in the external table:
SELECT * FROM singleline_json_tbl;
If you have used the first option to retrieve the JSON array in a single TEXT column and wish to convert the JSON array serialized as TEXT back into a native Greenplum array type, you can use the following example:
SELECT "user.id",
       ARRAY(SELECT json_array_elements_text("coordinates.values"::json))::int[] AS coords
FROM singleline_json_tbl;
Note that this is only possible when using Greenplum Database 6.x; the json_array_elements_text() function is not available in Greenplum Database 5.x.
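The conversion the query above performs can be sketched in Python: the column holds the serialized text of a JSON array, which is parsed and cast element by element to integers:

```python
import json

# A value as it would arrive in the TEXT "coordinates.values" column.
coordinates_values = "[6, 50]"

# json_array_elements_text() plus the ::int[] cast, sketched in Python:
coords = [int(v) for v in json.loads(coordinates_values)]
print(coords)  # [6, 50]
```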
Example: Reading a JSON file with Multi-Line Records
The SQL command to create a readable external table from the multi-line-per-record JSON file is very similar to that of the single-line data set above. You must additionally specify the IDENTIFIER keyword and an associated value in the LOCATION clause when you want to read multi-line JSON records. For example:
CREATE EXTERNAL TABLE multiline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values" TEXT
)
LOCATION('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
created_at identifies the member name of the first field in the JSON record record_obj in the sample data schema.
If you need to identify specific elements of the coordinates.values array in different columns, use [n] brackets to identify the JSON array elements:
CREATE EXTERNAL TABLE multiline_json_tbl(
created_at TEXT,
id_str TEXT,
"user.id" INTEGER,
"user.location" TEXT,
"coordinates.values[0]" INTEGER,
"coordinates.values[1]" INTEGER
)
LOCATION('pxf://data/pxf_examples/multiline.json?PROFILE=hdfs:json&IDENTIFIER=created_at')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
To query the JSON data in this external table:
SELECT * FROM multiline_json_tbl;