VMware Tanzu Greenplum Streaming Server 1.6 Release Notes
This document contains release information for VMware Tanzu Greenplum Streaming Server version 1.6. The Greenplum Streaming Server (GPSS) is included in certain Tanzu Greenplum 5.x and 6.x distributions. GPSS for Red Hat/CentOS 6 and 7 and Ubuntu 18.04 is also updated and distributed independently of Greenplum Database. You may need to download and install the GPSS distribution from VMware Tanzu Network to obtain the most recent version of this component.
Tanzu Greenplum Streaming Server 1.6.x is compatible with these Tanzu Greenplum versions:
- Tanzu Greenplum 5.17.0 and later
- Tanzu Greenplum 6.0.0 and later
Release Date: May 28, 2021
Greenplum Streaming Server 1.6.0 adds new features, includes changes, and resolves issues.
New and Changed Features
Greenplum Streaming Server 1.6.0 includes these new and changed features:
- GPSS adds the -c | --config flag/option to the gpss command to specify the JSON-formatted configuration file.
- The gpsscli --version command now displays the version of the GPSS server in addition to displaying that of the client.
- The gpss.json server configuration file now includes a KeepAlive property block. Use the configuration properties in this block to specify timeout options for the gRPC connection between the GPSS client and the GPSS server.
- GPSS changes the format of front-end logs (messages written by commands to stdout) from CSV format to a more human-readable format. GPSS also adds a --csv-log option to the commands to write the front-end logs in CSV format, and a --color option to enable the use of color in message display.
- GPSS exposes a new load configuration property for Kafka data sources named IDLE_DURATION (version 2 configuration) and idle_duration_ms (version 3 configuration). Use this property to specify that GPSS use lazy load mode, waiting until data arrives before locking the target Greenplum Database table.
- GPSS exposes a new load configuration property for Kafka data sources named SCHEMA_PATH_ON_GPDB (version 2 configuration) and schema_path_on_gpdb (version 3 configuration). Use this property to specify the path to the Avro .avsc file that contains the schema of the Kafka key or value data (but not both). This file must reside in the same location on all Greenplum Database segment hosts.
- GPSS exposes a new load configuration property for Kafka data sources named FALLBACK_OFFSET (version 2 configuration) and fallback_offset (version 3 configuration). Use this property to specify whether and how GPSS automatically handles Kafka message offset mismatches.
- GPSS exposes new load configuration properties for Kafka data sources to support access to an SSL-secured schema registry. Refer to Accessing an SSL-Secured Schema Registry for more information.
- GPSS now supports acting as a high-level Kafka consumer when the Kafka client properties include a group.id setting.
- GPSS exposes a new load configuration property for Kafka data sources named CONSISTENCY (version 2 configuration) and consistency (version 3 configuration). Use this property to specify how GPSS manages Kafka message offsets when it acts as a high-level consumer. Refer to Understanding Kafka Message Offset Management for more information.
- GPSS 1.6.0 provides additional documentation about developing and using custom formatters with GPSS.
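As a rough illustration of how several of the new Kafka properties listed above might fit together, the following sketch shows a fragment of a version 2 load configuration file. The placement of each property within its block, the broker, topic, column, path, and group names, and the values shown are all assumptions made for illustration; refer to the load configuration file reference for the authoritative layout and permitted values.

```yaml
# Hypothetical fragment of a version 2 load configuration file.
# Connection settings and the OUTPUT block are omitted for brevity.
VERSION: 2
KAFKA:
   INPUT:
      SOURCE:
         BROKERS: localhost:9092        # assumed broker address
         TOPIC: orders                  # hypothetical topic name
         FALLBACK_OFFSET: earliest      # assumed value; governs offset mismatch handling
      VALUE:
         COLUMNS:
           - NAME: jdata                # hypothetical column name
             TYPE: json
         FORMAT: avro
         AVRO_OPTION:
            # Schema for the value data only (not both key and value).
            # The file must exist at this path on every segment host.
            SCHEMA_PATH_ON_GPDB: /home/gpadmin/schemas/orders.avsc
   COMMIT:
      MINIMAL_INTERVAL: 2000
      CONSISTENCY: at-least             # assumed value; applies to high-level consumer mode
      IDLE_DURATION: 5000               # assumed to be milliseconds; enables lazy load mode
   PROPERTIES:
      group.id: gpss_orders_group       # hypothetical group; enables high-level consumer mode
```

The version 3 format uses the lowercase property names given above (idle_duration_ms, schema_path_on_gpdb, fallback_offset, consistency) in its own block structure, which this sketch does not attempt to reproduce.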
Greenplum Streaming Server 1.6.0 includes these new Beta features:
- GPSS exposes a new load configuration property for Kafka data sources named RECOVER_FAILING_BATCH (version 2 configuration) and recover_failing_batch (version 3 configuration). Use this property in conjunction with SAVE_FAILING_BATCH to instruct GPSS to automatically reload the good data in the batch, and retain only the error data in the backup table.
Note: Enabling this feature may have severe performance implications when any data in the Kafka topic generates an expression error.
Note: This feature requires that GPSS has the Greenplum Database privileges to create a function.
- GPSS adds a new extension named dataflow. This extension includes a new data type, gp_jsonb (available for Greenplum Database version 6.x only), and a new formatter, text_in. You must run CREATE EXTENSION dataflow; in each database in which you use this data type or formatter. For additional information about the gp_jsonb data type, see About the JSON Format and Column Type.
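A minimal sketch of enabling the Beta RECOVER_FAILING_BATCH property together with SAVE_FAILING_BATCH in a version 2 load configuration file might look like the following. Placing both properties in the COMMIT block is an assumption here, as is the interval value; check the load configuration file reference before relying on this layout.

```yaml
# Hypothetical fragment; all other required blocks are omitted.
KAFKA:
   COMMIT:
      MINIMAL_INTERVAL: 2000          # illustrative value
      SAVE_FAILING_BATCH: true        # keep a failing batch in a backup table
      RECOVER_FAILING_BATCH: true     # Beta: reload good rows, retain only the error data
```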
Greenplum Streaming Server 1.6.0 resolves these issues:
- Resolves an issue where job progress information was available only via stdout. GPSS now supports consumer groups, which save message offsets to the Kafka topic.
- Resolves an issue where the GPSS Ubuntu download package was missing certain dependent libraries. These libraries are now marked as required.
- Resolves an issue where GPSS could not restart a job that had been stopped for a long period of time. GPSS now supports a FALLBACK_OFFSET load configuration property that instructs GPSS to automatically handle offset mismatches and specifies how to handle them.
- Resolves an issue where GPSS was unable to load data from Kafka when TLS-secured communication was required between the Kafka broker and the schema registry. GPSS now supports load configuration properties to specify the certificates and keys required for this communication.
- Resolves an issue where GPSS was unable to load Avro data when the schema was not embedded in the .avro file. GPSS now supports the SCHEMA_PATH_ON_GPDB load configuration property to specify the .avsc schema file.
- Resolves a request for a job timeout by supporting a new IDLE_DURATION load configuration property.
- 30723, 30711
- Resolves an issue where GPSS failed to load JSON-format data that included \u0000 by creating a new Greenplum Database data type named gp_jsonb (Beta).
Deprecated features may be removed in a future release of the Greenplum Streaming Server. GPSS 1.6.x deprecates:
- Specifying the gpss.json configuration file to the gpss command as a standalone argument (deprecated since 1.6.0). Use the -c | --config option when you specify the file.
- The gpkafka Version 1 configuration file format (deprecated since 1.4.0).
- The gpkafka.yaml (versions 1 and 2) POLL block, including the POLL:BATCHSIZE and POLL:TIMEOUT properties (deprecated since 1.3.5).
Known Issues and Limitations
Greenplum Streaming Server 1.6.x has these known issues:
- GPSS does not support specifying both the key schema and the value schema using the SCHEMA_PATH_ON_GPDB property; you can specify the schema for only one or the other.
- The SAVE_FAILING_BATCH and PARTITIONS configuration properties are not supported when you use the version 1 configuration file format to load data.
- The Greenplum Streaming Server may consume a very large amount of system memory when you use it to load a huge (hundreds of GBs) file, in some cases causing the Linux kernel to kill the GPSS server process. Do not use GPSS to load very large files; instead, use gpfdist.
- Due to limitations in the Greenplum Database external table framework, GPSS cannot log a data type conversion error that it encounters while evaluating a mapping expression. For example, if you use the expression EXPRESSION: (jdata->>'id')::int in your load configuration file, and the content of jdata->>'id' is a string that includes non-integer characters, the evaluation fails and GPSS terminates the load job.
GPSS cannot log and propagate the error back to the user via
Workarounds for Kafka:
- Set the SAVE_FAILING_BATCH load configuration property to true, and then manually load any data batch that included expression errors.
- Skip the bad Kafka message by specifying a --force-reset-xxx flag on the job start or load command.
- Correct the message and publish it to another Kafka topic before loading it into Greenplum Database.
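To make the mapping-expression limitation and the first workaround concrete, the following sketch shows a fragment of a version 2 load configuration file that uses the expression from the example above and sets SAVE_FAILING_BATCH. The table and column names are hypothetical, and the block placement of each property is an assumption rather than a confirmed layout.

```yaml
# Hypothetical fragment; the INPUT block and connection settings are omitted.
KAFKA:
   OUTPUT:
      TABLE: orders_target              # hypothetical target table
      MAPPING:
        - NAME: id                      # hypothetical target column
          # Fails at load time if jdata->>'id' holds non-integer characters;
          # GPSS cannot log this conversion error.
          EXPRESSION: (jdata->>'id')::int
   COMMIT:
      SAVE_FAILING_BATCH: true          # retain the failing batch for manual reload
```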