Pivotal Greenplum Streaming Server 1.4 Release Notes

This document contains release information for Pivotal Greenplum Streaming Server version 1.4. The Greenplum Streaming Server (GPSS) is included in certain Pivotal Greenplum Database 5.x and 6.x distributions. GPSS for Red Hat/CentOS 6 and 7 is also updated and distributed independently of Greenplum Database. You may need to download and install the GPSS distribution from Pivotal Network to obtain the most recent version of this component.

Supported Platforms

Pivotal Greenplum Streaming Server 1.4.x is compatible with these Greenplum Database versions:

  • Pivotal Greenplum Database 5.17.0 and later
  • Pivotal Greenplum Database 6.0.0 and later

Release 1.4.2

Release Date: November 2, 2020

Greenplum Streaming Server 1.4.2 resolves issues and includes related changes.

Note: You may be required to perform upgrade actions for this release. Review Upgrading the Streaming Server to plan your upgrade to GPSS 1.4.2.

Changes

Greenplum Streaming Server 1.4.2 includes these changes:

  • GPSS now specifies the SSL prefer mode on the control channel to the Greenplum Database master host. GPSS previously explicitly disabled SSL on the channel.

Resolved Issues

Greenplum Streaming Server 1.4.2 resolves these issues:

  • n/a: Resolves an issue where GPSS recorded an incorrect count in the progress log file when the messages it received included offset gaps, such as with transaction control messages.
  • 30776, 174685715: Resolves an issue where gpsscli stop would not respond (hang).
  • 174685711: Resolves an issue where GPSS failed to load a large (>2GB) file. GPSS now transfers a file in multiple, smaller chunks when loading to Greenplum.
  • 174984151: GPSS sent an HTTP request to the Avro schema registry service on every segment on every commit; in some cases, this created and destroyed a large number of TCP connections in the process. GPSS resolves this issue by reading the schema a single time per session (as long as the schema remains unchanged).
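
To illustrate why offset gaps can skew a message count, the following minimal sketch compares a count derived from the offset range with a count of the messages actually received. It is only an illustration with made-up data; it does not reflect how GPSS computes progress internally, and the function names are hypothetical.

```python
def count_from_offset_range(offsets):
    """Naive count: assumes every offset between first and last holds a message."""
    if not offsets:
        return 0
    return offsets[-1] - offsets[0] + 1


def count_received(offsets):
    """Actual count: the number of messages that were delivered."""
    return len(offsets)


# Hypothetical partition: offsets 3 and 6 are occupied by transaction control
# messages, which are not delivered to the consumer.
received = [0, 1, 2, 4, 5, 7]

print(count_from_offset_range(received))  # 8
print(count_received(received))           # 6
```

The two results differ by exactly the number of skipped offsets, which is the kind of discrepancy that a range-based count would introduce.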

Release 1.4.1

Release Date: August 7, 2020

Greenplum Streaming Server 1.4.1 resolves issues and includes related changes.

Note: You may be required to perform upgrade actions for this release. Review Upgrading the Streaming Server to plan your upgrade to GPSS 1.4.1.

Changes

Greenplum Streaming Server 1.4.1 includes these changes:

  • GPSS bundles a patched version of the librdkafka library to fix an issue that can arise when the Kafka topic that GPSS loads includes messages with discontinuous offsets. See resolved issue 30797, 30776.
  • GPSS now always tracks Kafka job progress in a separate, CSV-format log file. See resolved issue 173603095 and Checking the Progress of a Load Operation.
  • GPSS 1.4.1 changes the format and content of the server and client log file messages. The old log file format was delimited text, which could not be parsed when the text contained a newline. The log files are now CSV-format and include a header row. See resolved issue 173603029 and Examining GPSS Log Files.

Resolved Issues

Greenplum Streaming Server 1.4.1 resolves these issues:

  • n/a: When the schema registry service was down, GPSS appeared to hang during a Kafka load operation because it tried to access the registry multiple times for each Kafka message. This issue is resolved; GPSS now reports an error and stops retrying immediately when it detects that the schema registry is down.
  • 30797, 30776: Due to a bug in the dependent library librdkafka, a load job from Kafka would hang when there were aborted Kafka transactions in the topic, or when the messages were deleted before GPSS was able to consume them. This issue is resolved. GPSS 1.4.1 bundles a patched version of the librdkafka library and can now handle message offsets that are not continuous.
  • 30760: Certain merge/update operations failed with the error "Cannot parallelize an UPDATE statement that updates the distribution columns" because GPSS versions 1.3.5 through 1.4.0 used the Greenplum Postgres Planner by default, which does not support updating columns that are specified as the distribution key. GPSS 1.4.1 resolves this issue by not explicitly specifying a query planner/optimizer, but rather using the default that is configured in the Greenplum cluster.
  • 173653147: In some cases, gpsscli stop would hang when you invoked it to stop a Kafka load job that GPSS had previously retried. This issue is resolved.
  • 173637940: The GPSS utilities distributed in the Greenplum Database 6.8.x and 6.9.0 Client and Loader Tools packages were missing the dependent library libserdes.so. This issue is resolved; the package now includes this library.
  • 173637900: The GPSS 1.4.1 Batch Data gRPC API fixes a parallel loading regression that manifested itself when the gpss.json server configuration file included the (default) ReuseTables: true property setting.
  • 173603095: Because GPSS tracked job progress only during gpsscli progress command execution, the progress information for jobs for which you did not run the command was lost. This issue is resolved. GPSS now always tracks job progress in a separate, CSV-format log file (with header row) named progress_jobname_jobid_date.log.
  • 173603029: GPSS log file messages with embedded newlines could not be parsed. This issue is resolved; GPSS changes the client and server log file format to CSV (with header row).
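
The following minimal sketch shows why a CSV log with a header row remains parseable when a message contains an embedded newline. The column names and log lines are hypothetical and chosen only for illustration; consult the actual header row of your GPSS log files for the real field names.

```python
import csv
import io

# Hypothetical log content; the real GPSS column names may differ.
sample_log = (
    "time,level,message\n"
    '2020-08-07 10:00:00,INFO,"job started"\n'
    '2020-08-07 10:00:01,ERROR,"load failed:\nconnection reset"\n'
)


def read_log_records(stream):
    """Parse a CSV log with a header row into a list of dicts keyed by column."""
    return list(csv.DictReader(stream))


records = read_log_records(io.StringIO(sample_log))
print(len(records))                   # 2 records, although the text spans 4 lines
print(repr(records[1]["message"]))    # the newline stays inside one field
```

When reading a real log file rather than an in-memory string, open it with `open(path, newline="")` so that the `csv` module handles quoted newlines itself.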

Release 1.4.0

Release Date: June 26, 2020

Greenplum Streaming Server 1.4.0 adds new features, includes changes, and resolves issues.

Note: You may be required to perform upgrade actions for this release. Review Upgrading the Streaming Server to plan your upgrade to GPSS 1.4.0.

New and Changed Features

Greenplum Streaming Server 1.4.0 includes these new and changed features:

  • GPSS supports loading from a file data source. You can now load data in Avro, binary, CSV, and JSON files into Greenplum Database. See Loading File Data into Greenplum for more information.
  • GPSS defines a new META load configuration property block. You can load the properties in this single JSON-format column into the target table, or use the properties in update or merge criteria for a load operation. The available META properties are data-source specific:
    • The Kafka data source exposes the following META properties: topic (text), partition (int), and offset (bigint).
    • The file data source exposes a single META property named filename (text).
  • GPSS supports Avro data containing binary fields.
  • GPSS implements a faster update in merge mode for large datasets when the load configuration specifies no UPDATE_COLUMNS. In this scenario, GPSS updates all MAPPING columns in each row.
  • You can use GPSS to load data into a Greenplum Database cluster that utilizes the PgBouncer connection pooler.
  • The CentOS 7.x GPSS packages for Greenplum 6 support Oracle Enterprise Linux 7.
  • GPSS uses a single thread and socket per partition by sharing a Kafka consumer between workers.
  • GPSS bundles librdkafka version 1.4.2. This version provides support for controlling how GPSS reads Kafka messages written transactionally via the isolation.level property.
  • GPSS 1.4 introduces the new Streaming Job API (Beta), a gRPC API that allows you to manage and submit streaming jobs to the server.
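
As a rough illustration of the META properties listed above, the sketch below builds the kind of JSON value that the single META column might hold for each data source. The property names and types come from the list above; the sample values and helper names are hypothetical, and the exact serialization that GPSS produces may differ.

```python
import json


def kafka_meta(topic: str, partition: int, offset: int) -> str:
    """JSON text with the three Kafka META properties."""
    return json.dumps({"topic": topic, "partition": partition, "offset": offset})


def file_meta(filename: str) -> str:
    """JSON text with the single file META property."""
    return json.dumps({"filename": filename})


# Hypothetical sample values.
print(kafka_meta("orders", 0, 42))
print(file_meta("/tmp/data.csv"))
```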

Resolved Issues

Greenplum Streaming Server 1.4.0 resolves these issues:

  • 172142789: The GPSS Batch Data gRPC API fixes inaccurate TransferStats success and error counts for data load operations initiated in update mode.

Deprecated Features

Deprecated features may be removed in a future minor release of the Greenplum Streaming Server. GPSS 1.4.x deprecates:

  • The gpkafka Version 1 configuration file format (deprecated since 1.4.0).
  • The gpkafka.yaml (versions 1 and 2) POLL block, including the POLL:BATCHSIZE and POLL:TIMEOUT properties (deprecated since 1.3.5).

Removed Features

GPSS 1.4.x removes these previously deprecated features:

  • The gpsscli history and gpkafka history commands (deprecated in 1.3.5).

Known Issues

Greenplum Streaming Server 1.4.x has these known issues:

  • N/A: The Greenplum Streaming Server may consume a very large amount of system memory when you use it to load a huge (hundreds of GBs) file, in some cases causing the Linux kernel to kill the GPSS server process. Do not use GPSS to load very large files; instead, use gpfdist.
  • 30503: Due to limitations in the Greenplum Database external table framework, GPSS cannot log a data type conversion error that it encounters while evaluating a mapping expression. For example, if you use the expression EXPRESSION: (jdata->>'id')::int in your load configuration file, and the content of jdata->>'id' is a string that includes non-integer characters, the evaluation fails and GPSS terminates the load job. GPSS cannot log and propagate the error back to the user via gp_read_error_log().

Workarounds for Kafka: Skip the bad Kafka message by specifying a --force-reset-xxx flag on the job start or load command, or correct the message and publish it to another Kafka topic before loading it into Greenplum Database.
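
One way to apply the second workaround is to check messages before they are published for loading. The sketch below is a hypothetical pre-publish check, not part of GPSS: it reports whether the "id" field of a JSON message looks castable to an integer, which is the condition that the (jdata->>'id')::int expression depends on. It only approximates the database cast; for example, it does not check the integer range.

```python
import json
import re

# Optional sign, ASCII digits, optional surrounding whitespace.
_INT_PATTERN = re.compile(r"\s*[+-]?[0-9]+\s*")


def id_looks_like_int(raw_message: str) -> bool:
    """Return True if the message is a JSON object whose "id" resembles an integer."""
    try:
        doc = json.loads(raw_message)
    except ValueError:
        return False  # not valid JSON at all
    if not isinstance(doc, dict) or "id" not in doc:
        return False
    value = doc["id"]
    if isinstance(value, bool):
        return False  # JSON true/false is not an integer
    if isinstance(value, int):
        return True
    return isinstance(value, str) and _INT_PATTERN.fullmatch(value) is not None


print(id_looks_like_int('{"id": "123"}'))  # True
print(id_looks_like_int('{"id": "12a"}'))  # False
```

Messages that fail the check can be corrected or routed elsewhere instead of terminating the load job.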