Pivotal Greenplum Streaming Server 1.4 Release Notes
This document contains pertinent release information about the Pivotal Greenplum Streaming Server version 1.4 release. The Greenplum Streaming Server (GPSS) is included in certain Pivotal Greenplum Database 5.x and 6.x distributions. GPSS for Red Hat Enterprise Linux/CentOS 6 and 7 is also updated and distributed independently of Greenplum Database. You may need to download and install the GPSS distribution from Pivotal Network to obtain the most recent version of this component.
Pivotal Greenplum Streaming Server 1.4.x is compatible with these Greenplum Database versions:
- Pivotal Greenplum Database 5.17.0 and later
- Pivotal Greenplum Database 6.0.0 and later
Release Date: November 2, 2020
Greenplum Streaming Server 1.4.2 resolves issues and includes related changes.
Greenplum Streaming Server 1.4.2 includes these changes:
- GPSS now specifies the SSL prefer mode on the control channel to the Greenplum Database master host. GPSS previously explicitly disabled SSL on the channel.
Greenplum Streaming Server 1.4.2 resolves these issues:
- Resolves an issue (30776, 174685715) where GPSS recorded an incorrect count in the progress log file when the messages it received included offset gaps, such as with transaction control messages.
- Resolves an issue where gpsscli stop would not respond (hang).
- Resolves an issue where GPSS failed to load a large (>2GB) file. GPSS now transfers a file in multiple, smaller chunks when loading to Greenplum.
- GPSS sent an HTTP request to the Avro schema registry service on every segment on every commit; in some cases, this created and destroyed a large number of TCP connections in the process. GPSS resolves this issue by reading the schema a single time per session (as long as the schema remains unchanged).
Release Date: August 7, 2020
Greenplum Streaming Server 1.4.1 resolves issues and includes related changes.
Greenplum Streaming Server 1.4.1 includes these changes:
- GPSS bundles a patched version of the librdkafka library to fix an issue that can arise when the Kafka topic that GPSS loads includes messages with discontinuous offsets. See resolved issues 30797 and 30776.
- GPSS now always tracks Kafka job progress in a separate, CSV-format log file. See resolved issue 173603095 and Checking the Progress of a Load Operation.
- GPSS 1.4.1 changes the format and content of the server and client log file messages. The old log file format was delimited text, which could not be parsed when the text contained a newline. The log files are now CSV-format and include a header row. See resolved issue 173603029 and Examining GPSS Log Files.
Greenplum Streaming Server 1.4.1 resolves these issues:
- When the schema registry service was down, GPSS appeared to hang during a Kafka load operation because it tried to access the registry multiple times for each Kafka message. This issue is resolved; GPSS now reports an error and stops retrying immediately when it detects that the schema registry is down.
- Due to a bug in the dependent library librdkafka (30797, 30776), a load job from Kafka would hang when there were aborted Kafka transactions in the topic, or when the messages were deleted before GPSS was able to consume them. This issue is resolved. GPSS 1.4.1 bundles a patched version of the librdkafka library and can now handle message offsets that are not continuous.
- Certain merge/update operations failed with the error Cannot parallelize an UPDATE statement that updates the distribution columns because GPSS versions 1.3.5 through 1.4.0 used the Greenplum Postgres Planner by default, which does not support updating columns that are specified as the distribution key. GPSS 1.4.1 resolves this issue by not explicitly specifying a query planner/optimizer, but rather using the default that is configured in the Greenplum cluster.
- In some cases, gpsscli stop would hang when you invoked it to stop a Kafka load job that GPSS had previously retried. This issue is resolved.
- The GPSS utilities distributed in the Greenplum Database 6.8.x and 6.9.0 Client and Loader Tools packages were missing the dependent library libserdes.so. This issue is resolved; the packages now include this library.
- The GPSS 1.4.1 Batch Data gRPC API fixes a parallel loading regression that manifested itself when the gpss.json server configuration file included the (default) ReuseTables: true property setting.
- Because GPSS tracked job progress only during gpsscli progress command execution, the progress information for jobs for which you did not run the command was lost. This issue is resolved. GPSS now always tracks job progress in a separate, CSV-format log file (with header row) named progress_jobname_jobid_date.log.
- GPSS log file messages with embedded newlines could not be parsed. This issue is resolved; GPSS changes the client and server log file format to CSV (with header row).
Release Date: June 26, 2020
Greenplum Streaming Server 1.4.0 adds new features, includes changes, and resolves issues.
New and Changed Features
Greenplum Streaming Server 1.4.0 includes these new and changed features:
- GPSS supports loading from a file data source. You can now load data in Avro, binary, CSV, and JSON files into Greenplum Database. See Loading File Data into Greenplum for more information.
- GPSS defines a new META load configuration property block. You can load the properties in this single JSON-format column into the target table, or use the properties in update or merge criteria for a load operation. The available META properties are data-source specific:
- The Kafka data source exposes the following META properties: topic (text), partition (int), and offset (bigint).
- The file data source exposes a single META property named filename (text).
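The new block can be sketched as follows in a Kafka load configuration file. This is an illustrative fragment only; the topic, column, and table names (`orders`, `meta`, `orders_t`) are assumptions, not taken from this document:

```yaml
KAFKA:
  INPUT:
    SOURCE:
      SERVER: localhost:9092
      TOPIC: orders          # illustrative topic name
    VALUE:
      COLUMNS:
        - NAME: jdata
          TYPE: json
      FORMAT: json
    META:
      COLUMNS:
        - NAME: meta         # single JSON-format column holding the META properties
          TYPE: json
      FORMAT: json
  OUTPUT:
    TABLE: orders_t          # illustrative target table
    MAPPING:
      - NAME: kpartition
        EXPRESSION: (meta->>'partition')::int
      - NAME: koffset
        EXPRESSION: (meta->>'offset')::bigint
```

A mapping or merge expression can then reference the META column the same way it references value columns, for example `(meta->>'topic')` for the Kafka topic name.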
- GPSS supports Avro data containing binary fields.
- GPSS implements a faster update in merge mode for large datasets when the load configuration specifies no UPDATE_COLUMNS. In this scenario, GPSS updates all MAPPING columns in each row.
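As a sketch of the merge scenario described above, a load configuration OUTPUT block with no UPDATE_COLUMNS might look like the following (table and column names are illustrative assumptions):

```yaml
  OUTPUT:
    TABLE: target_t          # illustrative target table
    MODE: MERGE
    MATCH_COLUMNS:
      - id
    # No UPDATE_COLUMNS specified: on a match, GPSS updates
    # every column listed in MAPPING, using the faster code path.
    MAPPING:
      - NAME: id
        EXPRESSION: (jdata->>'id')::int
      - NAME: payload
        EXPRESSION: jdata
```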
- You can use GPSS to load data into a Greenplum Database cluster that utilizes the PgBouncer connection pooler.
- The CentOS 7.x GPSS packages for Greenplum 6 support Oracle Enterprise Linux 7.
- GPSS uses a single thread and socket per partition by sharing a Kafka consumer between workers.
- GPSS bundles librdkafka version 1.4.2. This version provides support for controlling how GPSS reads Kafka messages written transactionally via the isolation.level property.
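For example, the librdkafka `isolation.level` consumer property could be set so that GPSS reads only committed transactional messages. The placement of the PROPERTIES block below is an assumption about the configuration file layout, not confirmed by this document:

```yaml
KAFKA:
  PROPERTIES:
    # read_committed skips messages from aborted Kafka transactions;
    # read_uncommitted is the librdkafka alternative
    isolation.level: read_committed
```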
- GPSS 1.4 introduces the new Streaming Job API (Beta), a gRPC API that allows you to manage and submit streaming jobs to the server.
Greenplum Streaming Server 1.4.0 resolves these issues:
- The GPSS Batch Data gRPC API fixes inaccurate TransferStats success and error counts for data load operations initiated in update mode.
Deprecated features may be removed in a future minor release of the Greenplum Streaming Server. GPSS 1.4.x deprecates:
- The gpkafka Version 1 configuration file format (deprecated since 1.4.0).
- The gpkafka.yaml (versions 1 and 2) POLL block, including the POLL:BATCHSIZE and POLL:TIMEOUT properties (deprecated since 1.3.5).
GPSS 1.4.x removes these previously deprecated features:
- The gpsscli history and gpkafka history commands (deprecated in 1.3.5).
Greenplum Streaming Server 1.4.x has these known issues:
- The Greenplum Streaming Server may consume a very large amount of system memory when you use it to load a huge (hundreds of GBs) file, in some cases causing the Linux kernel to kill the GPSS server process. Do not use GPSS to load very large files; instead, use gpfdist.
- Due to limitations in the Greenplum Database external table framework, GPSS cannot log a data type conversion error that it encounters while evaluating a mapping expression. For example, if you use the expression EXPRESSION: (jdata->>'id')::int in your load configuration file, and the content of jdata->>'id' is a string that includes non-integer characters, the evaluation fails and GPSS terminates the load job. GPSS cannot log the error or propagate it back to the user.
Workarounds for Kafka: Skip the bad Kafka message by specifying a --force-reset-xxx flag on the job start or load command, or correct the message and publish it to another Kafka topic before loading it into Greenplum Database.