Configuring and Managing the Streaming Server

The Greenplum Streaming Server (GPSS) manages communication and data transfer between a client (for example, the Pivotal Greenplum-Informatica Connector) and Greenplum Database. You must configure and start a GPSS instance before you use the service to load data into Greenplum Database.

Topics in this section include:

  • Prerequisites
  • Registering the GPSS Extension
  • Configuring the Greenplum Streaming Server
  • Running the Greenplum Streaming Server
  • Managing GPSS Log Files
  • Shadowing the Greenplum Database Password

Prerequisites

The Greenplum Streaming Server gpss and gpsscli command line utilities are automatically installed with Greenplum Database version 5.16 and later.

Before you start a GPSS server instance, ensure that you:

  • Install and start a compatible Greenplum Database version.
  • Can identify the hostname of your Greenplum Database master host.
  • Can identify the port on which your Greenplum Database master server process is running, if it is not running on the default port (5432).
  • Select one or more GPSS host machines that have connectivity to:
    • The GPSS client host systems.
    • The Greenplum Database master and all segment hosts.

If you are using the gpsscli client utility, ensure that you run the command on a host that has connectivity to:

  • The client data source host systems. For example, for a Kafka data source, you must have connectivity to each broker host in the Kafka cluster.
  • The Greenplum Database master and all segment hosts.
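
For example, one quick way to verify connectivity from a prospective GPSS or client host is to test the relevant ports directly. This is a minimal sketch; the host names gpmaster and kafkahost, and the ports shown, are placeholders for your environment:

# Verify that the Greenplum Database master accepts connections (assumes psql is installed on this host)
$ psql -h gpmaster -p 5432 -d postgres -c 'SELECT version();'
# Verify that a Kafka broker port is reachable from this host
$ nc -vz kafkahost 9092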

Registering the GPSS Extension

The Greenplum Database and the Greenplum Streaming Server download packages install the GPSS extension. This extension must be registered in each database in which Greenplum users use GPSS to write data to Greenplum tables.

GPSS automatically registers its extension in a database the first time a Greenplum superuser or the database owner initiates a load job. You must manually register the extension in a database if non-privileged Greenplum users will be the first or only users of GPSS in that database.

Perform the following procedure as a Greenplum Database superuser or the database owner to manually register the GPSS extension:

  1. Open a new terminal window, log in to the Greenplum Database master host as the gpadmin administrative user, and set up the Greenplum environment. For example:
    $ ssh gpadmin@gpmaster
    gpadmin@gpmaster$ . /usr/local/greenplum-db/greenplum_path.sh
  2. Start the psql subsystem, connecting to a database in which you want to register the GPSS formatter function. For example:
    gpmaster$ psql -d testdb
  3. Enter the following command to register the extension:
    testdb=# CREATE EXTENSION gpss;
  4. Perform steps 2 and 3 for each database in which the Greenplum Streaming Server will write client data.
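
If you want to confirm that the extension is registered in a given database, one way (assuming the example testdb database above) is to list it with the psql \dx meta-command:

# List the gpss extension; an empty result means it is not yet registered in this database
gpmaster$ psql -d testdb -c '\dx gpss'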

Configuring the Greenplum Streaming Server

You configure an invocation of the Greenplum Streaming Server via a JSON-formatted configuration file. This configuration file includes properties that identify the listen address of the GPSS service as well as the gpfdist service host, bind address, and port number. You can also specify encryption options, configure a password shadow encode/decode key, and configure whether GPSS reuses external tables.

The contents of a sample GPSS JSON configuration file named gpsscfg1.json follow:

{
    "ListenAddress": {
        "Host": "",
        "Port": 5019
    },
    "Gpfdist": {
        "Host": "",
        "Port": 8319,
        "ReuseTables": false,
        "BindAddress": "127.0.0.1"
    },
    "Shadow": {
        "Key": "a_very_secret_key"
    }
}

Refer to the gpss.json reference page for detailed information about the GPSS configuration file format and the configuration properties that the utility supports.

Note: If your Kafka or Greenplum Database clusters are using Kerberos authentication or SSL encryption, see Configuring the Streaming Server for Encryption and Authentication.

Running the Greenplum Streaming Server

You use the gpss utility to start an instance of the Greenplum Streaming Server on the local host. When you run the command, you provide the name of the configuration file that defines the properties of the GPSS and gpfdist service instances. You can also specify the name of a directory to which gpss writes server and progress log files. For example, to start a GPSS instance specifying a log directory named gpsslogs relative to the current working directory:

$ gpss gpsscfg1.json --log-dir ./gpsslogs

The default mode of operation for gpss is to wait for, and then consume, job requests and data from a client. When run in this mode, gpss waits indefinitely. You can interrupt and exit the command with Control-c. You may also choose to run gpss in the background (&). In both cases, gpss writes server log and status messages to stdout.
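
For example, a sketch of starting the instance in the background and capturing its stdout messages in a file (the output file name gpss.out is arbitrary, and port 5019 is taken from the example configuration above):

# Start GPSS in the background; server log and status messages go to gpss.out
$ nohup gpss gpsscfg1.json --log-dir ./gpsslogs > gpss.out 2>&1 &
# Confirm that the instance is listening on the configured port
$ nc -vz localhost 5019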

Note: gpss keeps track of the loading progress of client jobs in memory. When you stop a GPSS server instance, you lose all registered jobs. You must re-submit any previously-submitted jobs that you require after you restart the GPSS instance. gpss will resume a job from the last load offset.

Refer to the gpss reference page for additional information about this command.

Managing GPSS Log Files

If you specify the -l or --log-dir option when you start gpss or run a gpsscli subcommand, GPSS writes log messages to a file in the directory that you specify. If you do not provide this option, GPSS writes log messages to a file in the $HOME/gpAdminLogs directory.
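
For example, assuming the GPSS instance from the earlier example is listening on port 5019, you might direct a gpsscli subcommand's log messages to the same directory that the server instance uses (the directory name is illustrative):

# List registered jobs, writing the client log to ./gpsslogs
$ gpsscli list --gpss-port 5019 --log-dir ./gpsslogs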

GPSS writes server log messages to a file with the following naming format, where date identifies the date that the log file was created. This date reflects the date that you started the gpss server instance, or the date that the log was rotated for that server instance (see Rotating the GPSS Server Log File below):

gpss_date.log

GPSS writes client log messages to a file with the following naming format, where date identifies the date that you ran the command:

gpsscli_date.log

Starting in version 1.4.1, GPSS writes progress messages for each Kafka job to a separate file in the server log directory. Progress logs are written to a file with this naming format:

progress_jobname_jobid_date.log

jobname and jobid (max 8 characters each) identify the name and the identifier of the GPSS job, and date identifies the date that you ran the command.

Example GPSS log file names:
  • gpss_20181228.log
  • gpsscli_20181228.log
  • progress_jobk2_d577cf37_20200803.log

After GPSS creates a log file of a given type, it appends all subsequent log messages written on that date to the same file.

Rotating the GPSS Server Log File

If the log file for a gpss server instance grows too large, you may choose to archive the current log and start fresh with an empty log file.

To prompt GPSS to rotate the server log file, you must:

  1. Rename the existing log file. For example:
    gpadmin@gpmaster$ mv logdir/gpss_date.log logdir/gpss_date.log.1
  2. Send the SIGUSR2 signal to the gpss server process. You can obtain the process id of a GPSS instance by running the ps command. For example:
    gpadmin@gpmaster$ ps -ef | grep gpss
    gpadmin@gpmaster$ kill -SIGUSR2 gpss_pid
    Note: There may be more than one gpss server process running on the system. Be sure to send the signal to the desired process; a sketch for targeting a specific instance follows this procedure.

    When it receives the signal, GPSS emits a log message that identifies the time at which it reset the log file. For example:

    ... -[INFO]:-Set gpss log file rotate at 20190911:20:59:36.093
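
If several gpss server instances are running, one way to target a specific instance is to match its command line, for example by the configuration file it was started with. This is a sketch that assumes the gpsscfg1.json example above:

# Identify the pid of the instance started with gpsscfg1.json
$ pgrep -f 'gpss gpsscfg1.json'
# Send the rotate signal only to that instance
$ pkill -SIGUSR2 -f 'gpss gpsscfg1.json'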

Integrating with logrotate

You can configure and manage GPSS server log file rotation with the Linux logrotate utility.

This sample logrotate configuration rotates and compresses the log file of each gpss server instance running on the system weekly or when the file reaches 10MB in size. It operates on log files that are written to the default location:

/home/gpadmin/gpAdminLogs/gpss_*.log {
    rotate 5
    weekly
    size 10M
    postrotate
        pkill -SIGUSR2 gpss
    endscript
    compress
}

If this configuration is specified in a file named gpss_rotate.conf residing in the current working directory, you integrate with the Linux logrotate system with the following command:

$ logrotate -s status gpss_rotate.conf

You may choose to create a cron job to run this command daily.
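
For example, a crontab entry along these lines runs the rotation check every day at 2:00 AM; the paths to the state file, the configuration file, and the logrotate binary are illustrative and depend on your system:

# minute hour day-of-month month day-of-week  command
0 2 * * * /usr/sbin/logrotate -s /home/gpadmin/gpss_logrotate.status /home/gpadmin/gpss_rotate.conf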

Shadowing the Greenplum Database Password

When you use GPSS to load data into Greenplum Database, you specify the Greenplum user/role password in the PASSWORD: property setting of a YAML-format load configuration file; see gpsscli.yaml.

You specify the Greenplum password in clear text. If your security requirements do not permit this, you can configure GPSS to encode and decode a shadow password string that the GPSS client and server use when communicating the Greenplum password.

Note: GPSS supports shadowing the Greenplum password only on load jobs that you submit and manage with the gpsscli subcommands. GPSS does not support shadowed passwords on load jobs that you submit with gpkafka load.

When you use this GPSS feature:

  1. (Optional) You configure a Shadow:Key in the gpss.json configuration file that you specify when you start the GPSS instance. For example:
    ...
        },
        "Shadow": {
            "Key": "a_very_secret_key"
        }
    ...
  2. You run the gpsscli shadow command on the ETL system to interactively generate the shadowed password. For example:
    $ gpsscli shadow --config gpss.json
    please input your password
    changemeCHANGEMEchangeme
    "SHADOW:ERTBKXDWLAJHUF5UOGJY34QTXIBNYP4ULTWVHIUZIF4UYFPRIJVA"
    You can automate this step using a command similar to the following:
    $ echo changemeCHANGEMEchangeme | gpsscli shadow --config gpss.json | tail -1
    "SHADOW:ERTBKXDWLAJHUF5UOGJY34QTXIBNYP4ULTWVHIUZIF4UYFPRIJVA"

    If you do not specify the --config gpss.json option, or this configuration file does not include a Shadow:Key setting, GPSS uses its default key to generate the shadow password string.

  3. You specify the shadow password string returned by gpsscli shadow in the PASSWORD: property setting of a gpsscli.yaml load configuration file. For example:
    DATABASE: testdb
    USER: testuser
    PASSWORD: "SHADOW:ERTBKXDWLAJHUF5UOGJY34QTXIBNYP4ULTWVHIUZIF4UYFPRIJVA"
    ...

    Always quote the complete shadow password string.

  4. You provide the load configuration file as an option to gpsscli submit or gpsscli load when you submit the job.
  5. The GPSS instance servicing the job uses its Shadow:Key, or the default key, to decode the shadowed password string specified in PASSWORD:, and connects with Greenplum Database.
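
Putting the pieces together, a scripted sketch of this workflow might look like the following. The loadcfg.yaml.template file and its __SHADOW__ placeholder (standing in for the value of the PASSWORD: property) are hypothetical, and gpss.json is the server configuration file that contains the Shadow:Key:

# Generate the shadow string non-interactively; the captured line already includes the required quotes
$ SHADOWPW=$(echo changemeCHANGEMEchangeme | gpsscli shadow --config gpss.json | tail -1)
# Substitute the shadow string for the placeholder in the template load configuration
$ sed "s|__SHADOW__|${SHADOWPW}|" loadcfg.yaml.template > loadcfg.yaml
# Submit the job with the generated configuration file
$ gpsscli submit loadcfg.yaml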