Greenplum MADlib Extension for Analytics

Greenplum MADlib Extension for Analytics

This chapter includes the following information:

About MADlib

MADlib is an open-source library for scalable in-database analytics. With the Greenplum Database MADlib extension, you can use MADlib functionality in a Greenplum Database.

MADlib provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data. It provides an suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for transferring data between Greenplum Database and other tools.

MADlib requires the m4 macro processor version 1.4.13 or later.

MADlib can be used with PivotalR, an R package that enables users to interact with data resident in Greenplum Database using the R client. See About MADlib, R, and PivotalR.

Installing MADlib

For Pivotal Greenplum Database, the MADlib extension is available as a package. Download the package from Pivotal Network and then install it with the Greenplum Package Manager (gppkg).

To install MADlib on Greenplum Database, you install the Greenplum MADlib package on Greenplum Database and then install the MADlib function libraries on the databases that use MADlib. For the versions of the MADlib extension supported by your version of Greenplum Database, see the Greenplum Database Release Notes.

The gppkg utility installs Greenplum Database extensions, along with any dependencies, on all hosts across a cluster. It also automatically installs extensions on new hosts in the case of system expansion segment recovery.

Note: On Greenplum Database or later, install MADlib version 1.10 or later.

If you install or upgrade to MADlib 1.9.1 on Greenplum Database or later, you must run the MADlib script that fixes the madpack MADlib utility to work with Greenplum Database or later. MADlib versions 1.10 and later do not require that you run the script. You can provide the path to the MADlib installation with the --prefix option.

$ --prefix madlib-installation-path

If you do not include the --prefix option, the script uses the location $GPHOME/madlib.

For information about gppkg, see the Greenplum Database Utility Guide.

Installing the Greenplum Database MADlib Package

Before you install the MADlib package, make sure that your Greenplum database is running, you have sourced, and that the $MASTER_DATA_DIRECTORY and $GPHOME variables are set.

  1. Download the MADlib extension package from Pivotal Network.
  2. Copy the MADlib package to the Greenplum Database master host.
  3. Unpack the MADlib distribution package. For example:
    $ tar xzvf madlib-1.16-gp4.3orca-rhel5-x86_64.tar.gz
  4. Install the software package by running the gppkg command. For example:
    $ gppkg -i ./madlib-1.16-gp4.3orca-rhel5-x86_64/madlib-1.16-gp4.3orca-rhel5-x86_64.gppkg

Adding MADlib Functions to a Database

After installing the MADlib package, run the madpack command to add MADlib functions to Greenplum Database. madpack is in $GPHOME/madlib/bin.

$ madpack [-s schema_name] -p greenplum -c user@host:port/database install

For example, this command creates MADlib functions in the Greenplum database testdb running on server mdw on port 5432. The madpack command logs in as the user gpadmin and prompts for password. The target schema is madlib.

$ madpack -s madlib -p greenplum -c gpadmin@mdw:5432/testdb install

After installing the functions, The Greenplum Database gpadmin superuser role should grant all privileges on the target schema (in the example madlib) to users who will be accessing MADlib functions. Users without access to the functions will get the error ERROR: permission denied for schema MADlib.

The madpack install-check option runs test using Madlib modules to check the MADlib installation:

$ madpack -s madlib -p greenplum -c gpadmin@mdw:5432/testdb install-check
Note: The command madpack -h displays information for the utility.

Upgrading MADlib

You upgrade an installed MADlib package with the Greenplum Database gppkg utility and the MADlib madpack command.
Note: On Greenplum Database or later, upgrade to MADlib 1.10 or 1.11.

If you upgrade to MADlib 1.9.1, you must run execute the MADlib script. MADlib 1.10 and later do not require the script. See the Note in "Installing MADlib."

For information about the upgrade paths that MADlib supports, see the MADlib support and upgrade matrix in the MADlib FAQ page.

Upgrading a MADlib Package

To upgrade MADlib, run the gppkg utility with the -u option. This command upgrades an installed MADlib package to MADlib 1.16.
gppkg -u madlib-1.16-gp4.3orca-rhel5-x86_64.gppkg
Note: Upgrading from MADlib version 1.15 using gppkg -U does not work because of a change in the post-uninstall script that was introduced in version 1.15.1. If you are upgrading from MADlib version 1.15 to version 1.15.1 or newer, follow these steps:
  1. Remove the existing MADlib version 1.15 rpm (this does not affect the Greenplum Database installation):
    gppkg -r madlib-1.15.0
  2. Manually remove the remaining MADlib files:
    rm -rf /usr/local/greenplum-db-
  3. Install the newer MADlib version. For example:
    $ gppkg -i ./madlib-1.16-gp4.3orca-rhel5-x86_64.gppkg
  4. Upgrade the MADlib functions in each database. For example:
    madpack -p greenplum -c gpadmin@mdw:5432/testdb upgrade

Upgrading MADlib Functions

After you upgrade the MADlib package, you run the madpack command to upgrade the MADlib functions in Greenplum Database.

Note: The upgrade to MADlib 1.13 has a minor issue with some leftover knn functions.

Before running the madpack command to upgrade to MADlib 1.13, run these psql commands as the gpadmin user to drop the leftover functions from the databases where MADlib is installed.

psql db_name -c "DROP FUNCTION IF EXISTS schema.knn(VARCHAR);"
psql db_name -c "DROP FUNCTION IF EXISTS schema.knn();"

db_name is the name of the database. schema is the name of the MADlib schema.

See the MADlib Installation Guide.

The madpack upgrade command upgrades the MADlib functions in the database schema. This example command upgrades the MADlib functions in the schema madlib of the Greenplum Database test.

madpack -s madlib -p greenplum -c gpadmin@mdw:5432/testdb upgrade

Uninstalling MADlib

When you remove MADlib support from a database, routines that you created in the database that use MADlib functionality will no longer work.

Remove MADlib objects from the database

Use the madpack uninstall command to remove MADlib objects from a Greenplum database. For example, this command removes MADlib objects from the database testdb.

$ madpack  -s madlib -p greenplum -c gpadmin@mdw:5432/testdb uninstall

Uninstall the Greenplum Database MADlib Package

If no databases use the MADlib functions, use the Greenplum gppkg utility with the -r option to uninstall the MADlib package. When removing the package you must specify the package and version. This example uninstalls MADlib package version 1.9.

$ gppkg -r madlib-ossv1.9_pv1.9.5_gpdb4.3orca

You can run the gppkg utility with the options -q --all to list the installed extensions and their versions.

After you uninstall the package, restart the database.

$ gpstop -r


This example demonstrates the association rules data mining technique on a transactional data set. Association rule mining is a technique for discovering relationships between variables in a large data set. This example considers items in a store that are commonly purchased together. In addition to market basket analysis, association rules are also used in bioinformatics, web analytics, and other fields.

The example analyzes purchase information for seven transactions that are stored in a table with the MADlib function MADlib.assoc_rules. The function assumes that the data is stored in two columns with a single item and transaction ID per row. Transactions with multiple items consist of multiple rows with one row per item.

These commands create the table.

CREATE TABLE test_data (
   trans_id INT,
   product text

This INSERT command add the data to the table.

   (1, 'beer'),
   (1, 'diapers'),
   (1, 'chips'),
   (2, 'beer'),
   (2, 'diapers'),
   (3, 'beer'),
   (3, 'diapers'),
   (4, 'beer'),
   (4, 'chips'),
   (5, 'beer'),
   (6, 'beer'),
   (6, 'diapers'),
   (6, 'chips'),
   (7, 'beer'),
   (7, 'diapers');

The MADlib function MADlib.assoc_rules() analyzes the data and determines association rules with the following characteristics.

  • A support value of at least .40. Support is the ratio of transactions that contain X to all transactions.
  • A confidence value of at least .75. Confidence is the ratio of transactions that contain X to transactions that contain Y. One could view this metric as the conditional probability of X given Y.

This SELECT command determines association rules, creates the table assoc_rules, and adds the statistics to the table.

SELECT * FROM madlib.assoc_rules (
   .40,          -- support
   .75,          -- confidence
   'trans_id',   -- transaction column
   'product',    -- product purchased column
   'test_data',  -- table name
   'public',     -- schema name
   false);       -- display processing details

This is the output of the SELECT command. There are two rules that fit the characteristics.

output_schema | output_table | total_rules | total_time 
public        | assoc_rules  |           2 | 00:00:01.153283 
(1 row)

To view the association rules, you can run this SELECT command.

select pre, post, support from assoc_rules 
   order by support desc;

This is the output. The pre and post columns are the itemsets of left and right hand sides of the association rule respectively.

    pre    |  post  |      support
 {diapers} | {beer} | 0.714285714285714
 {chips}   | {beer} | 0.428571428571429
(2 rows)

Based on the data, beer and diapers are often purchased together. To increase sales, you might consider placing beer and diapers closer together on the shelves.


MADlib web site is at

MADlib documentation is at

PivotalR is a first class R package that enables users to interact with data resident in Greenplum Database and MADLib using an R client.

About MADlib, R, and PivotalR

The R language is an open-source language that is used for statistical computing. PivotalR is an R package that enables users to interact with data resident in Greenplum Database using the R client. Using PivotalR requires that MADlib is installed on the Greenplum Database.

PivotalR allows R users to leverage the scalability and performance of in-database analytics without leaving the R command line. The computational work is executed in-database, while the end user benefits from the familiar R interface. Compared with respective native R functions, there is an increase in scalability and a decrease in running time. Furthermore, data movement, which can take hours for very large data sets, is eliminated with PivotalR.

Key features of the PivotalR package:
  • Explore and manipulate data in the database with R syntax. SQL translation is performed by PivotalR.
  • Use the familiar R syntax for predictive analytics algorithms, for example linear and logistic regression. PivotalR accesses the MADlib in-database analytics function calls.
  • Comprehensive documentation package with examples in standard R format accessible from an R client.
  • The PivotalR package also supports access to the MADlib functionality.

For information about PivotalR, including supported MADlib functionality, see

The R package for PivotalR can be found at