Python Data Science Module Package
Greenplum Database provides a collection of data science-related Python modules that can be used with the Greenplum Database PL/Python language. You can download these modules in .gppkg format from Pivotal Network.
This section contains the following information:
- Python Data Science Modules
- Installing the Python Data Science Module Package
- Uninstalling the Python Data Science Module Package
For information about the Greenplum Database PL/Python Language, see Greenplum PL/Python Language Extension.
Python Data Science Modules
Module Name | Description/Used For |
---|---|
Beautiful Soup | Navigating HTML and XML |
Gensim | Topic modeling and document indexing |
Keras (RHEL/CentOS 7 only) | Deep learning |
Lifelines | Survival analysis |
lxml | XML and HTML processing |
NLTK | Natural language toolkit |
NumPy | Scientific computing |
Pandas | Data analysis |
Pattern-en | Part-of-speech tagging |
pyLDAvis | Interactive topic model visualization |
PyMC3 | Statistical modeling and probabilistic machine learning |
scikit-learn | Machine learning, data mining, and analysis |
SciPy | Scientific computing |
spaCy | Large scale natural language processing |
StatsModels | Statistical modeling |
TensorFlow (RHEL/CentOS 7 only) | Numerical computation using data flow graphs |
XGBoost | Gradient boosting, classifying, ranking |
Installing the Python Data Science Module Package
Before you install the Python Data Science Module package, make sure that Greenplum Database is running, that you have sourced greenplum_path.sh, and that the $MASTER_DATA_DIRECTORY and $GPHOME environment variables are set.
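If the tk operating system package is not already installed on the hosts in your cluster, install it. For example:
$ yum install tk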
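The environment prerequisites above can be checked with a short script before you run gppkg. The following is a minimal sketch, not a Greenplum utility: the function name check_gp_env is hypothetical, and it only tests that the two variables named in this section are set and non-empty.

```shell
# Hypothetical pre-flight helper; not part of Greenplum Database.
# Prints "missing: NAME" for each required variable that is unset or
# empty, and returns non-zero if any variable is missing.
check_gp_env() {
    missing=0
    for name in GPHOME MASTER_DATA_DIRECTORY; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_gp_env after sourcing greenplum_path.sh. The sketch does not verify that Greenplum Database is running; that check is left to a utility such as gpstate.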
- Locate the Python Data Science module package that you built or downloaded.
The file name format of the package is DataSciencePython-<version>-rhel<N>-x86_64.gppkg.
- Copy the package to the Greenplum Database master host.
- Use the gppkg command to install the package. For example:
$ gppkg -i DataSciencePython-<version>-rhel<N>-x86_64.gppkg
gppkg installs the Python Data Science modules on all nodes in your Greenplum Database cluster. The command also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file.
- Restart Greenplum Database. You must re-source greenplum_path.sh before restarting your Greenplum cluster:
$ source /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r
The Greenplum Database Python Data Science Modules are installed in the following directory:
$GPHOME/ext/DataSciencePython/lib/python2.7/site-packages/
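One way to confirm that a module landed in this directory is to look for it on disk. The sketch below builds the path stated above from $GPHOME and tests for a module directory. Both function names are hypothetical, and the check assumes the module is installed as a directory with exactly the name you pass (for example numpy), which may not hold for every module in the package.

```shell
# Hypothetical helper: print the module install directory described in
# this section. Uses $GPHOME unless a different root is passed as $1.
ds_site_packages() {
    echo "${1:-$GPHOME}/ext/DataSciencePython/lib/python2.7/site-packages"
}

# Hypothetical helper: succeed if a directory named $1 exists in the
# install directory. The optional $2 overrides $GPHOME as the root.
ds_module_present() {
    [ -d "$(ds_site_packages "$2")/$1" ]
}
```

For example, ds_module_present numpy returns success when $GPHOME/ext/DataSciencePython/lib/python2.7/site-packages/numpy exists on the host where you run it.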
Uninstalling the Python Data Science Module Package
Use the gppkg utility to uninstall the Python Data Science Module package. You must include the version number in the package name you provide to gppkg.
To determine your Python Data Science Module package version number and remove this package:
$ gppkg -q --all | grep DataSciencePython
DataSciencePython-<version>
$ gppkg -r DataSciencePython-<version>
The command removes the Python Data Science modules from your Greenplum Database cluster. It also updates the PYTHONPATH, PATH, and LD_LIBRARY_PATH environment variables in your greenplum_path.sh file to their pre-installation values.
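The lookup and removal can be combined so that the version number does not have to be copied by hand. This is a minimal sketch under one assumption: that the query output contains a line beginning with DataSciencePython-<version>, as in the example above. The function name ds_package_name is hypothetical.

```shell
# Hypothetical helper: read a package listing on stdin and print the
# first line that starts with "DataSciencePython-". Prints nothing if
# no such line exists.
ds_package_name() {
    grep '^DataSciencePython-' | head -n 1
}

# Intended use on the master host (left as comments so that defining
# the function has no side effects):
#   pkg=$(gppkg -q --all | ds_package_name)
#   [ -n "$pkg" ] && gppkg -r "$pkg"
```

Checking that the result is non-empty before calling gppkg -r avoids running the removal with an empty package name when the package is not installed.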
Re-source greenplum_path.sh and restart Greenplum Database after you remove the Python Data Science Module package:
$ . /usr/local/greenplum-db/greenplum_path.sh
$ gpstop -r