Accessing File-Based External Tables
External tables let you access external files as if they were regular database tables. They are often used to move data into and out of a Greenplum database.
To create an external table definition, you specify the format of your input files and the location of your external data sources. For information about input file formats, see Formatting Data Files.
Use one of the following protocols to access external table data sources:
- file:// accesses external data files on segment hosts that the Greenplum Database superuser (gpadmin) can access. See file:// Protocol.
- gpfdist:// points to a directory on the file host and serves external data files to all Greenplum Database segments in parallel. See gpfdist:// Protocol.
- gpfdists:// is the secure version of gpfdist. See gpfdists:// Protocol.
- gphdfs:// accesses files on a Hadoop Distributed File System (HDFS). See gphdfs:// Protocol.
The files can be stored on an Amazon EMR instance HDFS. See Using Amazon EMR with Greenplum Database installed on AWS.
- s3:// accesses files in an Amazon S3 bucket. See s3:// Protocol.
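As a minimal sketch of how the format and location come together in a definition, the statements below declare one readable external table served by gpfdist and one read through the file:// protocol. The host names (etlhost-1, seghost1), port, paths, table names, and columns are hypothetical placeholders, not values taken from this document.

```sql
-- Readable external table served by a gpfdist instance.
-- Assumes gpfdist is already running on etlhost-1, port 8081 (hypothetical).
CREATE EXTERNAL TABLE ext_expenses (
    name     text,
    date     date,
    amount   float4,
    category text
)
LOCATION ('gpfdist://etlhost-1:8081/*.txt')
FORMAT 'TEXT' (DELIMITER '|');

-- Readable external table over a CSV file on a segment host.
-- The file must be readable by the gpadmin user on seghost1 (hypothetical path).
CREATE EXTERNAL TABLE ext_expenses_csv (
    name     text,
    date     date,
    amount   float4,
    category text
)
LOCATION ('file://seghost1/data/external/expenses.csv')
FORMAT 'CSV' (HEADER);
```

The LOCATION clause selects the protocol and the data source, and the FORMAT clause describes how each file is parsed.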
The gphdfs:// and s3:// protocols are custom data access protocols, whereas the file://, gpfdist://, and gpfdists:// protocols are implemented internally in Greenplum Database. The custom and internal protocols differ in these ways:
- Custom protocols must be registered using the CREATE PROTOCOL command. The gphdfs:// protocol is preregistered when you install Greenplum Database. You can optionally register the s3:// protocol. (See Configuring and Using S3 External Tables.) Internal protocols are always present and cannot be unregistered.
- When a custom protocol is registered, a row is added to the pg_extprotocol catalog table to specify the handler functions that implement the protocol. The protocol's shared libraries must have been installed on all Greenplum Database hosts. The internal protocols have no additional libraries to install and they are not represented in the pg_extprotocol table.
- To grant users permissions on custom protocols, you use GRANT [SELECT | INSERT | ALL] ON PROTOCOL. To allow (or deny) users permissions on the internal protocols, you use CREATE ROLE or ALTER ROLE to add the CREATEEXTTABLE (or NOCREATEEXTTABLE) attribute to each user's role.
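The permission differences above can be sketched as follows. The role names etl_reader and etl_writer are hypothetical, and the example assumes the gphdfs protocol is registered as described. Depending on the Greenplum Database version, granting a custom protocol to a non-superuser may also require that the protocol was created as trusted; treat that as something to verify for your release.

```sql
-- Custom protocol: permissions are granted on the protocol object itself.
GRANT SELECT ON PROTOCOL gphdfs TO etl_reader;   -- may create readable external tables
GRANT INSERT ON PROTOCOL gphdfs TO etl_writer;   -- may create writable external tables

-- Internal protocol: permissions are a role attribute.
ALTER ROLE etl_reader CREATEEXTTABLE (type='readable', protocol='gpfdist');
ALTER ROLE etl_writer NOCREATEEXTTABLE;

-- Registered custom protocols appear in the catalog; internal protocols do not.
SELECT * FROM pg_extprotocol;
```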
External tables access external files from within the database as if they were regular database tables. External tables defined with the gpfdist/gpfdists, gphdfs, and s3 protocols use the resources of all Greenplum Database segments to load or unload data in parallel. The gphdfs protocol takes advantage of the parallel architecture of the Hadoop Distributed File System to access files on that system. The s3 protocol relies on Amazon Web Services (AWS) capabilities.
You can query external table data directly and in parallel using SQL, for example to select, join, or sort the data, and you can create views on external tables.
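For illustration, the queries below select from, join, and sort external table data, then wrap the result in a view. They assume a readable external table named ext_expenses (columns name, date, amount, category) and a regular table named categories already exist; both names and their columns are hypothetical.

```sql
-- Join external data with a regular table, then sort the result.
SELECT e.name, e.amount, c.description
FROM ext_expenses e
JOIN categories c ON c.category = e.category
WHERE e.amount > 100
ORDER BY e.amount DESC;

-- A view over an external table is re-evaluated against the
-- external source each time the view is queried.
CREATE VIEW large_expenses AS
SELECT name, date, amount
FROM ext_expenses
WHERE amount > 100;
```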
The steps for using external tables are:
- Define the external table.
  To use the s3 protocol, you must also configure Greenplum Database and enable the protocol. See s3:// Protocol.
- Do one of the following:
  - Start the Greenplum Database file server(s) when using the gpfdist or gpfdists protocols.
  - Verify that you have already set up the required one-time configuration for the gphdfs protocol.
  - Verify the Greenplum Database configuration for the s3 protocol.
- Place the data files in the correct locations.
- Query the external table with SQL commands.
Greenplum Database provides readable and writable external tables:
- Readable external tables for data loading. Readable external tables support basic extraction, transformation, and loading (ETL) tasks common in data warehousing. Greenplum Database segment instances read external table data in parallel to optimize large load operations. You cannot modify readable external tables.
- Writable external tables for data unloading. Writable external tables support:
  - Selecting data from database tables to insert into the writable external table.
  - Sending data to an application as a stream of data. For example, unload data from Greenplum Database and send it to an application that connects to another database or ETL tool to load the data elsewhere.
  - Receiving output from Greenplum parallel MapReduce calculations.
  Writable external tables allow only INSERT operations.
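The two directions can be sketched side by side. The example assumes a regular table named expenses with a column exp_id, a readable external table named ext_expenses with matching columns, and a gpfdist instance on etlhost-1 port 8081; all of these names are hypothetical.

```sql
-- Loading: read from a readable external table into a regular table.
INSERT INTO expenses
SELECT * FROM ext_expenses;

-- Unloading: define a writable external table with the same columns
-- as the source table, then insert into it.
CREATE WRITABLE EXTERNAL TABLE unload_expenses (LIKE expenses)
LOCATION ('gpfdist://etlhost-1:8081/expenses_out.txt')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (exp_id);

INSERT INTO unload_expenses
SELECT * FROM expenses;
```

In this sketch, INSERT is the only data-modifying statement issued against the writable table, and the readable table is used only as a query source.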
External tables can be file-based or web-based. External tables using the file:// protocol are read-only tables.
- Regular (file-based) external tables access static flat files. Regular external tables are rescannable: the data is static while the query runs.
- Web (web-based) external tables access dynamic data sources, either on a web server with the http:// protocol or by executing OS commands or scripts. Web external tables are not rescannable: the data can change while the query runs.
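For comparison with the file-based definitions, a web external table might be declared in either of the two ways sketched here. The URL, script path, table names, and columns are hypothetical placeholders.

```sql
-- Web external table reading from a web server over http://.
CREATE EXTERNAL WEB TABLE ext_web_expenses (
    name     text,
    date     date,
    amount   float4,
    category text
)
LOCATION ('http://intranet.example.com/expenses/2024.csv')
FORMAT 'CSV';

-- Web external table whose rows are the output of a script
-- run on each segment host.
CREATE EXTERNAL WEB TABLE ext_log_output (
    linenum int,
    message text
)
EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST
FORMAT 'TEXT' (DELIMITER '|');
```

Because the source can return different data on each access, the result of a query against either table may vary between runs.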
Dump and restore operate only on external and web external table definitions, not on the data sources.