System Expansion Overview
System Expansion Overview
You can perform a Greenplum Database expansion to add segment instances and segment hosts with minimal downtime. In general, adding nodes to a Greenplum cluster achieves a linear scaling of performance and storage capacity.
Data warehouses typically grow over time, often at a continuous pace, as additional data is gathered and the retention period increases for existing data. At times, it is necessary to increase database capacity to consolidate disparate data warehouses into a single database. The data warehouse may also require additional computing capacity (CPU) to accommodate added analytics projects. It is good to provide capacity for growth when a system is initially specified, but even if you anticipate high rates of growth, it is generally unwise to invest in capacity long before it is required. Database expansion, therefore, is a project that you should expect to have to run periodically.
- Scalable capacity and performance. When you add resources to a Greenplum Database, the capacity and performance are the same as if the system had been originally implemented with the added resources.
- Uninterrupted service during expansion. Regular workloads, both scheduled and ad-hoc, are not interrupted.
- Transactional consistency.
- Fault tolerance. During the expansion, standard fault-tolerance mechanisms—such as segment mirroring—remain active, consistent, and effective.
- Replication and disaster recovery. Any existing replication mechanisms continue to function during expansion. Restore mechanisms needed in case of a failure or catastrophic event remain effective.
- Transparency of process. The expansion process employs standard Greenplum Database mechanisms, so administrators can diagnose and troubleshoot any problems.
- Configurable process. Expansion can be a long running process, but it can be fit into a schedule of ongoing operations. The expansion schema's tables allow administrators to prioritize the order in which tables are redistributed, and the expansion activity can be paused and resumed.
The planning and physical aspects of an expansion project are a greater share of the work than expanding the database itself. It will take a multi-discipline team to plan and run the project. For on-premise installations, space must be acquired and prepared for the new servers. The servers must be specified, acquired, installed, cabled, configured, and tested. For cloud deployments, similar plans should also be made. Planning New Hardware Platforms describes general considerations for deploying new hardware.
After you provision the new hardware platforms and set up their networks, configure the operating systems and run performance tests using Greenplum utilities. The Greenplum Database software distribution includes utilities that are helpful to test and burn-in the new servers before beginning the software phase of the expansion. See Preparing and Adding Hosts for steps to prepare the new hosts for Greenplum Database.
- The first step of the software phase of expansion process is preparing the Greenplum
Database system: adding new segment hosts and initializing new segment instances. This
phase can be scheduled to occur during a period of low activity to avoid disrupting
ongoing business operations. During the initialization process, the following tasks are
- Greenplum Database software is installed.
- Databases and database objects are created in the new segment instances on the new segment hosts.
- The gpexpand schema is created in the postgres database. You can use the tables and view in the schema to monitor and control the expansion process.
After the system has been updated, the new segment instances on the new segment hosts are available.
- New segments are immediately available and participate in new queries and data loads. The existing data, however, is skewed. It is concentrated on the original segments and must be redistributed across the new total number of primary segments.
- Because some of the table data is skewed, some queries might be less efficient because more data motion operations might be needed.
- The last step of the software phase is redistributing table data. Using the expansion
control tables in the gpexpand schema as a guide, tables are redistributed. For
- The gpexand utility redistributes the table data, across all of the servers, old and new, according to the distribution policy.
- The table's status is updated in the expansion control tables.
- After data redistribution, the query optimizer creates more efficient execution plans when data is not skewed.
When all tables have been redistributed, the expansion is complete.
Redistributing table data is a long-running process that creates a large volume of network and disk activity. It can take days to redistribute some very large databases. To minimize the effects of the increased activity on business operations, system administrators can pause and resume expansion activity on an ad hoc basis, or according to a predetermined schedule. Datasets can be prioritized so that critical applications benefit first from the expansion.
In a typical operation, you run the gpexpand utility four times with different options during the complete expansion process.
- To create an expansion input
gpexpand -f hosts_file
- To initialize segments
and create the expansion
gpexpand -i input_file
gpexpand creates a data directory, copies user tables from all existing databases on the new segments, and captures metadata for each table in an expansion schema for status tracking. After this process completes, the expansion operation is committed and irrevocable.
To redistribute table
gpexpand -d duration
During initialization, gpexpand adds and initializes new segment instances. To complete system expansion, you must run gpexpand to redistribute data tables across the newly added segment instances. Depending on the size and scale of your system, redistribution can be accomplished in a single session during low-use hours, or you can divide the process into batches over an extended period. Each table or partition is unavailable for read or write operations during redistribution. As each table is redistributed across the new segments, database performance should incrementally improve until it exceeds pre-expansion performance levels.
You may need to run gpexpand several times to complete the expansion in large-scale systems that require multiple redistribution sessions. gpexpand can benefit from explicit table redistribution ranking; see Planning Table Redistribution.
Users can access Greenplum Database during initialization, but they may experience performance degradation on systems that rely heavily on hash distribution of tables. Normal operations such as ETL jobs, user queries, and reporting can continue, though users might experience slower response times.
- To remove the expansion schema:
For information about the gpexpand utility and the other utilities that are used for system expansion, see the Greenplum Database Utility Guide.