MVP for gprebalance utility

bimboterminator1 and others added 15 commits December 20, 2024 05:53
Implement cluster validation capability

This is the first commit toward an MVP of the new rebalance utility,
gprebalance. The utility is intended for the situation when a cluster is left
in an unbalanced state after a resize (expand or shrink). The balanced state
is defined very simply: if the number of segments per host is equal across all
hosts, the cluster is balanced. There are many other aspects of a proper
implementation of an optimal rebalance algorithm, which will be addressed in
subsequent patches.

This patch adds the skeleton of the future utility, providing initial
validation of rebalance feasibility. It includes checks that validate some
basic aspects: whether segments can be distributed uniformly and whether the
target mirroring strategy can be achieved. We decided to implement validation
through separate classes, which is a different approach from the gpexpand
utility. Some unit tests have also been added. Validation of available disk
space is not implemented, since it cannot be performed at this initial
validation step.
The gprebalance skeleton is complemented with additional options from the MVP
specification.
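A minimal sketch of the separate-validator-classes approach; class and method
names here are hypothetical, not the actual gprebalance API:

```python
from abc import ABC, abstractmethod


class RebalanceValidator(ABC):
    """One self-contained rebalance precondition check."""

    @abstractmethod
    def validate(self, cluster_config):
        """Raise an exception if the precondition is violated."""


class UniformDistributionValidator(RebalanceValidator):
    def validate(self, cluster_config):
        # Balanced state: an equal number of segments on every host.
        if cluster_config.num_primaries % cluster_config.num_hosts != 0:
            raise Exception("segments cannot be distributed uniformly")


def run_validations(cluster_config, validators):
    # Each check is a separate class, so new checks can be added
    # without touching the utility's main flow.
    for validator in validators:
        validator.validate(cluster_config)
```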
This code proposes the rebalance algorithm. GpRebalance.createPlan() returns a
Plan represented by a list of Moves. The algorithm itself produces an intuitive
greedy solution by manually setting the final balanced state.
The proposed code contains the main framework for rebalance execution. Some
options are not fully implemented and are expected to be finished in subsequent
tasks.

The code describes the following segment movement approach. First, we create a
movement plan: simple steps saying which segment moves to which host. The steps
in the plan can be of different kinds (modeled in the sketch below):

1. Mirror-only moves.
2. Both primary and mirror are moved to different hosts.
3. Primary-only moves.
4. Primary and mirror are swapped.
For each type of movement we determine the target directories and ports on the
target hosts that can accommodate the size of the moved segment. The DiskFree
and DiskUsage commands are used for that.

The movements, in their turn, are composite and imply extra actions, including
segment role switching.
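An illustrative model of the Plan/Move structures described above (a sketch;
the exact fields are assumptions rather than the real gprebalance code):

```python
from dataclasses import dataclass
from enum import Enum, auto


class MoveType(Enum):
    MIRROR_ONLY = auto()   # 1. only the mirror changes host
    PAIR = auto()          # 2. both primary and mirror change hosts
    PRIMARY_ONLY = auto()  # 3. only the primary changes host
    SWAP = auto()          # 4. primary and mirror exchange places


@dataclass
class Move:
    content_id: int        # content id of the segment being relocated
    move_type: MoveType
    target_host: str
    target_datadir: str    # chosen to be able to hold the segment's size
    target_port: int


# A Plan, as returned by GpRebalance.createPlan(), is an ordered list of Moves.
Plan = list
```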

1. Mirror-only moves use a single gprecoverseg call to perform the movement.
2. When a primary and mirror pair is moved, the strategy is as follows: the
mirror is first moved via gprecoverseg to the primary's target host, then the
roles are switched, and then the ex-primary (now a mirror) is moved to the
mirror's target host.
3. Primary-only moves imply two role switches: switch, move, switch.
4. A primary/mirror swap is executed similarly to the second type: the mirror
is moved to the primary's directory on its own host, the roles are switched,
and the ex-primary is moved to the mirror's directory on its own host.

The status management is written in a general way and may contain errors.
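The composite sequences can be summarized like this (a hedged sketch: SWITCH
and MOVE_* are abstract primitives, where each MOVE corresponds to a
gprecoverseg invocation and SWITCH to a role switch within the pair):

```python
# Note that 'pair' and 'swap' share the same primitive sequence and differ
# only in the target hosts/directories of the individual moves.
ACTION_SEQUENCES = {
    "mirror_only":  ["MOVE_MIRROR"],
    "pair":         ["MOVE_MIRROR", "SWITCH", "MOVE_NEW_MIRROR"],
    "primary_only": ["SWITCH", "MOVE_NEW_MIRROR", "SWITCH"],
    "swap":         ["MOVE_MIRROR", "SWITCH", "MOVE_NEW_MIRROR"],
}
```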

Cleanup is prepared by RekGRpth

Co-authored-by: Georgy Shelkovy <[email protected]>
This PR introduces the rollback handler in the gprebalance MVP. The rollback
function creates a new plan of movements by calculating the difference between
the current configuration and the original state loaded from the previously
pickled plan.
The changes in this patch provide the prototype for status tracking of mirror
moves during rebalance. First, this patch removes the usage of a gpdb table for
the whole execution status. Second, the status manager is rewritten to track
the execution process with a status file only. If a movement step, represented
by a gprecoverseg process, fails, the corresponding status (FAILED) is written
to the internal status struct first and then flushed to disk.
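A minimal sketch of this file-only status tracking (the file format and names
are assumptions made for illustration):

```python
import json


class StatusManager:
    def __init__(self, status_file):
        self.status_file = status_file
        self.statuses = {}                 # move id -> status string

    def mark(self, move_id, status):
        # Update the internal status struct first ...
        self.statuses[move_id] = status
        # ... then flush the whole state to the status file on disk.
        self.flush()

    def flush(self):
        with open(self.status_file, "w") as f:
            json.dump(self.statuses, f)


# On a failed gprecoverseg step:
#     status_manager.mark(move_id, "FAILED")
```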

Another main purpose of these changes is the determination of the gprecoverseg
outcome. The code in analyze_gprecoverseg_states() implements the SRS diagram
for gprecoverseg status determination. It handles the following scenarios:
1. A mirror move failed after pg_hba.conf had been updated at the primary. In
this case the primary marks the mirror as being down.
2. A mirror move failed after gp_segment_configuration had been updated. Here
our code tries to determine whether pg_basebackup was executed successfully or
not.

Depending on the basebackup state, the algorithm tries either to start up the
backed-up mirror or to roll back the configuration changes by recovering the
old mirror.
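An illustrative decision skeleton for these two scenarios (the predicates and
return values are hypothetical placeholders, not the real
analyze_gprecoverseg_states() code):

```python
def analyze_gprecoverseg_state(seg_pair):
    if seg_pair.mirror_marked_down():
        # Scenario 1: the move failed after pg_hba.conf had been updated;
        # the primary has already marked the mirror as down.
        return "MIRROR_DOWN"
    if seg_pair.configuration_updated():
        # Scenario 2: the move failed after gp_segment_configuration
        # had been updated; the outcome depends on pg_basebackup.
        if seg_pair.basebackup_completed():
            return "STARTUP_NEW_MIRROR"       # start the backed-up mirror
        return "ROLLBACK_AND_RECOVER_OLD"     # undo config, recover old one
    return "UNKNOWN"
```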
Problem description:
There were no means of providing a segment shrink feature in the 'gprebalance'
tool.

Fix:
Add a new command, 'ALTER TABLE <table_name> REBALANCE' (MVP level). Details:
1. 'ALTER TABLE <table_name> REBALANCE' supports an optional parameter, the
target number of segments (e.g. 'ALTER TABLE <table_name> REBALANCE 2;').
2. If the target number of segments is greater than the number of segments in
the table's distribution policy, the rebalance command invokes the existing
'ALTER TABLE <table_name> EXPAND TABLE' functionality (meaning the expand is
always done to the current number of segments in the cluster, even if a smaller
number was specified; see the sketch after this list).
3. If the target number of segments is less than the number of segments in the
table's distribution policy, the table is shrunk to the target number of
segments. For hash- or randomly-distributed tables, data from the excessive
segments is inserted into the target segments, and then for all table types the
distribution policy is updated to the target number of segments. Data on the
excessive segments is not removed (we do not want to spend time on it, as most
likely they will be excluded from the cluster soon anyway).
4. A new GUC, 'gp_target_numsegments', is added. If the target number of
segments is not specified for the 'ALTER TABLE <table_name> REBALANCE' command,
the value of 'gp_target_numsegments' is used.
5. If 'gp_target_numsegments' is set, all new tables are created using this
number of segments.
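A sketch of the dispatch rule from points 2-4 (illustrative Python; the actual
implementation lives in the server's C code):

```python
def rebalance_action(policy_numsegments, target_numsegments=None,
                     gp_target_numsegments=None):
    if target_numsegments is None:
        # Point 4: fall back to the gp_target_numsegments GUC.
        target_numsegments = gp_target_numsegments
    if target_numsegments > policy_numsegments:
        # Point 2: reuse the ALTER TABLE ... EXPAND TABLE machinery.
        return "expand"
    if target_numsegments < policy_numsegments:
        # Point 3: move rows off the excessive segments, update the policy.
        return "shrink"
    return "noop"
```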
Commit 5b3f506 introduced the new command ALTER TABLE REBALANCE with shrink
support. The target number of segments (if not specified in the ALTER command)
is taken from the GP_POLICY_DEFAULT_NUMSEGMENTS() macro. Therefore, we need a
way to set and maintain the creation number of segments across all backends.

This patch introduces a mechanism for managing the default number of segments
used in table creation during a rebalance operation in GPDB. A new shared
variable, gp_create_table_rebalance_numsegments, is introduced in gpexpand.h to
track the number of segments to use during table creation while a rebalancing
operation is in progress. The shared variable is initialized in shared memory
with the appropriate size and getter functionality.

Corresponding SQL functions are created in the gp_toolkit extension. The system
now checks whether a rebalancing operation is active by verifying locks before
allowing modifications to the number of segments. If the lock has not been
acquired in the current transaction (indicating that no rebalancing is
underway), an appropriate error message is returned.

Tests from 5b3f506 are updated to support the new functionality.

The gp_debug_numsegments extension preserves its behaviour, but we disallow
modifying the local numsegments value when
gp_create_table_rebalance_numsegments is set.
This patch implements a state machine skeleton for a basic shrink scenario
based on the 'transitions' library. It consists of a new 'ggrebalance' tool,
which will be a single entry point for the shrink, expand, and cluster
rebalance functionality, and 'shrink.py', which contains the state machine
itself with the shrink logic.

The main purpose of this half-MVP is to evaluate the suitability of the state
machine pattern. Therefore it implements only a limited set of requirements for
the shrink, enough to support the basic shrink workflow.
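A minimal sketch of a 'transitions'-based shrink machine (the state and
trigger names below are illustrative, not the actual ones from shrink.py):

```python
from transitions import Machine


class ShrinkStateMachine:
    states = ["start", "catalog_backed_up", "tables_rebalanced", "done"]

    transitions = [
        {"trigger": "backup_catalog", "source": "start",
         "dest": "catalog_backed_up"},
        {"trigger": "rebalance_tables", "source": "catalog_backed_up",
         "dest": "tables_rebalanced"},
        {"trigger": "stop_segments", "source": "tables_rebalanced",
         "dest": "done"},
    ]

    def __init__(self):
        self.machine = Machine(model=self, states=self.states,
                               transitions=self.transitions,
                               initial="start")


sm = ShrinkStateMachine()
sm.backup_catalog()   # start -> catalog_backed_up
print(sm.state)       # 'catalog_backed_up'
```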
This patch adds a check for a probable scenario where the cluster could be
restarted during an interruption of ggrebalance. In this case the shared
variable gp_rebalance_numsegments is unset, and a new table may be created with
the old segment count. Thus, while recovering the shrink process, the
STATE_CHECK_PREVIOUS_RUN callback calls the get_state_after_interrupt()
function, which checks for the mentioned situation. If the cluster has been
restarted, the state machine executes a transition to the
STATE_BACKUP_CATALOG_AND_UPDATE_TARGET_SEGMENT_COUNT_STARTED state.

The interface for the gp_rebalance_numsegments variable is extended with the
gp_rebalance_numsegments_is_set() SQL function to provide a convenient way to
monitor the variable's status. Before that, a comparison with the INT_MAX value
was required.
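For example, monitoring could look like this (a sketch assuming gppylib's
dbconn helpers; the exact helper name differs between GPDB versions, e.g.
querySingleton vs execSQLForSingleton):

```python
from gppylib.db import dbconn

conn = dbconn.connect(dbconn.DbURL())
try:
    # Previously this required comparing the raw value with INT_MAX.
    is_set = dbconn.querySingleton(
        conn, "SELECT gp_rebalance_numsegments_is_set()")
    print("rebalance numsegments is set:", is_set)
finally:
    conn.close()
```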

Additionally, the fault injection interface was returned to the behave tests to
cause workflow interruptions. The behave test utility code was also adjusted to
support some of the shrink scenarios. The code related to table population was
fixed to make it follow the declared semantics. The gpaddmirrors test is
updated as well.

Co-authored-by: Roman Eskin <[email protected]>
In this patch:
1. A new option, '--clean', is added for cluster shrink in the ggrebalance
tool.
2. A new option, '--rollback', is added for cluster shrink in the ggrebalance
tool.
3. A new option, '--non-interactive-mode', is added to the ggrebalance tool. It
is essential for auto testing of some cleanup scenarios that would otherwise
expect user confirmation.
4. As the existing 'main' and the new 'rollback' shrink workflows use similar
functionality, the shrink code is reorganized to reduce code duplication:
a. New functions used in both the 'main' and 'rollback' workflows are
introduced (such as 'prepare_shrink_schema()' and 'rebalance_tables()').
b. All logic related to ggrebalance schema handling is moved to a separate
class named 'RebalanceSchema' in 'rebalance_commons.py'.
5. A new entity, 'Plan', is added. It is used to pass information about the
required shrink configuration of the target cluster to the shrink engine. We
store it in the rebalance schema and use it for the 'rollback' workflow and
when we recover from an interrupted shrink state. It is added for the following
reasons:
a. As already stated above, we need it during rollback. When the user starts
the rollback operation, they do not specify the target segment count that was
used in the preceding shrink operation, so we need to store this information at
shrink time for later use.
b. When the user tries to re-enter the shrink procedure from an interrupted
state, we need to restart with the same target segment count that was specified
originally. Otherwise we may get the cluster into an invalid configuration
where tables are shrunk to different segment counts. Giving the user the
ability to specify the target segment count for the re-entered launch opens the
way for such error-prone scenarios. So we simply forbid specifying the segment
count configuration when re-entering an interrupted state or starting a
rollback, and use the saved plan information recorded at the very first
operation start.
c. According to the current design, at a later phase we will introduce a
Planner entity that will perform planning for all shrink/expand/rebalance
operations, and its output Plan will be the input to the shrink engine. So this
change is aligned with the overall design.
6. New behave test cases are added. The test cases cover not only the 'cleanup'
and 'rollback' flows, but also the existing 'main' shrink flow, as we cannot
guarantee the correctness of rollback without proving the 'main' flow works OK.
The existing test case is renamed to 'test 2.4' and moved next to the new tests
that cover similar functionality.
7. New steps used to verify that the shrunk segments are actually down are
added to mgmt_utils.py. A small change is also made in 'SegmentIsShutDown'; it
is required to check that the mirror is down.
8. To recover properly if we are interrupted in the middle of stopping shrunk
segments, a new class, 'SegmentStopAfterShrink', is introduced (see the sketch
after this list). It wraps 'SegmentStop' with a check of whether the segment is
actually still running. Without it, if shrink is re-entered and some segments
were already shut down by the preceding interrupted launch, we would get an
error when trying to shut down such segments.
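A sketch of the SegmentStopAfterShrink idea from point 8 (is_segment_running()
is a hypothetical helper; e.g. it could test for a live postmaster.pid in the
segment's data directory):

```python
import os


def is_segment_running(datadir):
    # Crude illustration: a running segment keeps a postmaster.pid file.
    return os.path.exists(os.path.join(datadir, "postmaster.pid"))


class SegmentStopAfterShrink:
    """Wraps a SegmentStop command with a liveness check, so re-entering
    after an interrupt silently skips already-stopped segments."""

    def __init__(self, stop_cmd, datadir):
        self.stop_cmd = stop_cmd       # the wrapped SegmentStop command
        self.datadir = datadir

    def run(self):
        if not is_segment_running(self.datadir):
            return                     # stopped by an earlier launch
        self.stop_cmd.run()
```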
This patch adds the foundations of the shrink/rebalance planner. Some extra
planning details and proper integration of the planning stage into the
ggrebalance state machine are going to be considered in separate tickets.

The main feature of the provided code is an abstract balancing algorithm, which
performs manual primary/mirror host assignment following a greedy strategy. In
short, the algorithm consists of several phases:

1) Primary assignment (sketched below). Sort segments by relocation priority:
first the must-move segments, i.e. those residing on decommissioned hosts,
encoded in initial_primary as indexes >= n_target_hosts; then segments moving
from overloaded to underloaded hosts. Assign each segment to the least-loaded
host, preferring the original placement when possible.

2) Mirror assignment. Built according to simple logic: prefer the original
mirror hosts, otherwise use the least-loaded mirror hosts.

3) Optional improvement. Uses adaptive large neighborhood search, where we try
to build nearby solutions by destroying and reassigning parts of the initial
one. Quite volatile, but in some cases it can bring a better solution. It is
proposed for use in the ggrebalance utility. Reentrancy can be achieved by
saving the first plan into the database.
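An abstract sketch of phase 1, the greedy primary assignment (data shapes are
assumptions made for illustration):

```python
def assign_primaries(segments, n_target_hosts):
    """segments: list of (seg_id, initial_host); hosts with index
    >= n_target_hosts are decommissioned, so their segments must move."""
    # Relocation priority: must-move segments first (key False < True).
    ordered = sorted(segments, key=lambda s: s[1] < n_target_hosts)
    target = len(segments) // n_target_hosts   # balanced per-host count
    load = [0] * n_target_hosts
    assignment = {}
    for seg_id, host in ordered:
        # Prefer the original placement while the host is not overloaded.
        if host < n_target_hosts and load[host] < target:
            chosen = host
        else:
            chosen = min(range(n_target_hosts), key=load.__getitem__)
        assignment[seg_id] = chosen
        load[chosen] += 1
    return assignment


# Four segments, two target hosts; segments on host 2 must move.
print(assign_primaries([(0, 0), (1, 0), (2, 2), (3, 2)], 2))
```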

Unit tests are moved from gppylib into gprebalance_modules in order to achieve
better test granularity and the ability to import separate modules.
This patch implements the following changes:

1. Support for IP addresses in 'target-hosts, add-hosts, remove-hosts' is
added. Their validation requires hostname resolution, so the HostResolver()
class is added in rebalance_commons.py. Without validation we may face the case
where an IP address passed through the options corresponds to an existing host
but is interpreted by ggrebalance as a new one.

2. Support for hosts files is added.

3. The handling of target directories is reworked. The TemplateParser() class
is added to support several placeholders. Now, if the 'target-datadirs' option
is not passed, all moves choose the default template directories as targets.

4. Port planning is added in a simple form (since network communication would
be overhead here) via the PortAllocator() class. It forms per-host,
per-segment-type port patterns and assigns them incrementally to moves (see the
sketch after this list).

5. Storage estimation is implemented. The DiskUsage and DiskFree commands are
used. The source data directories and tablespaces are taken into account, and
validation of the available space is provided. Main data directories and
tablespaces are validated against the available disk space on the corresponding
filesystems.

Corresponding unit tests are added for basic scenarios.
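A simplified sketch of the PortAllocator idea from point 4 (the base port
values are illustrative defaults, not the real ones):

```python
from collections import defaultdict


class PortAllocator:
    BASE_PORTS = {"primary": 6000, "mirror": 7000}   # assumed bases

    def __init__(self):
        self._next = defaultdict(int)   # (host, segment_type) -> offset

    def allocate(self, host, segment_type):
        # Per host and per segment type, ports are handed out incrementally.
        offset = self._next[(host, segment_type)]
        self._next[(host, segment_type)] += 1
        return self.BASE_PORTS[segment_type] + offset


alloc = PortAllocator()
print(alloc.allocate("sdw1", "primary"))   # 6000
print(alloc.allocate("sdw1", "primary"))   # 6001
print(alloc.allocate("sdw2", "mirror"))    # 7000
```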