MVP for gprebalance utility

bimboterminator1 and others added 15 commits December 20, 2024 05:53
Implement cluster validation capability

This is the first commit toward an MVP of the new rebalance utility,
gprebalance. The utility is intended for the situation when a cluster is left
in an unbalanced state after a resize (expand or shrink). The balanced state
is defined very simply: if the number of segments per host is equal across all
hosts, the cluster is balanced. There are many other aspects of a proper
implementation of an optimal rebalance algorithm, which will be addressed in
subsequent patches.

This patch adds the skeleton of the future utility, providing initial
validation of rebalance feasibility. It includes checks that validate some
basic aspects: whether segments can be distributed uniformly and whether the
target mirroring strategy can be achieved. We decided to implement validation
through separate classes, which is a different approach from the gpexpand
utility. Some unit tests have also been added. Validation of available disk
space is not implemented, since it cannot be performed at this initial
validation step.
The gprebalance skeleton is complemented with additional options from the MVP
specification.
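A minimal sketch of the separate-validator-classes approach; class and method
names here are hypothetical, not the actual gprebalance API:

```python
from abc import ABC, abstractmethod


class RebalanceValidator(ABC):
    """One self-contained rebalance precondition check."""

    @abstractmethod
    def validate(self, cluster_config):
        """Raise an exception if the precondition is violated."""


class UniformDistributionValidator(RebalanceValidator):
    def validate(self, cluster_config):
        # Balanced state: an equal number of segments on every host.
        if cluster_config.num_primaries % cluster_config.num_hosts != 0:
            raise Exception("segments cannot be distributed uniformly")


def run_validations(cluster_config, validators):
    # Each check is a separate class, so new checks can be added
    # without touching the utility's main flow.
    for validator in validators:
        validator.validate(cluster_config)
```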
This code proposes the rebalance algorithm. GpRebalance.createPlan() returns a
Plan represented by a list of Moves. The algorithm itself produces an intuitive
greedy solution by manually setting the final balanced state.
The proposed code contains the main framework for rebalance execution. Some
options are not fully implemented and are expected to be finished in subsequent
tasks.

The code describes the following segment movement approach. First, we create a
movement plan: simple steps saying which segment moves to which host. The steps
in the plan can be of different kinds (modeled in the sketch below):

1. Mirror-only moves.
2. Both primary and mirror are moved to different hosts.
3. Primary-only moves.
4. Primary and mirror are swapped.
For each type of movement we determine the target directories and ports on the
target hosts that can accommodate the size of the moved segment. The DiskFree
and DiskUsage commands are used for that.

The movements, in their turn, are composite and imply extra actions, including
segment role switching.
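An illustrative model of the Plan/Move structures described above (a sketch;
the exact fields are assumptions rather than the real gprebalance code):

```python
from dataclasses import dataclass
from enum import Enum, auto


class MoveType(Enum):
    MIRROR_ONLY = auto()   # 1. only the mirror changes host
    PAIR = auto()          # 2. both primary and mirror change hosts
    PRIMARY_ONLY = auto()  # 3. only the primary changes host
    SWAP = auto()          # 4. primary and mirror exchange places


@dataclass
class Move:
    content_id: int        # content id of the segment being relocated
    move_type: MoveType
    target_host: str
    target_datadir: str    # chosen to be able to hold the segment's size
    target_port: int


# A Plan, as returned by GpRebalance.createPlan(), is an ordered list of Moves.
Plan = list
```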

1. Mirror-only moves use a single gprecoverseg call to perform the movement.
2. When a primary and mirror pair is moved, the strategy is as follows: the
mirror is first moved via gprecoverseg to the primary's target host, then the
roles are switched, and then the ex-primary (now a mirror) is moved to the
mirror's target host.
3. Primary-only moves imply two role switches: switch, move, switch.
4. A primary/mirror swap is executed similarly to the second type: the mirror
is moved to the primary's directory on its own host, the roles are switched,
and the ex-primary is moved to the mirror's directory on its own host.

The status management is written in a general way and may contain errors.
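The composite sequences can be summarized like this (a hedged sketch: SWITCH
and MOVE_* are abstract primitives, where each MOVE corresponds to a
gprecoverseg invocation and SWITCH to a role switch within the pair):

```python
# Note that 'pair' and 'swap' share the same primitive sequence and differ
# only in the target hosts/directories of the individual moves.
ACTION_SEQUENCES = {
    "mirror_only":  ["MOVE_MIRROR"],
    "pair":         ["MOVE_MIRROR", "SWITCH", "MOVE_NEW_MIRROR"],
    "primary_only": ["SWITCH", "MOVE_NEW_MIRROR", "SWITCH"],
    "swap":         ["MOVE_MIRROR", "SWITCH", "MOVE_NEW_MIRROR"],
}
```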

Cleanup is prepared by RekGRpth

Co-authored-by: Georgy Shelkovy <[email protected]>
This PR introduces the rollback handler in the gprebalance MVP. The rollback
function creates a new plan of movements by calculating the difference between
the current configuration and the original state loaded from the previously
pickled plan.
The changes in this patch provide the prototype for status tracking of mirror
moves during rebalance. First, this patch removes the usage of a gpdb table for
the whole execution status. Second, the status manager is rewritten to track
the execution process with a status file only. If a movement step, represented
by a gprecoverseg process, fails, the corresponding status (FAILED) is written
to the internal status struct first and then flushed to disk.
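A minimal sketch of this file-only status tracking (the file format and names
are assumptions made for illustration):

```python
import json


class StatusManager:
    def __init__(self, status_file):
        self.status_file = status_file
        self.statuses = {}                 # move id -> status string

    def mark(self, move_id, status):
        # Update the internal status struct first ...
        self.statuses[move_id] = status
        # ... then flush the whole state to the status file on disk.
        self.flush()

    def flush(self):
        with open(self.status_file, "w") as f:
            json.dump(self.statuses, f)


# On a failed gprecoverseg step:
#     status_manager.mark(move_id, "FAILED")
```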

Another main purpose of these changes is the determination of the gprecoverseg
outcome. The code in analyze_gprecoverseg_states() implements the SRS diagram
for gprecoverseg status determination. It handles the following scenarios:
1. A mirror move failed after pg_hba.conf had been updated at the primary. In
this case the primary marks the mirror as being down.
2. A mirror move failed after gp_segment_configuration had been updated. Here
our code tries to determine whether pg_basebackup was executed successfully or
not.

Depending on the basebackup state, the algorithm tries either to start up the
backed-up mirror or to roll back the configuration changes by recovering the
old mirror.
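An illustrative decision skeleton for these two scenarios (the predicates and
return values are hypothetical placeholders, not the real
analyze_gprecoverseg_states() code):

```python
def analyze_gprecoverseg_state(seg_pair):
    if seg_pair.mirror_marked_down():
        # Scenario 1: the move failed after pg_hba.conf had been updated;
        # the primary has already marked the mirror as down.
        return "MIRROR_DOWN"
    if seg_pair.configuration_updated():
        # Scenario 2: the move failed after gp_segment_configuration
        # had been updated; the outcome depends on pg_basebackup.
        if seg_pair.basebackup_completed():
            return "STARTUP_NEW_MIRROR"       # start the backed-up mirror
        return "ROLLBACK_AND_RECOVER_OLD"     # undo config, recover old one
    return "UNKNOWN"
```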
Problem description:
There were no means of providing a segment shrink feature in the 'gprebalance'
tool.

Fix:
Add a new command, 'ALTER TABLE <table_name> REBALANCE' (MVP level). Details:
1. 'ALTER TABLE <table_name> REBALANCE' supports an optional parameter, the
target number of segments (e.g. 'ALTER TABLE <table_name> REBALANCE 2;').
2. If the target number of segments is greater than the number of segments in
the table's distribution policy, the rebalance command invokes the existing
'ALTER TABLE <table_name> EXPAND TABLE' functionality (meaning the expand is
always done to the current number of segments in the cluster, even if a smaller
number was specified; see the sketch after this list).
3. If the target number of segments is less than the number of segments in the
table's distribution policy, the table is shrunk to the target number of
segments. For hash- or randomly-distributed tables, data from the excessive
segments is inserted into the target segments, and then for all table types the
distribution policy is updated to the target number of segments. Data on the
excessive segments is not removed (we do not want to spend time on it, as most
likely they will be excluded from the cluster soon anyway).
4. A new GUC, 'gp_target_numsegments', is added. If the target number of
segments is not specified for the 'ALTER TABLE <table_name> REBALANCE' command,
the value of 'gp_target_numsegments' is used.
5. If 'gp_target_numsegments' is set, all new tables are created using this
number of segments.
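A sketch of the dispatch rule from points 2-4 (illustrative Python; the actual
implementation lives in the server's C code):

```python
def rebalance_action(policy_numsegments, target_numsegments=None,
                     gp_target_numsegments=None):
    if target_numsegments is None:
        # Point 4: fall back to the gp_target_numsegments GUC.
        target_numsegments = gp_target_numsegments
    if target_numsegments > policy_numsegments:
        # Point 2: reuse the ALTER TABLE ... EXPAND TABLE machinery.
        return "expand"
    if target_numsegments < policy_numsegments:
        # Point 3: move rows off the excessive segments, update the policy.
        return "shrink"
    return "noop"
```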
Commit 5b3f506 introduced the new command ALTER TABLE REBALANCE with shrink
support. The target number of segments (if not specified in the ALTER command)
is taken from the GP_POLICY_DEFAULT_NUMSEGMENTS() macro. Therefore, we need a
way to set and maintain the creation number of segments across all backends.

This patch introduces a mechanism for managing the default number of segments
used in table creation during a rebalance operation in GPDB. A new shared
variable, gp_create_table_rebalance_numsegments, is introduced in gpexpand.h to
track the number of segments to use during table creation while a rebalancing
operation is in progress. The shared variable is initialized in shared memory
with the appropriate size and getter functionality.

Corresponding SQL functions are created in the gp_toolkit extension. The system
now checks whether a rebalancing operation is active by verifying locks before
allowing modifications to the number of segments. If the lock has not been
acquired in the current transaction (indicating that no rebalancing is
underway), an appropriate error message is returned.

Tests from 5b3f506 are updated to support the new functionality.

The gp_debug_numsegments extension preserves its behaviour, but we disallow
modifying the local numsegments value when
gp_create_table_rebalance_numsegments is set.
This patch implements a state machine skeleton for a basic shrink scenario
based on the 'transitions' library. It consists of a new 'ggrebalance' tool,
which will be a single entry point for the shrink, expand, and cluster
rebalance functionality, and 'shrink.py', which contains the state machine
itself with the shrink logic.

The main purpose of this half-MVP is to evaluate the suitability of the state
machine pattern. Therefore it implements only a limited set of requirements for
the shrink, enough to support the basic shrink workflow.
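A minimal sketch of a 'transitions'-based shrink machine (the state and
trigger names below are illustrative, not the actual ones from shrink.py):

```python
from transitions import Machine


class ShrinkStateMachine:
    states = ["start", "catalog_backed_up", "tables_rebalanced", "done"]

    transitions = [
        {"trigger": "backup_catalog", "source": "start",
         "dest": "catalog_backed_up"},
        {"trigger": "rebalance_tables", "source": "catalog_backed_up",
         "dest": "tables_rebalanced"},
        {"trigger": "stop_segments", "source": "tables_rebalanced",
         "dest": "done"},
    ]

    def __init__(self):
        self.machine = Machine(model=self, states=self.states,
                               transitions=self.transitions,
                               initial="start")


sm = ShrinkStateMachine()
sm.backup_catalog()   # start -> catalog_backed_up
print(sm.state)       # 'catalog_backed_up'
```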
This patch adds a check for a probable scenario where the cluster could be
restarted during an interruption of ggrebalance. In this case the shared
variable gp_rebalance_numsegments is unset, and a new table may be created with
the old segment count. Thus, while recovering the shrink process, the
STATE_CHECK_PREVIOUS_RUN callback calls the get_state_after_interrupt()
function, which checks for the mentioned situation. If the cluster has been
restarted, the state machine executes a transition to the
STATE_BACKUP_CATALOG_AND_UPDATE_TARGET_SEGMENT_COUNT_STARTED state.

The interface for the gp_rebalance_numsegments variable is extended with the
gp_rebalance_numsegments_is_set() SQL function to provide a convenient way to
monitor the variable's status. Before that, a comparison with the INT_MAX value
was required.
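For example, monitoring could look like this (a sketch assuming gppylib's
dbconn helpers; the exact helper name differs between GPDB versions, e.g.
querySingleton vs execSQLForSingleton):

```python
from gppylib.db import dbconn

conn = dbconn.connect(dbconn.DbURL())
try:
    # Previously this required comparing the raw value with INT_MAX.
    is_set = dbconn.querySingleton(
        conn, "SELECT gp_rebalance_numsegments_is_set()")
    print("rebalance numsegments is set:", is_set)
finally:
    conn.close()
```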

Additionally, the fault injection interface was returned to the behave tests to
cause workflow interruptions. The behave test utility code was also adjusted to
support some of the shrink scenarios. The code related to table population was
fixed to make it follow the declared semantics. The gpaddmirrors test is
updated as well.

Co-authored-by: Roman Eskin <[email protected]>
In this patch:
1. A new option, '--clean', is added for cluster shrink in the ggrebalance
tool.
2. A new option, '--rollback', is added for cluster shrink in the ggrebalance
tool.
3. A new option, '--non-interactive-mode', is added to the ggrebalance tool. It
is essential for auto testing of some cleanup scenarios that would otherwise
expect user confirmation.
4. As the existing 'main' and the new 'rollback' shrink workflows use similar
functionality, the shrink code is reorganized to reduce code duplication:
a. New functions used in both the 'main' and 'rollback' workflows are
introduced (such as 'prepare_shrink_schema()' and 'rebalance_tables()').
b. All logic related to ggrebalance schema handling is moved to a separate
class named 'RebalanceSchema' in 'rebalance_commons.py'.
5. A new entity, 'Plan', is added. It is used to pass information about the
required shrink configuration of the target cluster to the shrink engine. We
store it in the rebalance schema and use it for the 'rollback' workflow and
when we recover from an interrupted shrink state. It is added for the following
reasons:
a. As already stated above, we need it during rollback. When the user starts
the rollback operation, they do not specify the target segment count that was
used in the preceding shrink operation, so we need to store this information at
shrink time for later use.
b. When the user tries to re-enter the shrink procedure from an interrupted
state, we need to restart with the same target segment count that was specified
originally. Otherwise we may get the cluster into an invalid configuration
where tables are shrunk to different segment counts. Giving the user the
ability to specify the target segment count for the re-entered launch opens the
way for such error-prone scenarios. So we simply forbid specifying the segment
count configuration when re-entering an interrupted state or starting a
rollback, and use the saved plan information recorded at the very first
operation start.
c. According to the current design, at a later phase we will introduce a
Planner entity that will perform planning for all shrink/expand/rebalance
operations, and its output Plan will be the input to the shrink engine. So this
change is aligned with the overall design.
6. New behave test cases are added. The test cases cover not only the 'cleanup'
and 'rollback' flows, but also the existing 'main' shrink flow, as we cannot
guarantee the correctness of rollback without proving the 'main' flow works OK.
The existing test case is renamed to 'test 2.4' and moved next to the new tests
that cover similar functionality.
7. New steps used to verify that the shrunk segments are actually down are
added to mgmt_utils.py. A small change is also made in 'SegmentIsShutDown'; it
is required to check that the mirror is down.
8. To recover properly if we are interrupted in the middle of stopping shrunk
segments, a new class, 'SegmentStopAfterShrink', is introduced (see the sketch
after this list). It wraps 'SegmentStop' with a check of whether the segment is
actually still running. Without it, if shrink is re-entered and some segments
were already shut down by the preceding interrupted launch, we would get an
error when trying to shut down such segments.
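A sketch of the SegmentStopAfterShrink idea from point 8 (is_segment_running()
is a hypothetical helper; e.g. it could test for a live postmaster.pid in the
segment's data directory):

```python
import os


def is_segment_running(datadir):
    # Crude illustration: a running segment keeps a postmaster.pid file.
    return os.path.exists(os.path.join(datadir, "postmaster.pid"))


class SegmentStopAfterShrink:
    """Wraps a SegmentStop command with a liveness check, so re-entering
    after an interrupt silently skips already-stopped segments."""

    def __init__(self, stop_cmd, datadir):
        self.stop_cmd = stop_cmd       # the wrapped SegmentStop command
        self.datadir = datadir

    def run(self):
        if not is_segment_running(self.datadir):
            return                     # stopped by an earlier launch
        self.stop_cmd.run()
```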
This patch adds the foundations of the shrink/rebalance planner. Some extra
planning details and proper integration of the planning stage into the
ggrebalance state machine are going to be considered in separate tickets.

The main feature of the provided code is an abstract balancing algorithm, which
performs manual primary/mirror host assignment following a greedy strategy. In
short, the algorithm consists of several phases:

1) Primary assignment (sketched below). Sort segments by relocation priority:
first the must-move segments, i.e. those residing on decommissioned hosts,
encoded in initial_primary as indexes >= n_target_hosts; then segments moving
from overloaded to underloaded hosts. Assign each segment to the least-loaded
host, preferring the original placement when possible.

2) Mirror assignment. Built according to simple logic: prefer the original
mirror hosts, otherwise use the least-loaded mirror hosts.

3) Optional improvement. Uses adaptive large neighborhood search, where we try
to build nearby solutions by destroying and reassigning parts of the initial
one. Quite volatile, but in some cases it can bring a better solution. It is
proposed for use in the ggrebalance utility. Reentrancy can be achieved by
saving the first plan into the database.
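An abstract sketch of phase 1, the greedy primary assignment (data shapes are
assumptions made for illustration):

```python
def assign_primaries(segments, n_target_hosts):
    """segments: list of (seg_id, initial_host); hosts with index
    >= n_target_hosts are decommissioned, so their segments must move."""
    # Relocation priority: must-move segments first (key False < True).
    ordered = sorted(segments, key=lambda s: s[1] < n_target_hosts)
    target = len(segments) // n_target_hosts   # balanced per-host count
    load = [0] * n_target_hosts
    assignment = {}
    for seg_id, host in ordered:
        # Prefer the original placement while the host is not overloaded.
        if host < n_target_hosts and load[host] < target:
            chosen = host
        else:
            chosen = min(range(n_target_hosts), key=load.__getitem__)
        assignment[seg_id] = chosen
        load[chosen] += 1
    return assignment


# Four segments, two target hosts; segments on host 2 must move.
print(assign_primaries([(0, 0), (1, 0), (2, 2), (3, 2)], 2))
```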

Unit tests are moved from gppylib into gprebalance_modules in order to achieve
better test granularity and the ability to import separate modules.
This patch implements the following changes:

1. Support for IP addresses in 'target-hosts, add-hosts, remove-hosts' is
added. Their validation requires hostname resolution, so the HostResolver()
class is added in rebalance_commons.py. Without validation we may face the case
where an IP address passed through the options corresponds to an existing host
but is interpreted by ggrebalance as a new one.

2. Support for hosts files is added.

3. The handling of target directories is reworked. The TemplateParser() class
is added to support several placeholders. Now, if the 'target-datadirs' option
is not passed, all moves choose the default template directories as targets.

4. Port planning is added in a simple form (since network communication would
be overhead here) via the PortAllocator() class. It forms per-host,
per-segment-type port patterns and assigns them incrementally to moves (see the
sketch after this list).

5. Storage estimation is implemented. The DiskUsage and DiskFree commands are
used. The source data directories and tablespaces are taken into account, and
validation of the available space is provided. Main data directories and
tablespaces are validated against the available disk space on the corresponding
filesystems.

Corresponding unit tests are added for basic scenarios.
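A simplified sketch of the PortAllocator idea from point 4 (the base port
values are illustrative defaults, not the real ones):

```python
from collections import defaultdict


class PortAllocator:
    BASE_PORTS = {"primary": 6000, "mirror": 7000}   # assumed bases

    def __init__(self):
        self._next = defaultdict(int)   # (host, segment_type) -> offset

    def allocate(self, host, segment_type):
        # Per host and per segment type, ports are handed out incrementally.
        offset = self._next[(host, segment_type)]
        self._next[(host, segment_type)] += 1
        return self.BASE_PORTS[segment_type] + offset


alloc = PortAllocator()
print(alloc.allocate("sdw1", "primary"))   # 6000
print(alloc.allocate("sdw1", "primary"))   # 6001
print(alloc.allocate("sdw2", "mirror"))    # 7000
```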